arxiv:2501.00911

Aligning LLMs with Domain Invariant Reward Models

Published on Jan 1

Authors:

David Wu, Sanjiban Choudhury

Abstract

Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey domain-agnostic concepts that can be effectively captured by a reward model. We propose DIAL, a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between the source and target distributions, and a source loss that optimizes preferences on the source domain. We show that DIAL is a general approach by evaluating and analyzing it across 4 distinct settings: (1) Cross-lingual transfer (accuracy: 0.621 → 0.661), (2) Clean-to-noisy transfer (accuracy: 0.671 → 0.703), (3) Few-shot-to-full transfer (accuracy: 0.845 → 0.920), and (4) Simple-to-complex task transfer (correlation: 0.508 → 0.556). Our code, models, and data are available at https://github.com/portal-cornell/dial.
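The dual loss described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which divergence or preference objective is used, so the sketch assumes a Bradley-Terry preference loss for the source term and, as a simple stand-in for the domain term, the squared distance between the mean feature vectors of the two domains. All function names and the `domain_weight` parameter are hypothetical.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Source loss for one pair: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # Numerically stable softplus(-diff).
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)


def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_embedding_gap(source_feats, target_feats):
    """Assumed divergence proxy: squared distance between domain means."""
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


def dual_loss(source_pairs, source_feats, target_feats, domain_weight=1.0):
    """Combine the source preference loss with the domain divergence term.

    source_pairs: (reward_chosen, reward_rejected) tuples from the source domain.
    source_feats / target_feats: reward-model features for each domain
    (target features need no preference labels).
    """
    source_loss = sum(
        bradley_terry_loss(c, r) for c, r in source_pairs
    ) / len(source_pairs)
    domain_loss = mean_embedding_gap(source_feats, target_feats)
    total = source_loss + domain_weight * domain_loss
    return total, source_loss, domain_loss


if __name__ == "__main__":
    pairs = [(1.2, 0.3), (0.1, 0.4)]          # toy source-domain reward pairs
    src = [[0.0, 1.0], [2.0, 1.0]]            # toy source features
    tgt = [[1.5, 0.5], [0.5, 2.5]]            # toy unlabeled target features
    total, s_loss, d_loss = dual_loss(pairs, src, tgt, domain_weight=0.5)
    print(f"total={total:.4f} source={s_loss:.4f} domain={d_loss:.4f}")
```

In this sketch only the source term needs preference labels; the domain term uses unlabeled features from both domains, which matches the setting where target-domain preference data is unavailable.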
