arxiv:2401.12187

WARM: On the Benefits of Weight Averaged Reward Models

Published on Jan 22, 2024
· Submitted by akhaliq on Jan 23, 2024

Abstract

Aligning large language models (LLMs) with human preferences through reinforcement learning (RLHF) can lead to reward hacking, where LLMs exploit failures in the reward model (RM) to achieve seemingly high rewards without meeting the underlying objectives. We identify two primary challenges when designing RMs to mitigate reward hacking: distribution shifts during the RL process and inconsistencies in human preferences. As a solution, we propose Weight Averaged Reward Models (WARM), first fine-tuning multiple RMs, then averaging them in the weight space. This strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. By averaging weights, WARM improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.
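
The weight-averaging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy reward models below are linear, their parameters are stored as plain dictionaries, and the names `average_weights` and `reward` are hypothetical. For a linear model, averaging weights gives exactly the same reward as averaging predictions; for fine-tuned neural RMs the two only approximately agree, which is where the linear mode connectivity observation comes in.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Uniformly average the parameters of several models, key by key."""
    if not models:
        raise ValueError("need at least one model to average")
    averaged: Weights = {}
    for key in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[key] for m in models))
        averaged[key] = [sum(col) / len(models) for col in columns]
    return averaged


def reward(weights: Weights, features: List[float]) -> float:
    """Toy linear reward model: r(x) = w . x + b (an assumption for this sketch)."""
    dot = sum(w * x for w, x in zip(weights["w"], features))
    return dot + weights["b"][0]


# Three toy "fine-tuned" reward models with identical parameter shapes.
rms = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 0.0], "b": [1.0]},
    {"w": [2.0, 1.0], "b": [2.0]},
]

warm = average_weights(rms)  # a single model, so one forward pass at inference
x = [1.0, 1.0]
warm_reward = reward(warm, x)
# Prediction ensembling needs one forward pass per member.
ensemble_reward = sum(reward(m, x) for m in rms) / len(rms)
```

The efficiency argument is visible even in this toy setting: `warm` is scored with one call to `reward`, while the prediction ensemble calls it once per member.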

Community

How is the "Control Reward" calculated?
I was unable to locate a definition for this term.

Hi, author here. Thank you for the question. The answer is in the first paragraph of Section 5, where we state: "we leverage a PaLM-XS RM for pointwise control reward reaching 80.1% accuracy on the OOD dataset. As verified in our experiments, this control RM also detects hacking, as it benefits from a larger architecture and a disjoint pretraining compared to the PaLM-XXS RMs of interest". In other words, the control reward is also an RM, trained on the same dataset, but with a larger architecture and a different pretraining. This pointwise control reward enables plotting absolute scores, and the observations are consistent with the "pairwise oracle preference metric".

How is the "pairwise oracle preference metric" calculated?

Paper author

The pairwise oracle preference metric is described in the first paragraph of Section 5, and further detailed in Appendix B.2. Roughly, we follow the best AI labelling approach from RLAIF (https://arxiv.org/abs/2309.00267).

So in Figure 7, is the win or loss labeled by the PaLM-XS RM (i.e., the control reward) rather than by GPT-4 or humans?

Paper author

Thanks for the question. No, in Figure 7 the win rate is actually computed with the oracle preference metric, i.e., the AI labelling approach from RLAIF with a "PaLM-L model prompted with chain-of-thought".
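
For readers unfamiliar with how a win rate is derived from pairwise labels, here is a minimal sketch. It assumes each comparison has already been labeled `"a"` or `"b"` by some preference labeler; the label values and the `win_rate` helper are hypothetical, and details such as tie handling or how the PaLM-L labeler is prompted are not covered by this thread and are not modeled here.

```python
from typing import List


def win_rate(preferences: List[str]) -> float:
    """Fraction of pairwise comparisons in which candidate 'a' was preferred."""
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(1 for p in preferences if p == "a")
    return wins / len(preferences)


# Made-up labels for five comparisons between policy "a" and policy "b".
labels = ["a", "a", "b", "a", "b"]
rate = win_rate(labels)  # 3 wins out of 5 comparisons
```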
