akhaliq 
posted an update Apr 16, 2024
Learn Your Reference Model for Real Good Alignment (2404.09656)

The complexity of the alignment problem stems from the fact that existing methods are unstable. Researchers continuously invent various tricks to address this shortcoming. For instance, in the fundamental Reinforcement Learning From Human Feedback (RLHF) technique of Language Model alignment, in addition to reward maximization, the Kullback-Leibler divergence between the trainable policy and the SFT policy is minimized. This addition prevents the model from being overfitted to the Reward Model (RM) and generating texts that are out-of-domain for the RM. The Direct Preference Optimization (DPO) method reformulates the optimization task of RLHF and eliminates the Reward Model while tacitly maintaining the requirement for the policy to be close to the SFT policy. In our paper, we argue that this implicit limitation in the DPO method leads to sub-optimal results. We propose a new method called Trust Region DPO (TR-DPO), which updates the reference policy during training. With such a straightforward update, we demonstrate the effectiveness of TR-DPO against DPO on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by up to 19%, measured by automatic evaluation with GPT-4. The new alignment approach that we propose allows us to improve the quality of models across several parameters at once, such as coherence, correctness, level of detail, helpfulness, and harmlessness.
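To make the idea concrete, here is a minimal sketch of the DPO loss for a single preference pair, plus two possible ways of refreshing the reference policy during training. The abstract only says that the reference policy is updated; the specific soft (interpolating) and hard (copying) rules, and the names `alpha`, `tau`, `soft_update`, and `hard_update`, are illustrative assumptions here, and parameters are plain lists of floats rather than real model weights.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trainable policy and under the reference policy.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def soft_update(ref_params, policy_params, alpha):
    """Assumed rule: move the reference a fraction alpha toward the policy."""
    return [(1.0 - alpha) * r + alpha * p
            for r, p in zip(ref_params, policy_params)]


def hard_update(policy_params):
    """Assumed rule: replace the reference with a copy of the policy."""
    return list(policy_params)


def maybe_update_reference(step, ref_params, policy_params,
                           alpha=None, tau=None):
    """Apply a soft update every step, or a hard update every tau steps.

    Both schedules are hypothetical; with alpha=None and tau=None the
    reference stays frozen, which corresponds to plain DPO.
    """
    if alpha is not None:
        return soft_update(ref_params, policy_params, alpha)
    if tau is not None and step > 0 and step % tau == 0:
        return hard_update(policy_params)
    return ref_params


# Toy usage: the "parameters" are just two numbers.
ref = [0.0, 1.0]
policy = [1.0, 1.0]
ref = maybe_update_reference(step=1, ref_params=ref,
                             policy_params=policy, alpha=0.5)
print(ref)
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

With a frozen reference, the loss keeps penalizing drift away from the original SFT policy. Letting the reference follow the policy moves the anchor for that implicit constraint along with training, which is the constraint the abstract argues holds DPO back.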

The real 800 lb gorilla in the room for bias and alignment is GIGO. Most of these models, if not all of them, are trained on "the Pile," an unvetted collection of data. Garbage in, garbage out doesn't suddenly take a vacation because the topic is artificial intelligence. All the digital acrobatics in the universe cannot solve the alignment and bias issue, because the bias is in the proverbial "seed of the plant." It doesn't matter how much "good" data you pour on top of bad data: band-aids never treat the cause, they only obscure the symptom.

What needs to happen is better-curated, vetted, authentic training data that doesn't contain trash like insane social media rants, blatant disinformation and misinformation, vulgar and pornographic material, etc. If you want AI that doesn't need constant alignment and bias training because it goes off the rails every few days, you are going to have to address the 800 lb gorilla at some point, regardless of the monetary or manpower costs. Otherwise, humanity is going to be undone by GIGO.