Iterative Reasoning Preference Optimization
Abstract
Iterative preference optimization methods have recently been shown to perform well on general instruction-tuning tasks, but typically yield little improvement on reasoning tasks (Yuan et al., 2024; Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates, preferring winning reasoning steps that lead to the correct answer over losing ones. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show that reasoning improves across repeated iterations of this scheme. Relying only on examples in the training set, our approach increases the accuracy of Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (and to 88.7% with majority voting over 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge, outperforming other Llama-2-based models that do not rely on additionally sourced datasets.
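The abstract describes the objective only in words: a DPO preference term over winning vs. losing CoT candidates plus an extra negative log-likelihood (NLL) term on the winning sequence. The PyTorch-style sketch below is a minimal illustration of what such a combined loss can look like; the function name, the `beta` and `nll_weight` parameters, and the normalization details are assumptions for illustration, not the paper's reference implementation.

```python
import torch.nn.functional as F

def iterative_rpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps,
                             chosen_nll, beta=0.1, nll_weight=1.0):
    """Sketch of a DPO loss with an added NLL term on the winning CoT.

    The *_logps arguments are summed sequence log-probabilities under the
    policy and the frozen reference model; chosen_nll is the (length-
    normalized) negative log-likelihood of the winning sequence.
    """
    # Standard DPO term: prefer the winning CoT over the losing one,
    # measured relative to the reference model.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo_loss = -F.logsigmoid(logits)

    # Extra NLL term keeps probability mass on the winning sequence.
    return (dpo_loss + nll_weight * chosen_nll).mean()
```

In this sketch, setting `nll_weight` to zero recovers plain DPO, which makes the role of the added NLL term easy to ablate.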
Community
Very similar to my Memphis approach: https://huggingface.co/euclaise/Memphis-CoT-3B
I also experimented with an extra loss over the gold response, but ended up dropping it.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Self-Explore to Avoid the Pit: Improving the Reasoning Capabilities of Language Models with Fine-grained Rewards (2024)
- Improving Language Model Reasoning with Self-motivated Learning (2024)
- Advancing LLM Reasoning Generalists with Preference Trees (2024)
- Teaching Large Language Models to Reason with Reinforcement Learning (2024)
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision (2024)
For the first term of the equation, it seems the prompt is included in the loss as well? That seems a bit counterintuitive.
Hi! Thanks for noticing this issue. We updated the paper a few days after arXiv v1: https://arxiv.org/pdf/2404.19733
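(For readers of this exchange: the toy function below only illustrates the distinction being asked about, i.e. computing the NLL term over the full prompt + response versus masking the prompt and scoring response tokens only. Names and shapes are made-up assumptions, not the paper's updated formulation.)

```python
import torch.nn.functional as F

def sequence_nll(logits, labels, prompt_len, include_prompt=False):
    """Toy NLL for a single sequence.

    logits: (T, V) next-token logits, assumed already aligned with labels.
    labels: (T,) target token ids for the concatenated prompt + response.
    prompt_len: number of leading tokens that belong to the prompt.
    """
    token_nll = F.cross_entropy(logits, labels, reduction="none")  # (T,)
    if not include_prompt:
        # Score only the response: drop the per-token losses on the prompt.
        token_nll = token_nll[prompt_len:]
    return token_nll.mean()
```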
When you refine the reference model for the iterative reasoning, do you re-initialize the optimizer and learning rate scheduler?