LLM Reasoning Papers
Papers on improving the reasoning capabilities of LLMs
Let's Verify Step by Step
Paper • 2305.20050 • Published • 10
Note 1. Ilya Sutskever, OpenAI-2023.5
A Survey of Reasoning with Foundation Models: https://arxiv.org/abs/2312.11562
Learning to Reason with LLMs (introduces o1): https://openai.com/index/learning-to-reason-with-llms/
Deliberative alignment: reasoning enables safer language models: https://openai.com/index/deliberative-alignment/
Build a model with an internal chain of reasoning through a better reward model, more training data, and a better reinforcement-learning pipeline.
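As a minimal sketch of the process-supervision recipe noted above (assuming a hypothetical sampler `generate_solutions` and step-level scorer `score_step` standing in for a trained PRM), the snippet below reranks sampled solutions by the product of their step-level correctness probabilities and keeps the best one.

```python
# Hedged sketch of PRM-based reranking in the spirit of "Let's Verify Step by Step".
# `generate_solutions` and `score_step` are hypothetical stand-ins for a sampler and a
# process reward model; they are not APIs from the paper.
import math
from typing import Callable, List

def best_of_n_with_prm(
    problem: str,
    generate_solutions: Callable[[str, int], List[List[str]]],  # N solutions, each a list of steps
    score_step: Callable[[str, List[str], str], float],         # P(step correct | problem, prefix)
    n: int = 16,
) -> List[str]:
    best, best_score = None, -math.inf
    for steps in generate_solutions(problem, n):
        # Score a solution by the product of step-level probabilities (sum of logs),
        # one common way to turn step scores into a solution-level score.
        log_score = sum(
            math.log(max(score_step(problem, steps[:i], step), 1e-9))
            for i, step in enumerate(steps)
        )
        if log_score > best_score:
            best, best_score = steps, log_score
    return best
```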
LLM Critics Help Catch LLM Bugs
Paper • 2407.00215 • Published
Note 2. OpenAI-2024.6
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Paper • 2407.21787 • Published • 12
Note 3. Google DeepMind-2024.9
Generative Verifiers: Reward Modeling as Next-Token Prediction
Paper • 2408.15240 • Published • 13
Note 4. Google DeepMind-2024.8. The key to self-play is the adversarial interplay between a generator and a verifier; the verifier can follow the LLM-as-a-judge pattern and should be updated together with the generator. The paper proposes a way to strengthen a generative reward model as the verifier: GenRM-CoT adds chain-of-thought to the verification step of the conventional generative reward model (GenRM), so it not only judges whether a solution is correct but also generates intermediate reasoning steps that explain in detail why the solution is correct or incorrect.
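A small sketch of the generative-verifier idea described above, assuming hypothetical `lm_generate` and `lm_next_token_probs` wrappers around an LLM: the verifier first writes a chain-of-thought critique, and the score is the relative probability of a "Yes" continuation.

```python
# Hedged sketch of GenRM-CoT-style verification: the verifier is an ordinary LLM that
# critiques the solution step by step and then answers Yes/No; the reward is the
# probability mass on "Yes". `lm_generate` and `lm_next_token_probs` are assumed wrappers.

def genrm_cot_score(question: str, solution: str, lm_generate, lm_next_token_probs) -> float:
    prompt = (
        f"Question:\n{question}\n\nProposed solution:\n{solution}\n\n"
        "Let's verify the solution step by step.\n"
    )
    critique = lm_generate(prompt, max_tokens=256)  # chain-of-thought verification rationale
    probs = lm_next_token_probs(prompt + critique + "\nIs the solution correct (Yes/No)? ")
    yes, no = probs.get("Yes", 0.0), probs.get("No", 0.0)
    return yes / max(yes + no, 1e-9)                # normalized P("Yes") as the verifier score
```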
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Paper • 2408.03314 • Published • 54
Note 5. Google DeepMind-2024.8. Studies how the base model affects test-time reasoning and proposes three PRM-based search methods: Best-of-N Weighted Search, Beam Search, and Lookahead Search.
Solving math word problems with process- and outcome-based feedback: https://arxiv.org/abs/2211.14275
That earlier paper compares ORM and PRM and concludes the difference is small; in hindsight the conclusion does not look solid.
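As a concrete example of the first search strategy listed in the note, here is a minimal sketch of Best-of-N Weighted selection: sample N solutions, score each with a verifier or PRM, and return the final answer whose samples accumulate the highest total score. The function signatures are illustrative.

```python
# Hedged sketch of Best-of-N Weighted selection (verifier-weighted majority voting).
from collections import defaultdict
from typing import Callable, List, Tuple

def best_of_n_weighted(
    samples: List[Tuple[str, str]],      # (full_solution, extracted_final_answer) pairs
    verifier: Callable[[str], float],    # assumed scalar score per solution (e.g. a PRM aggregate)
) -> str:
    totals = defaultdict(float)
    for solution, answer in samples:
        totals[answer] += verifier(solution)   # identical answers pool their verifier scores
    return max(totals, key=totals.get)         # answer with the highest accumulated score
```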
STaR: Bootstrapping Reasoning With Reasoning
Paper • 2203.14465 • Published • 8
Note 6. Google-2022.3
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 76
Note 7. Stanford-2024.3
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Paper • 2408.07199 • Published • 21
V-STaR: Training Verifiers for Self-Taught Reasoners
Paper • 2402.06457 • Published • 9
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
Paper • 2406.12050 • Published • 19
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance
Paper • 2405.06682 • Published • 3
Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue
Paper • 2311.07445 • Published
Reinforced Self-Training (ReST) for Language Modeling
Paper • 2308.08998 • Published • 2
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 146
Distilling System 2 into System 1
Paper • 2407.06023 • Published • 3
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Paper • 2402.12875 • Published • 13
An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Paper • 2408.00724 • Published • 1
Training Language Models to Self-Correct via Reinforcement Learning
Paper • 2409.12917 • Published • 136
Large Language Models Can Self-Improve
Paper • 2210.11610 • Published
Note Google / Jiawei Han-2022
Large Language Models are Zero-Shot Reasoners
Paper • 2205.11916 • Published • 1
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Paper • 2410.05229 • Published • 22
Note Apple-2024.10: argues that LLMs cannot genuinely reason
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
Paper • 2410.02884 • Published • 53
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Paper • 2410.02089 • Published • 12
Note Meta, code RL-2024.10
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 105
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Paper • 2411.14405 • Published • 58
Note Marco-o1, Alibaba
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Paper • 2501.04519 • Published • 232
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Paper • 2501.07301 • Published • 80
Titans: Learning to Memorize at Test Time
Paper • 2501.00663 • Published • 12
Note Google. Introduces a neural long-term memory module that memorizes historical context and helps attention attend to the current context while still drawing on information from the distant past, giving fast training while maintaining fast inference. The neural memory module acts as a longer-term, more persistent store than attention alone (which is effectively short-term memory). Built on this neural memory, the model shows good results on language modeling, common-sense reasoning, genomics, and time-series tasks, and scales to context windows beyond 2M tokens.
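A rough PyTorch sketch of the test-time memorization idea described in the note: a small MLP serves as long-term memory and is updated while reading the sequence by gradient steps on an associative recall loss, with a decay term for forgetting. The layer sizes, projections, and hyperparameters are illustrative and not the paper's exact design.

```python
# Hedged sketch of a neural long-term memory updated at test time (Titans-style idea).
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 256, lr: float = 1e-2, decay: float = 0.01):
        super().__init__()
        self.memory = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.to_k = nn.Linear(dim, dim, bias=False)   # key projection
        self.to_v = nn.Linear(dim, dim, bias=False)   # value projection
        self.lr, self.decay = lr, decay

    @torch.enable_grad()
    def write(self, x: torch.Tensor) -> None:
        """Memorize a chunk of tokens x of shape (seq, dim) via a gradient step at test time."""
        k, v = self.to_k(x), self.to_v(x)
        loss = ((self.memory(k) - v) ** 2).mean()     # associative recall ("surprise") loss
        grads = torch.autograd.grad(loss, list(self.memory.parameters()))
        with torch.no_grad():
            for p, g in zip(self.memory.parameters(), grads):
                p.mul_(1 - self.decay).add_(g, alpha=-self.lr)  # decay (forget), then update

    def read(self, q: torch.Tensor) -> torch.Tensor:
        """Retrieve memorized associations for queries q of shape (seq, dim)."""
        with torch.no_grad():
            return self.memory(q)
```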
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Paper • 2201.11903 • Published • 9
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Paper • 2305.15408 • Published
Prover-Verifier Games improve legibility of LLM outputs
Paper • 2407.13692 • Published • 1
Note OpenAI
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Paper • 1712.01815 • Published
Note AlphaZero paper, DeepMind: self-play and MCTS
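For reference, a minimal sketch of the PUCT action-selection rule that AlphaZero-style MCTS uses during self-play: choose the child maximizing the mean value plus an exploration bonus proportional to the policy prior. The `Node` container here is a simplified illustration, not the paper's implementation.

```python
# Hedged sketch of PUCT selection inside AlphaZero-style MCTS.
import math
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    prior: Dict[int, float]                           # P(s, a) from the policy network
    N: Dict[int, int] = field(default_factory=dict)   # visit counts
    W: Dict[int, float] = field(default_factory=dict) # accumulated values

def select_action(node: Node, c_puct: float = 1.5) -> int:
    total_visits = sum(node.N.get(a, 0) for a in node.prior)
    def puct(a: int) -> float:
        n, w = node.N.get(a, 0), node.W.get(a, 0.0)
        q = w / n if n else 0.0                                             # mean action value
        u = c_puct * node.prior[a] * math.sqrt(total_visits + 1) / (1 + n)  # exploration bonus
        return q + u
    return max(node.prior, key=puct)
```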
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
Paper • 2406.14532 • Published
Note Google. Uses negative examples identified by the verifier: adding negatives to reinforcement learning effectively strengthens LLM reasoning, and data efficiency reaches eight times that of training on positive examples alone.
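A minimal sketch of the core idea in the note, using a generic advantage-weighted log-likelihood objective rather than the paper's exact per-step credit assignment: verified-correct samples get a positive weight and incorrect samples a negative one, so negatives are used instead of discarded.

```python
# Hedged sketch: include verifier-labeled negatives in the policy objective.
import torch

def advantage_weighted_loss(logprobs: torch.Tensor, is_correct: torch.Tensor) -> torch.Tensor:
    """logprobs: (batch,) summed token log-probs of each sampled solution under the policy;
    is_correct: (batch,) 1.0 for verified-correct samples, 0.0 for incorrect ones."""
    advantage = torch.where(is_correct.bool(),
                            torch.ones_like(is_correct),
                            -torch.ones_like(is_correct))
    # Maximize likelihood of positives while pushing probability away from negatives.
    return -(advantage * logprobs).mean()
```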
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Paper • 2410.08146 • Published
Note Google DeepMind
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
Paper • 2406.06592 • Published • 27
Note Google DeepMind. To avoid expensive human annotation, uses an MCTS-based procedure to automatically locate where a solution's reasoning goes wrong and to generate training data for a process reward model.
Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision: https://arxiv.org/abs/2402.02658
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations: https://arxiv.org/abs/2312.08935
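A simplified sketch of the Monte Carlo rollout labeling behind this kind of automated process supervision: for each step prefix, roll out several completions and use the fraction that reach a correct final answer as a soft correctness label for that step. `rollout` and `is_correct` are hypothetical hooks; the actual papers use MCTS or binary search to find the first error more cheaply than labeling every prefix.

```python
# Hedged sketch of Monte Carlo step labeling for training a process reward model.
from typing import Callable, List

def label_steps(
    problem: str,
    steps: List[str],
    rollout: Callable[[str, List[str]], str],   # completes a solution from a step prefix
    is_correct: Callable[[str], bool],          # checks the final answer
    k: int = 8,
) -> List[float]:
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(is_correct(rollout(problem, prefix)) for _ in range(k))
        labels.append(wins / k)                 # soft PRM target for step i
    return labels
```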
Scaling Laws for Reward Model Overoptimization
Paper • 2210.10760 • Published
Note OpenAI: studies the relationship between reward-model quality, reward-model size, and the amount of training data.
Rule Based Rewards for Language Model Safety: https://arxiv.org/abs/2411.01111
Describes how rules applied to training data produce a reward model that then drives the reinforcement learning of the policy model. The paper's view: by learning from data, an LLM can pick up the logic a reward model encodes, whether that logic comes from human annotations or from explicit rules. A further question worth thinking about is whether applying the rules directly in the prompt differs from the rules -> reward model -> policy model RL pipeline.
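As a toy illustration of turning explicit rules into a reward signal (the rules and weights below are purely illustrative and not taken from the paper), a rule-based reward can be a weighted combination of boolean checks over a response:

```python
# Hedged sketch of a rule-based reward: weighted sum of boolean rule checks.
from typing import Callable, Dict

def rule_based_reward(response: str,
                      rules: Dict[str, Callable[[str], bool]],
                      weights: Dict[str, float]) -> float:
    return sum(weights[name] * float(check(response)) for name, check in rules.items())

# Toy usage with made-up rules and weights:
rules = {
    "refuses_politely": lambda r: "sorry" in r.lower(),
    "no_detailed_instructions": lambda r: "step-by-step" not in r.lower(),
}
weights = {"refuses_politely": 0.5, "no_detailed_instructions": 1.0}
print(rule_based_reward("I'm sorry, I can't help with that.", rules, weights))
```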
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
Paper • 2410.09918 • Published • 3
Note Meta: the key point is that, by constructing the training set appropriately and exploiting the model's ability to learn and recognize patterns, the model can decide on its own whether to unroll a chain of thought, which makes this a very useful practical recipe. The main idea is to prune the reasoning traces to some degree so that, when reasoning about complex problems, the model can take intuition-like fast shortcuts; in other words, System-1-style skipping is introduced into an otherwise System-2 mode of thinking. The idea comes from two observations: first, a Search Transformer trained on complete traces produces shorter reasoning traces than A* at inference time; second, under certain patterns humans rely on intuition and shortcuts, i.e., human thinking mixes System 1 and System 2.
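A small sketch of the randomized-trace data construction described in the note: parts of the reasoning trace are randomly dropped from each training example so the model sees both full "slow" traces and shortened "fast" ones. The uniform per-step dropout here is a simplification of the paper's structured trace-pruning strategies.

```python
# Hedged sketch of randomized reasoning-trace dropping for Dualformer-style training data.
import random
from typing import List

def randomize_trace(trace_steps: List[str], answer: str, p_drop: float = 0.5,
                    rng: random.Random = random.Random(0)) -> str:
    kept = [s for s in trace_steps if rng.random() > p_drop]   # randomly skip trace steps
    body = "\n".join(kept) if kept else "<skip-reasoning>"     # fully pruned = fast mode
    return f"{body}\n{answer}"
```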