LLM Reasoning Papers
Papers on improving the reasoning capabilities of LLMs
Let's Verify Step by Step
Paper • 2305.20050 • Published • 10
Note 1. Ilya Sutskever, OpenAI-2023.5
A Survey of Reasoning with Foundation Models: https://arxiv.org/abs/2312.11562
Learning to Reason with LLMs (introduces o1): https://openai.com/index/learning-to-reason-with-llms/
Deliberative alignment: reasoning enables safer language models: https://openai.com/index/deliberative-alignment/
Build a model with an internal chain of reasoning through a better reward model, more training data, and a better reinforcement-learning pipeline.
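As a minimal sketch of the process-supervision recipe noted above (assuming a hypothetical sampler `generate_solutions` and step-level scorer `score_step` standing in for a trained PRM), the snippet below reranks sampled solutions by the product of their step-level correctness probabilities and keeps the best one.

```python
# Hedged sketch of PRM-based reranking in the spirit of "Let's Verify Step by Step".
# `generate_solutions` and `score_step` are hypothetical stand-ins for a sampler and a
# process reward model; they are not APIs from the paper.
import math
from typing import Callable, List

def best_of_n_with_prm(
    problem: str,
    generate_solutions: Callable[[str, int], List[List[str]]],  # N solutions, each a list of steps
    score_step: Callable[[str, List[str], str], float],         # P(step correct | problem, prefix)
    n: int = 16,
) -> List[str]:
    best, best_score = None, -math.inf
    for steps in generate_solutions(problem, n):
        # Score a solution by the product of step-level probabilities (sum of logs),
        # one common way to turn step scores into a solution-level score.
        log_score = sum(
            math.log(max(score_step(problem, steps[:i], step), 1e-9))
            for i, step in enumerate(steps)
        )
        if log_score > best_score:
            best, best_score = steps, log_score
    return best
```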
LLM Critics Help Catch LLM Bugs
Paper • 2407.00215 • Published
Note 2. OpenAI-2024.6
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Paper • 2407.21787 • Published • 12
Note 3. Google DeepMind-2024.9
Generative Verifiers: Reward Modeling as Next-Token Prediction
Paper • 2408.15240 • Published • 13
Note 4. Google DeepMind-2024.8. The key to self-play is the adversarial interplay between a generator and a verifier; the verifier can follow the LLM-as-a-judge pattern and should be updated together with the generator. The paper proposes a way to strengthen a generative reward model as the verifier: GenRM-CoT adds chain-of-thought to the verification step of the conventional generative reward model (GenRM), so it not only judges whether a solution is correct but also generates intermediate reasoning steps that explain in detail why the solution is correct or incorrect.
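A small sketch of the generative-verifier idea described above, assuming hypothetical `lm_generate` and `lm_next_token_probs` wrappers around an LLM: the verifier first writes a chain-of-thought critique, and the score is the relative probability of a "Yes" continuation.

```python
# Hedged sketch of GenRM-CoT-style verification: the verifier is an ordinary LLM that
# critiques the solution step by step and then answers Yes/No; the reward is the
# probability mass on "Yes". `lm_generate` and `lm_next_token_probs` are assumed wrappers.

def genrm_cot_score(question: str, solution: str, lm_generate, lm_next_token_probs) -> float:
    prompt = (
        f"Question:\n{question}\n\nProposed solution:\n{solution}\n\n"
        "Let's verify the solution step by step.\n"
    )
    critique = lm_generate(prompt, max_tokens=256)  # chain-of-thought verification rationale
    probs = lm_next_token_probs(prompt + critique + "\nIs the solution correct (Yes/No)? ")
    yes, no = probs.get("Yes", 0.0), probs.get("No", 0.0)
    return yes / max(yes + no, 1e-9)                # normalized P("Yes") as the verifier score
```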
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Paper • 2408.03314 • Published • 54
Note 5. Google DeepMind-2024.8. Studies how the base model affects test-time reasoning and proposes three PRM-based search methods: Best-of-N Weighted Search, Beam Search, and Lookahead Search.
Solving math word problems with process- and outcome-based feedback: https://arxiv.org/abs/2211.14275
That earlier paper compares ORM and PRM and concludes the difference is small; in hindsight the conclusion does not look solid.
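As a concrete example of the first search strategy listed in the note, here is a minimal sketch of Best-of-N Weighted selection: sample N solutions, score each with a verifier or PRM, and return the final answer whose samples accumulate the highest total score. The function signatures are illustrative.

```python
# Hedged sketch of Best-of-N Weighted selection (verifier-weighted majority voting).
from collections import defaultdict
from typing import Callable, List, Tuple

def best_of_n_weighted(
    samples: List[Tuple[str, str]],      # (full_solution, extracted_final_answer) pairs
    verifier: Callable[[str], float],    # assumed scalar score per solution (e.g. a PRM aggregate)
) -> str:
    totals = defaultdict(float)
    for solution, answer in samples:
        totals[answer] += verifier(solution)   # identical answers pool their verifier scores
    return max(totals, key=totals.get)         # answer with the highest accumulated score
```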
STaR: Bootstrapping Reasoning With Reasoning
Paper • 2203.14465 • Published • 8
Note 6. Google-2022.3
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 76
Note 7. Stanford-2024.3
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Paper • 2408.07199 • Published • 21
V-STaR: Training Verifiers for Self-Taught Reasoners
Paper • 2402.06457 • Published • 9
Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
Paper • 2406.12050 • Published • 19
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance
Paper • 2405.06682 • Published • 3
Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue
Paper • 2311.07445 • Published
Reinforced Self-Training (ReST) for Language Modeling
Paper • 2308.08998 • Published • 2
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 146
Distilling System 2 into System 1
Paper • 2407.06023 • Published • 3
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Paper • 2402.12875 • Published • 13
An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Paper • 2408.00724 • Published • 1
Training Language Models to Self-Correct via Reinforcement Learning
Paper • 2409.12917 • Published • 136
Large Language Models Can Self-Improve
Paper • 2210.11610 • Published
Note Google / Jiawei Han-2022
Large Language Models are Zero-Shot Reasoners
Paper • 2205.11916 • Published • 1
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Paper • 2410.05229 • Published • 22
Note Apple-2024.10: argues that LLMs cannot genuinely reason
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
Paper • 2410.02884 • Published • 53
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
Paper • 2410.02089 • Published • 12
Note Meta, code RL-2024.10
Chain-of-Thought Reasoning Without Prompting
Paper • 2402.10200 • Published • 105
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Paper • 2411.14405 • Published • 58
Note Marco-o1, Alibaba
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Paper • 2501.04519 • Published • 232
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Paper • 2501.07301 • Published • 80
Titans: Learning to Memorize at Test Time
Paper • 2501.00663 • Published • 12
Note Google. Introduces a neural long-term memory module that memorizes historical context and helps attention attend to the current context while still drawing on information from the distant past, giving fast training while maintaining fast inference. The neural memory module acts as a longer-term, more persistent store than attention alone (which is effectively short-term memory). Built on this neural memory, the model shows good results on language modeling, common-sense reasoning, genomics, and time-series tasks, and scales to context windows beyond 2M tokens.
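A rough PyTorch sketch of the test-time memorization idea described in the note: a small MLP serves as long-term memory and is updated while reading the sequence by gradient steps on an associative recall loss, with a decay term for forgetting. The layer sizes, projections, and hyperparameters are illustrative and not the paper's exact design.

```python
# Hedged sketch of a neural long-term memory updated at test time (Titans-style idea).
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 256, lr: float = 1e-2, decay: float = 0.01):
        super().__init__()
        self.memory = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        self.to_k = nn.Linear(dim, dim, bias=False)   # key projection
        self.to_v = nn.Linear(dim, dim, bias=False)   # value projection
        self.lr, self.decay = lr, decay

    @torch.enable_grad()
    def write(self, x: torch.Tensor) -> None:
        """Memorize a chunk of tokens x of shape (seq, dim) via a gradient step at test time."""
        k, v = self.to_k(x), self.to_v(x)
        loss = ((self.memory(k) - v) ** 2).mean()     # associative recall ("surprise") loss
        grads = torch.autograd.grad(loss, list(self.memory.parameters()))
        with torch.no_grad():
            for p, g in zip(self.memory.parameters(), grads):
                p.mul_(1 - self.decay).add_(g, alpha=-self.lr)  # decay (forget), then update

    def read(self, q: torch.Tensor) -> torch.Tensor:
        """Retrieve memorized associations for queries q of shape (seq, dim)."""
        with torch.no_grad():
            return self.memory(q)
```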
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Paper • 2201.11903 • Published • 9
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
Paper • 2305.15408 • Published
Prover-Verifier Games improve legibility of LLM outputs
Paper • 2407.13692 • Published • 1
Note OpenAI
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Paper • 1712.01815 • Published
Note AlphaZero paper, DeepMind: self-play and MCTS
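For reference, a minimal sketch of the PUCT action-selection rule that AlphaZero-style MCTS uses during self-play: choose the child maximizing the mean value plus an exploration bonus proportional to the policy prior. The `Node` container here is a simplified illustration, not the paper's implementation.

```python
# Hedged sketch of PUCT selection inside AlphaZero-style MCTS.
import math
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    prior: Dict[int, float]                           # P(s, a) from the policy network
    N: Dict[int, int] = field(default_factory=dict)   # visit counts
    W: Dict[int, float] = field(default_factory=dict) # accumulated values

def select_action(node: Node, c_puct: float = 1.5) -> int:
    total_visits = sum(node.N.get(a, 0) for a in node.prior)
    def puct(a: int) -> float:
        n, w = node.N.get(a, 0), node.W.get(a, 0.0)
        q = w / n if n else 0.0                                             # mean action value
        u = c_puct * node.prior[a] * math.sqrt(total_visits + 1) / (1 + n)  # exploration bonus
        return q + u
    return max(node.prior, key=puct)
```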
RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
Paper • 2406.14532 • Published
Note Google. Uses negative examples identified by the verifier: adding negatives to reinforcement learning effectively strengthens LLM reasoning, and data efficiency reaches eight times that of training on positive examples alone.
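A minimal sketch of the core idea in the note, using a generic advantage-weighted log-likelihood objective rather than the paper's exact per-step credit assignment: verified-correct samples get a positive weight and incorrect samples a negative one, so negatives are used instead of discarded.

```python
# Hedged sketch: include verifier-labeled negatives in the policy objective.
import torch

def advantage_weighted_loss(logprobs: torch.Tensor, is_correct: torch.Tensor) -> torch.Tensor:
    """logprobs: (batch,) summed token log-probs of each sampled solution under the policy;
    is_correct: (batch,) 1.0 for verified-correct samples, 0.0 for incorrect ones."""
    advantage = torch.where(is_correct.bool(),
                            torch.ones_like(is_correct),
                            -torch.ones_like(is_correct))
    # Maximize likelihood of positives while pushing probability away from negatives.
    return -(advantage * logprobs).mean()
```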
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Paper • 2410.08146 • Published
Note Google DeepMind
Improve Mathematical Reasoning in Language Models by Automated Process Supervision
Paper • 2406.06592 • Published • 27
Note Google DeepMind. To avoid expensive human annotation, uses an MCTS-based procedure to automatically locate where a solution's reasoning goes wrong and to generate training data for a process reward model.
Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision: https://arxiv.org/abs/2402.02658
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations: https://arxiv.org/abs/2312.08935
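A simplified sketch of the Monte Carlo rollout labeling behind this kind of automated process supervision: for each step prefix, roll out several completions and use the fraction that reach a correct final answer as a soft correctness label for that step. `rollout` and `is_correct` are hypothetical hooks; the actual papers use MCTS or binary search to find the first error more cheaply than labeling every prefix.

```python
# Hedged sketch of Monte Carlo step labeling for training a process reward model.
from typing import Callable, List

def label_steps(
    problem: str,
    steps: List[str],
    rollout: Callable[[str, List[str]], str],   # completes a solution from a step prefix
    is_correct: Callable[[str], bool],          # checks the final answer
    k: int = 8,
) -> List[float]:
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(is_correct(rollout(problem, prefix)) for _ in range(k))
        labels.append(wins / k)                 # soft PRM target for step i
    return labels
```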
Scaling Laws for Reward Model Overoptimization
Paper • 2210.10760 • Published
Note OpenAI: studies the relationship between reward-model quality, reward-model size, and the amount of training data.
Rule Based Rewards for Language Model Safety: https://arxiv.org/abs/2411.01111
Describes how rules applied to training data produce a reward model that then drives the reinforcement learning of the policy model. The paper's view: by learning from data, an LLM can pick up the logic a reward model encodes, whether that logic comes from human annotations or from explicit rules. A further question worth thinking about is whether applying the rules directly in the prompt differs from the rules -> reward model -> policy model RL pipeline.
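As a toy illustration of turning explicit rules into a reward signal (the rules and weights below are purely illustrative and not taken from the paper), a rule-based reward can be a weighted combination of boolean checks over a response:

```python
# Hedged sketch of a rule-based reward: weighted sum of boolean rule checks.
from typing import Callable, Dict

def rule_based_reward(response: str,
                      rules: Dict[str, Callable[[str], bool]],
                      weights: Dict[str, float]) -> float:
    return sum(weights[name] * float(check(response)) for name, check in rules.items())

# Toy usage with made-up rules and weights:
rules = {
    "refuses_politely": lambda r: "sorry" in r.lower(),
    "no_detailed_instructions": lambda r: "step-by-step" not in r.lower(),
}
weights = {"refuses_politely": 0.5, "no_detailed_instructions": 1.0}
print(rule_based_reward("I'm sorry, I can't help with that.", rules, weights))
```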
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
Paper • 2410.09918 • Published • 3
Note Meta: the key point is that, by constructing the training set appropriately and exploiting the model's ability to learn and recognize patterns, the model can decide on its own whether to unroll a chain of thought, which makes this a very useful practical recipe. The main idea is to prune the reasoning traces to some degree so that, when reasoning about complex problems, the model can take intuition-like fast shortcuts; in other words, System-1-style skipping is introduced into an otherwise System-2 mode of thinking. The idea comes from two observations: first, a Search Transformer trained on complete traces produces shorter reasoning traces than A* at inference time; second, under certain patterns humans rely on intuition and shortcuts, i.e., human thinking mixes System 1 and System 2.
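A small sketch of the randomized-trace data construction described in the note: parts of the reasoning trace are randomly dropped from each training example so the model sees both full "slow" traces and shortened "fast" ones. The uniform per-step dropout here is a simplification of the paper's structured trace-pruning strategies.

```python
# Hedged sketch of randomized reasoning-trace dropping for Dualformer-style training data.
import random
from typing import List

def randomize_trace(trace_steps: List[str], answer: str, p_drop: float = 0.5,
                    rng: random.Random = random.Random(0)) -> str:
    kept = [s for s in trace_steps if rng.random() > p_drop]   # randomly skip trace steps
    body = "\n".join(kept) if kept else "<skip-reasoning>"     # fully pruned = fast mode
    return f"{body}\n{answer}"
```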