Leandro von Werra

lvwerra

https://github.com/lvwerra

AI & ML interests

NLP and RL

Recent Activity

liked a Space 10 days ago

data-agents/jupyter-agent

updated a Space 13 days ago

data-agents/jupyter-agent

Articles

LeMaterial: an open source initiative to accelerate materials discovery and research

22 days ago

• 31

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Jun 18, 2024

• 43

StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation

Apr 29, 2024

• 76

Welcome Llama 3 - Meta's new open LLM

Apr 18, 2024

• 281

Constitutional AI with Open LLMs

Feb 1, 2024

• 13

Preference Tuning LLMs with Direct Preference Optimization Methods

Jan 18, 2024

• 40

Welcome Mixtral - a SOTA Mixture of Experts on Hugging Face

Dec 11, 2023

• 11

The N Implementation Details of RLHF with PPO

Oct 24, 2023

• 22

Finetune Stable Diffusion Models with DDPO via TRL

Sep 29, 2023

• 7

Spread Your Wings: Falcon 180B is here

Sep 6, 2023

• 4

Code Llama: Llama 2 learns to code

Aug 25, 2023

• 9

The Falcon has landed in the Hugging Face ecosystem

Jun 5, 2023

• 10

Creating a Coding Assistant with StarCoder

May 9, 2023

• 1

StarCoder: A State-of-the-Art LLM for Code

May 4, 2023

• 38

StackLLaMA: A hands-on guide to train LLaMA with RLHF

Apr 5, 2023

• 22

Fine-tuning 20B LLMs with RLHF on a 24GB consumer GPU

Mar 9, 2023

• 35

Illustrating Reinforcement Learning from Human Feedback (RLHF)

Dec 9, 2022

• 122

Evaluating Language Model Bias with 🤗 Evaluate

Oct 24, 2022

• 3

Announcing Evaluation on the Hub

Jun 28, 2022

lvwerra's activity

reacted to lewtun's post with 🔥 2 days ago

This paper (HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs, https://huggingface.co/papers/2412.18925) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine where there are lots of aliases)
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait", etc., of the kind one sees in o1
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training

They find RL gives an average ~3 point boost across medical benchmarks, and SFT on this data already gives a strong improvement.

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
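
The data-construction loop in the recipe above can be sketched roughly as follows. This is a toy illustration, not the paper's code: `build_trajectory`, `merge_attempts`, `sparse_reward`, the strategy names, and the `toy_generate` / `toy_verify` stand-ins are all hypothetical. In the post, both the verifier and the step that rewrites the attempts into one stream are GPT-4o calls; here an alias table and a plain string join take their place so the example runs offline.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Illustrative placeholder names for the "mix of different search strategies".
STRATEGIES = ["backtrack", "explore_new_path", "verify", "correct"]


@dataclass
class Trajectory:
    question: str
    attempts: List[str] = field(default_factory=list)
    solved: bool = False


def build_trajectory(
    question: str,
    reference: str,
    generate: Callable[[str, List[str], str], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 4,
    rng: Optional[random.Random] = None,
) -> Trajectory:
    """Sample CoTs until the verifier accepts one or the budget runs out."""
    rng = rng or random.Random(0)
    traj = Trajectory(question)
    strategy = "initial"
    for _ in range(max_attempts):
        cot = generate(question, traj.attempts, strategy)
        traj.attempts.append(cot)
        if verify(cot, reference):
            traj.solved = True
            break
        # Failed attempt: pick a search strategy for the next try.
        strategy = rng.choice(STRATEGIES)
    return traj


def merge_attempts(traj: Trajectory, connective: str = "Hmm, wait.") -> str:
    """Stand-in for the LLM rewrite: join attempts into one stream."""
    return f" {connective} ".join(traj.attempts)


def sparse_reward(answer: str, reference: str,
                  verify: Callable[[str, str], bool]) -> float:
    """Sparse RL reward: 1.0 if the verifier accepts the answer, else 0.0."""
    return 1.0 if verify(answer, reference) else 0.0


# Toy stand-ins (assumptions) so the sketch runs without any model.
ALIASES = {"acetaminophen": {"acetaminophen", "paracetamol"}}


def toy_generate(question: str, previous: List[str], strategy: str) -> str:
    # Wrong on the first attempt, an alias of the right answer afterwards.
    return "Answer: paracetamol" if previous else "Answer: ibuprofen"


def toy_verify(cot: str, reference: str) -> bool:
    # Alias-aware check, since exact match would reject "paracetamol".
    answer = cot.split("Answer:")[-1].strip().lower()
    return answer in ALIASES.get(reference, {reference})


traj = build_trajectory(
    "Which drug is a common first-line antipyretic?",
    "acetaminophen",
    toy_generate,
    toy_verify,
)
sft_text = merge_attempts(traj)
print(traj.solved, sft_text)
```

Only trajectories with `solved == True` would be kept as SFT data in this sketch, and `sparse_reward` would score sampled answers during RL. The alias table shows why exact match falls short here: the accepted answer differs textually from the reference.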