great
CoffeeBliss
CoffeeBliss
AI & ML interests
None yet
Recent Activity
liked
a model
7 days ago
bartowski/HuatuoGPT-o1-8B-GGUF
Organizations
None yet
CoffeeBliss's activity
reacted to
lewtun's
post with 🔥
5 days ago
Post
1986
This paper (
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:
* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine where there are lots of aliases)
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc that one sees in o1
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks and SFT on this data already gives a strong improvement.
Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine where there are lots of aliases)
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc that one sees in o1
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks and SFT on this data already gives a strong improvement.
Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
question for bartowski
2
#15 opened 7 days ago
by
CoffeeBliss
upvoted
a
collection
7 days ago
upvoted
a
collection
12 days ago