Tags: Text Generation · Transformers · PyTorch · English · llama · conversational · text-generation-inference · Inference Endpoints
hamishivi committed on
Commit 67706ba · verified · 1 Parent(s): 207011d

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -7,7 +7,7 @@ datasets:
 - allenai/tulu-v2-sft-mixture
 language:
 - en
-base_model: allenai/tulu-v2.5-13b-preference-mix-rm
+base_model: allenai/tulu-2-13b
 license: apache-2.0
 ---
 <center>
@@ -18,9 +18,9 @@ license: apache-2.0
 
 Tulu is a series of language models that are trained to act as helpful assistants.
 Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tulu 2 suite](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
-This model is trained on the UltraFeedback dataset (using the per-aspect/fine-grained scores for deciding chosen and rejected) using PPO.
-It was initialised from the [Tulu v2.5 13B preference mixture RM](https://huggingface.co/allenai/tulu-v2.5-13b-preference-mix-rm).
-We used a 13B RM trained on our preference data mix, and then used the UltraFeedback prompts during PPO training.
+This model is trained using PPO.
+The reward model used during training was the [Tulu v2.5 13B preference mixture RM](https://huggingface.co/allenai/tulu-v2.5-13b-preference-mix-rm).
+We then used UltraFeedback prompts during PPO training.
 
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
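For quick reference, below is a minimal usage sketch (not part of this commit) for loading a Tulu-style chat model with 🤗 Transformers and generating a reply. The model id is a stand-in: the commit view does not state this repository's id, so the base model `allenai/tulu-2-13b` from the diff is used as a placeholder, and the snippet assumes the tokenizer ships a chat template (Tulu uses a `<|user|>`/`<|assistant|>` format).

```python
# Minimal sketch: load a Tulu-style chat model and generate a reply.
# The model id is a placeholder taken from the diff's base_model field;
# substitute this repository's own PPO-trained checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/tulu-2-13b"  # placeholder; swap in this repo's model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain what PPO does in RLHF."}]
# Assumes a chat template is bundled with the tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```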