Tags: Text Generation · Transformers · PyTorch · English · llama · conversational · text-generation-inference · Inference Endpoints
hamishivi committed on
Commit 67706ba · verified · 1 Parent(s): 207011d

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -7,7 +7,7 @@ datasets:
 - allenai/tulu-v2-sft-mixture
 language:
 - en
-base_model: allenai/tulu-v2.5-13b-preference-mix-rm
+base_model: allenai/tulu-2-13b
 license: apache-2.0
 ---
 <center>
@@ -18,9 +18,9 @@ license: apache-2.0
 
 Tulu is a series of language models that are trained to act as helpful assistants.
 Tulu V2.5 is a series of models trained using DPO and PPO starting from the [Tulu 2 suite](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
-This model is trained on the UltraFeedback dataset (using the per-aspect/fine-grained scores for deciding chosen and rejected) using PPO.
-It was initialised from the [Tulu v2.5 13B preference mixture RM](https://huggingface.co/allenai/tulu-v2.5-13b-preference-mix-rm).
-We used a 13B RM trained on our preference data mix, and then used the UltraFeedback prompts during PPO training.
+This model is trained using PPO.
+The reward model used during training was the [Tulu v2.5 13B preference mixture RM](https://huggingface.co/allenai/tulu-v2.5-13b-preference-mix-rm).
+We then used UltraFeedback prompts during PPO training.
 
 For more details, read the paper:
 [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://link.todo).
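For quick reference, below is a minimal usage sketch (not part of this commit) for loading a Tulu-style chat model with 🤗 Transformers and generating a reply. The model id is a stand-in: the commit view does not state this repository's id, so the base model `allenai/tulu-2-13b` from the diff is used as a placeholder, and the snippet assumes the tokenizer ships a chat template (Tulu uses a `<|user|>`/`<|assistant|>` format).

```python
# Minimal sketch: load a Tulu-style chat model and generate a reply.
# The model id is a placeholder taken from the diff's base_model field;
# substitute this repository's own PPO-trained checkpoint id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/tulu-2-13b"  # placeholder; swap in this repo's model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain what PPO does in RLHF."}]
# Assumes a chat template is bundled with the tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```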