This is an 8B reward model used for PPO training, trained on the UltraFeedback dataset.
For more details, read the paper:
[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
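The paper linked above studies PPO training; in PPO-style RLHF, a reward model like this one scores each sampled response, and that score is commonly combined with a KL penalty against a reference policy. A minimal sketch of that shaping step (the function name, `beta` value, and all numbers are illustrative assumptions, not the paper's exact recipe):

```python
def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.05) -> float:
    """Reward-model score minus a KL-style penalty that discourages the
    policy from drifting too far from the reference model.
    All names and the beta value here are illustrative, not from the paper."""
    kl_estimate = logp_policy - logp_ref  # per-sample log-probability ratio
    return rm_score - beta * kl_estimate
```

When the policy and reference assign the same log-probability, the penalty vanishes and the shaped reward equals the reward model's score.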
## Performance
We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):

| Model | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
|------------------|-------|-------|-----------|--------|-----------|-------------------------|
| **[Llama 3 Tulu 2 8b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-8b-uf-mean-rm) (this model)** | 66.3 | 96.6 | 59.4 | 61.4 | 80.7 | |
| [Llama 3 Tulu 2 70b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-70b-uf-mean-rm) | | | | | | |

## Model description
- **Model type:** A reward model trained on UltraFeedback, designed to be used in RLHF training.
- **Language(s) (NLP):** English
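As a rough illustration of the training setup described above (not the actual training code for this model), reward models fit on preference data such as UltraFeedback commonly use a pairwise Bradley-Terry objective over chosen/rejected response pairs:

```python
import math

def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry preference loss for one pair:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the chosen response's scalar reward
    exceeds the rejected one's. Illustrative sketch only."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2 ≈ 0.693; it falls toward zero as the chosen reward pulls ahead of the rejected reward.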