This is an 8B reward model used for PPO training, trained on the UltraFeedback dataset.
For more details, read the paper:
[Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).
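The paper linked above studies PPO training; in PPO-style RLHF, a reward model like this one scores each sampled response, and that score is commonly combined with a KL penalty against a reference policy. A minimal sketch of that shaping step (the function name, `beta` value, and all numbers are illustrative assumptions, not the paper's exact recipe):

```python
def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.05) -> float:
    """Reward-model score minus a KL-style penalty that discourages the
    policy from drifting too far from the reference model.
    All names and the beta value here are illustrative, not from the paper."""
    kl_estimate = logp_policy - logp_ref  # per-sample log-probability ratio
    return rm_score - beta * kl_estimate
```

When the policy and reference assign the same log-probability, the penalty vanishes and the shaped reward equals the reward model's score.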
## Performance
We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):

| Model | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
|------------------|-------|-------|-----------|--------|-----------|-------------------------|
| **[Llama 3 Tulu 2 8b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-8b-uf-mean-rm) (this model)** | 66.3 | 96.6 | 59.4 | 61.4 | 80.7 | |
| [Llama 3 Tulu 2 70b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-70b-uf-mean-rm) | | | | | | |

## Model description
- **Model type:** A reward model trained on UltraFeedback, designed to be used in RLHF training.
- **Language(s) (NLP):** English
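As a rough illustration of the training setup described above (not the actual training code for this model), reward models fit on preference data such as UltraFeedback commonly use a pairwise Bradley-Terry objective over chosen/rejected response pairs:

```python
import math

def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Bradley-Terry preference loss for one pair:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the chosen response's scalar reward
    exceeds the rejected one's. Illustrative sketch only."""
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2 ≈ 0.693; it falls toward zero as the chosen reward pulls ahead of the rejected reward.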