Commit 66f2706 (verified) by hamishivi · 1 Parent(s): c4df844

Update README.md

Files changed (1): README.md +10 -1
@@ -22,8 +22,17 @@ This is a 8B reward model used for PPO training trained on the UltraFeedback dat
  For more details, read the paper:
  [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https://arxiv.org/abs/2406.09279).

+ ## Performance

- ## .Model description
+ We evaluate the model on [RewardBench](https://github.com/allenai/reward-bench):
+
+ | Model | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
+ |------------------|-------|-------|-----------|--------|-----------|-------------------------|
+ | **[Llama 3 Tulu 2 8b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-8b-uf-mean-rm) (this model)** | 66.3 | 96.6 | 59.4 | 61.4 | 80.7 | |
+ | [Llama 3 Tulu 2 70b UF RM](https://huggingface.co/allenai/llama-3-tulu-2-70b-uf-mean-rm) | | | | | | |
+
+
+ ## Model description

  - **Model type:** A reward model trained on UltraFeedback, designed to be used in RLHF training.
  - **Language(s) (NLP):** English