natolambert commited on
Commit
65e180d
·
1 Parent(s): b7aaef4
Files changed (1) hide show
  1. src/md.py +15 -5
src/md.py CHANGED
@@ -2,13 +2,23 @@ ABOUT_TEXT = """
2
  We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
3
  A win is when the score for the chosen response is higher than the score for the rejected response.
4
 
 
 
5
  We average over 4 core sections (per prompt weighting):
6
- 1. Chat: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
7
- 2. Chat Hard: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
8
- 3. Safety: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
9
- 4. Code: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
 
 
 
 
 
 
 
 
10
 
11
- ## Subset Summary
12
 
13
  Total number of the prompts is: 2538, filtered from 4676.
14
 
 
2
  We compute the win percentage for a reward model on hand curated chosen-rejected pairs for each prompt.
3
  A win is when the score for the chosen response is higher than the score for the rejected response.
4
 
5
+ ## Overview
6
+
7
  We average over 4 core sections (per prompt weighting):
8
+ 1. **Chat**: Includes the easy chat subsets (alpacaeval-easy, alpacaeval-length, alpacaeval-hard, mt-bench-easy, mt-bench-medium)
9
+ 2. **Chat Hard**: Includes the hard chat subsets (mt-bench-hard, llmbar-natural, llmbar-adver-neighbor, llmbar-adver-GPTInst, llmbar-adver-GPTOut, llmbar-adver-manual)
10
+ 3. **Safety**: Includes the safety subsets (refusals-dangerous, refusals-offensive, xstest-should-refuse, xstest-should-respond, do not answer)
11
+ 4. **Code**: Includes the code subsets (hep-cpp, hep-go, hep-java, hep-js, hep-python, hep-rust)
12
+
13
+ We include multiple types of reward models in this evaluation:
14
+ 1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
15
+ 2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
16
+ 3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed.
17
+ 4. **Random**: Random choice baseline.
18
+
19
+ Others, such as **Generative Judge** are coming soon.
20
 
21
+ ### Subset Details
22
 
23
  Total number of the prompts is: 2538, filtered from 4676.
24