natolambert
committed on
Update src/md.py
src/md.py
CHANGED
@@ -20,22 +20,13 @@ Once all subsets weighted averages are achieved, the final RewardBench score is
|
|
20 |
We include multiple types of reward models in this evaluation:
|
21 |
1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
|
22 |
2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
|
23 |
-
3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed.
|
24 |
4. **Random**: Random choice baseline.
|
25 |
5. **Generative**: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.
|
26 |
|
27 |
All models are evaluated in fp16 except for Starling-7B, which is evaluated in fp32.
|
28 |
Others, such as **Generative Judge**, are coming soon.
|
29 |
|
30 |
-
### Model Types
|
31 |
-
|
32 |
-
Currently, we evaluate the following model types:
|
33 |
-
1. **Sequence Classifiers**: A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
|
34 |
-
2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
|
35 |
-
3. **DPO**: Models trained with Direct Preference Optimization (DPO) with a reference model being either the base or supervised fine-tuning checkpoint.
|
36 |
-
|
37 |
-
Support of DPO models without a reference model is coming soon.
|
38 |
-
|
39 |
### Subset Details
|
40 |
|
41 |
Total number of the prompts is: 2985, filtered from 5123.
|
|
|
20 |
We include multiple types of reward models in this evaluation:
|
21 |
1. **Sequence Classifiers** (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.
|
22 |
2. **Custom Classifiers**: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).
|
23 |
+
3. **DPO**: Models trained with Direct Preference Optimization (DPO), with modifiers such as `-ref-free` or `-norm` changing how scores are computed. *Note*: This also includes other models trained with implicit rewards, such as those trained with [KTO](https://arxiv.org/abs/2402.01306).
|
24 |
4. **Random**: Random choice baseline.
|
25 |
5. **Generative**: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.
|
26 |
|
27 |
All models are evaluated in fp16 except for Starling-7B, which is evaluated in fp32.
|
28 |
Others, such as **Generative Judge**, are coming soon.
|
29 |
|
30 |
### Subset Details
|
31 |
|
32 |
Total number of the prompts is: 2985, filtered from 5123.
|
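A note on the **DPO** entry in this diff: the text says that modifiers such as `-ref-free` or `-norm` change how scores are computed, but it does not say how. The sketch below is one plausible reading, not the repository's implementation. It assumes the implicit reward is the summed log-probability of a response under the policy minus the same sum under a reference model, that `-ref-free` drops the reference term, and that `-norm` divides by response length. The function names, the `beta` scale, and the `length_normalize` flag are all hypothetical, and plain floats stand in for per-token log-probabilities that a real run would obtain from the models.

```python
from typing import Optional, Sequence


def implicit_reward(
    policy_logps: Sequence[float],
    ref_logps: Optional[Sequence[float]] = None,
    beta: float = 1.0,
    length_normalize: bool = False,
) -> float:
    """Score one response from per-token log-probabilities.

    Assumptions (not confirmed by the diff):
    - ref_logps=None stands in for a "-ref-free" variant.
    - length_normalize=True stands in for a "-norm" variant.
    - beta is a hypothetical scale factor.
    """
    score = sum(policy_logps)
    if ref_logps is not None:
        if len(ref_logps) != len(policy_logps):
            raise ValueError("policy and reference must cover the same tokens")
        # Subtract the reference model's log-probability of the same response.
        score -= sum(ref_logps)
    if length_normalize:
        # Average over tokens so longer responses are not penalized by length alone.
        score /= max(len(policy_logps), 1)
    return beta * score


def prefers_chosen(chosen_reward: float, rejected_reward: float) -> bool:
    """A pair counts as correct when the chosen response scores strictly higher."""
    return chosen_reward > rejected_reward


# Toy log-probabilities for a chosen and a rejected response (made-up values).
chosen = implicit_reward([-0.5, -0.5], ref_logps=[-1.0, -1.0])
rejected = implicit_reward([-2.0, -1.0], ref_logps=[-1.0, -1.0])
is_correct = prefers_chosen(chosen, rejected)
```

Under these assumptions, the same pairwise comparison applies to every variant; only the per-response score changes.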