jacobmorrison committed on
Commit
2b14846
·
verified ·
1 Parent(s): b8600e1

Update README.md

  1. README.md +177 -211
README.md CHANGED
@@ -1,215 +1,181 @@
1
  ---
2
- language: en
3
- model-index:
4
- - name: allenai/open_instruct_dev
5
- results:
6
- - task:
7
- type: preference_evaluation
8
- dataset:
9
- name: reward-bench
10
- type: allenai/reward-bench
11
- metrics:
12
- - type: accuracy
13
- value: 1.0
14
- - type: accuracy
15
- value: 1.0
16
- - type: accuracy
17
- value: 1.0
18
- - type: accuracy
19
- value: 1.0
20
  ---
21
 
22
- # Model Card for allenai/open_instruct_dev
23
 
24
- <!-- Provide a quick summary of what the model is/does. -->
25
-
26
-
27
-
28
- ## Model Details
29
-
30
- ### Model Description
31
-
32
- <!-- Provide a longer summary of what this model is. -->
33
-
34
-
35
-
36
- - **Developed by:** [More Information Needed]
37
- - **Funded by [optional]:** [More Information Needed]
38
- - **Shared by [optional]:** [More Information Needed]
39
- - **Model type:** [More Information Needed]
40
- - **Language(s) (NLP):** en
41
- - **License:** [More Information Needed]
42
- - **Finetuned from model [optional]:** [More Information Needed]
43
-
44
- ### Model Sources [optional]
45
-
46
- <!-- Provide the basic links for the model. -->
47
-
48
- - **Repository:** [More Information Needed]
49
- - **Paper [optional]:** [More Information Needed]
50
- - **Demo [optional]:** [More Information Needed]
51
-
52
- ## Uses
53
-
54
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
55
-
56
- ### Direct Use
57
-
58
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
59
-
60
- [More Information Needed]
61
-
62
- ### Downstream Use [optional]
63
-
64
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
65
-
66
- [More Information Needed]
67
-
68
- ### Out-of-Scope Use
69
-
70
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
71
-
72
- [More Information Needed]
73
-
74
- ## Bias, Risks, and Limitations
75
-
76
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
77
-
78
- [More Information Needed]
79
-
80
- ### Recommendations
81
-
82
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
83
-
84
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
85
-
86
- ## How to Get Started with the Model
87
-
88
- Use the code below to get started with the model.
89
-
90
- [More Information Needed]
91
-
92
- ## Training Details
93
-
94
- ### Training Data
95
-
96
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
97
-
98
- [More Information Needed]
99
-
100
- ### Training Procedure
101
-
102
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
103
-
104
- #### Preprocessing [optional]
105
-
106
- [More Information Needed]
107
-
108
-
109
- #### Training Hyperparameters
110
-
111
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
112
-
113
- #### Speeds, Sizes, Times [optional]
114
-
115
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
116
-
117
- [More Information Needed]
118
-
119
- ## Evaluation
120
-
121
- <!-- This section describes the evaluation protocols and provides the results. -->
122
-
123
- ### Testing Data, Factors & Metrics
124
-
125
- #### Testing Data
126
-
127
- <!-- This should link to a Dataset Card if possible. -->
128
-
129
- [More Information Needed]
130
-
131
- #### Factors
132
-
133
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
134
-
135
- [More Information Needed]
136
-
137
- #### Metrics
138
-
139
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
140
-
141
- [More Information Needed]
142
-
143
- ### Results
144
-
145
- [More Information Needed]
146
-
147
- #### Summary
148
-
149
-
150
-
151
- ## Model Examination [optional]
152
-
153
- <!-- Relevant interpretability work for the model goes here -->
154
-
155
- [More Information Needed]
156
-
157
- ## Environmental Impact
158
-
159
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
160
-
161
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
162
-
163
- - **Hardware Type:** [More Information Needed]
164
- - **Hours used:** [More Information Needed]
165
- - **Cloud Provider:** [More Information Needed]
166
- - **Compute Region:** [More Information Needed]
167
- - **Carbon Emitted:** [More Information Needed]
168
-
169
- ## Technical Specifications [optional]
170
-
171
- ### Model Architecture and Objective
172
-
173
- [More Information Needed]
174
-
175
- ### Compute Infrastructure
176
-
177
- [More Information Needed]
178
-
179
- #### Hardware
180
-
181
- [More Information Needed]
182
-
183
- #### Software
184
-
185
- [More Information Needed]
186
-
187
- ## Citation [optional]
188
-
189
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
190
-
191
- **BibTeX:**
192
-
193
- [More Information Needed]
194
-
195
- **APA:**
196
-
197
- [More Information Needed]
198
-
199
- ## Glossary [optional]
200
-
201
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
202
-
203
- [More Information Needed]
204
-
205
- ## More Information [optional]
206
-
207
- [More Information Needed]
208
-
209
- ## Model Card Authors [optional]
210
-
211
- [More Information Needed]
212
-
213
- ## Model Card Contact
214
-
215
- [More Information Needed]
 
1
  ---
2
+ license: llama3.1
3
+ language:
4
+ - en
5
+ pipeline_tag: text-generation
6
+ datasets:
7
+ - allenai/llama-3.1-tulu-3-70B-preference-mixture
8
+ base_model:
9
+ - allenai/Llama-3.1-Tulu-3-70B-SFT
10
  ---
11
 
12
+ <img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu3/Tulu3-logo.png" alt="Tulu 3 banner" width="800" style="margin-left:auto; margin-right:auto; display:block"/>
13
 
14
+ # Llama-3.1-Tulu-3-70B-DPO
15
+
16
+ Tülu 3 is a leading instruction-following model family, offering fully open-source data, code, and recipes designed to serve as a comprehensive guide for modern post-training techniques.
17
+ Tülu 3 is designed for state-of-the-art performance on a diverse range of tasks beyond chat, such as MATH, GSM8K, and IFEval.
18
+
19
+ ## Model description
20
+
21
+ - **Model type:** A model trained on a mix of publicly available, synthetic and human-created datasets.
22
+ - **Language(s) (NLP):** Primarily English
23
+ - **License:** Llama 3.1 Community License Agreement
24
+ - **Finetuned from model:** allenai/Llama-3.1-Tulu-3-70B-SFT
25
+
26
+ ### Model Sources
27
+
28
+ - **Training Repository:** https://github.com/allenai/open-instruct
29
+ - **Eval Repository:** https://github.com/allenai/olmes
30
+ - **Paper:** https://allenai.org/papers/tulu-3-report.pdf (arXiv soon)
31
+ - **Demo:** https://playground.allenai.org/
32
+
33
+ ### Model Family
34
+
35
+ | **Stage** | **Llama 3.1 8B** | **Llama 3.1 70B** |
36
+ |----------------------|----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
37
+ | **Base Model** | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) |
38
+ | **SFT** | [allenai/Llama-3.1-Tulu-3-8B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT) | [allenai/Llama-3.1-Tulu-3-70B-SFT](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-SFT) |
39
+ | **DPO** | [allenai/Llama-3.1-Tulu-3-8B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-DPO) | [allenai/Llama-3.1-Tulu-3-70B-DPO](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B-DPO) |
40
+ | **Final Models (RLVR)** | [allenai/Llama-3.1-Tulu-3-8B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B) | [allenai/Llama-3.1-Tulu-3-70B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B) |
41
+ | **Reward Model (RM)**| [allenai/Llama-3.1-Tulu-3-8B-RM](https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-RM) | (Same as 8B) |
42
+
43
+ ## Using the model
44
+
45
+ ### Loading with HuggingFace
46
+
47
+ To load the model with HuggingFace, use the following snippet:
48
+ ```
49
+ from transformers import AutoModelForCausalLM
50
+
51
+ tulu_model = AutoModelForCausalLM.from_pretrained("allenai/Llama-3.1-Tulu-3-70B-DPO")
52
+ ```
53
+
54
+ ### vLLM
55
+
56
+ Because it is built on Llama, the model can be served directly with vLLM:
57
+ ```
58
+ vllm serve allenai/Llama-3.1-Tulu-3-70B-DPO
59
+ ```
60
+ Note that given the long chat template of Llama, you may want to use `--max_model_len=8192`.
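Once the server is running, it can be queried over HTTP. The sketch below assumes the server was started with the `vllm serve` command above on the local machine and is listening on vLLM's default port (8000) with its OpenAI-compatible chat endpoint; the helper names `build_chat_request` and `ask` are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running locally on its default port (8000).
BASE_URL = "http://localhost:8000/v1"
MODEL = "allenai/Llama-3.1-Tulu-3-70B-DPO"


def build_chat_request(user_message, max_tokens=256):
    """Build (but do not send) a chat completion request for the served model."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_message):
    """Send the request and return the text of the first completion."""
    with urllib.request.urlopen(build_chat_request(user_message)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("How are you doing?")` requires the server to be up; `build_chat_request` can be inspected without one.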
61
+
62
+ ### Chat template
63
+
64
+ The chat template for our models is formatted as:
65
+ ```
66
+ <|user|>\nHow are you doing?\n<|assistant|>\nI'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
67
+ ```
68
+ Or with new lines expanded:
69
+ ```
70
+ <|user|>
71
+ How are you doing?
72
+ <|assistant|>
73
+ I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
74
+ ```
75
+ The template is also embedded in the tokenizer, so `tokenizer.apply_chat_template` applies it for you.
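To make the format concrete, here is a minimal hand-written sketch that reproduces the single-turn string shown above. The function name `format_single_turn` is hypothetical, and ending the prompt after the assistant tag when no reply is given is an assumption about how a generation prompt would look; in practice the tokenizer's own template should be preferred.

```python
EOS_TOKEN = "<|endoftext|>"


def format_single_turn(user_message, assistant_message=None):
    """Render one user turn, and optionally the reply, in the format shown above."""
    prompt = f"<|user|>\n{user_message}\n<|assistant|>\n"
    if assistant_message is None:
        # Assumption: with no reply yet, stop after the assistant tag
        # so that the model continues from this point.
        return prompt
    return prompt + assistant_message + EOS_TOKEN


example = format_single_turn(
    "How are you doing?",
    "I'm just a computer program, so I don't have feelings, "
    "but I'm functioning as expected. How can I assist you today?",
)
print(example)
```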
76
+
77
+ ### System prompt
78
+
79
+ In Ai2 demos, we use this system prompt by default:
80
+ ```
81
+ You are Tulu 3, a helpful and harmless AI Assistant built by the Allen Institute for AI.
82
+ ```
83
+ The model has not been trained with a specific system prompt in mind.
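Since no system prompt is baked in, supplying one is left to the caller. A small sketch, assuming the common role/content message dictionaries and using a hypothetical helper name `with_system_prompt`, that prepends the demo prompt only when the conversation does not already start with a system message:

```python
DEFAULT_SYSTEM_PROMPT = (
    "You are Tulu 3, a helpful and harmless AI Assistant "
    "built by the Allen Institute for AI."
)


def with_system_prompt(messages, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Return a copy of `messages` that starts with a system message."""
    if messages and messages[0]["role"] == "system":
        # Respect a system prompt the caller already chose.
        return list(messages)
    return [{"role": "system", "content": system_prompt}] + list(messages)


messages = with_system_prompt([{"role": "user", "content": "How are you doing?"}])
```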
84
+
85
+ ### Bias, Risks, and Limitations
86
+
87
+ The Tülu 3 models have limited safety training and, unlike ChatGPT, are not deployed automatically with in-the-loop filtering of responses, so they can produce problematic outputs (especially when prompted to do so).
88
+ The size and composition of the corpus used to train the base Llama 3.1 models are also unknown; however, it likely included a mix of web data and technical sources such as books and code.
89
+ See the Falcon 180B model card for an example of this.
90
+
91
+
92
+ ## Performance
93
+
94
+ | Benchmark (eval) | Tülu 3 SFT 8B | Tülu 3 DPO 8B | Tülu 3 8B | Llama 3.1 8B Instruct | Qwen 2.5 7B Instruct | Magpie 8B | Gemma 2 9B Instruct | Ministral 8B Instruct |
95
+ |---------------------------------|----------------|----------------|------------|------------------------|----------------------|-----------|---------------------|-----------------------|
96
+ | **Avg.** | 60.4 | 64.4 | **64.8** | 62.2 | 57.8 | 44.7 | 55.2 | 58.3 |
97
+ | **MMLU (0 shot, CoT)** | 65.9 | 68.7 | 68.2 | 71.2 | **76.6** | 62.0 | 74.6 | 68.5 |
98
+ | **PopQA (15 shot)** | **29.3** | 29.3 | 29.1 | 20.2 | 18.1 | 22.5 | 28.3 | 20.2 |
99
+ | **TruthfulQA (6 shot)** | 46.8 | 56.1 | 55.0 | 55.1 | **63.1** | 57.0 | 61.4 | 55.5 |
100
+ | **BigBenchHard (3 shot, CoT)** | **67.9** | 65.8 | 66.0 | 62.8 | 21.7 | 0.9 | 2.5 | 56.2 |
101
+ | **DROP (3 shot)** | 61.3 | 62.5 | **62.6** | 61.5 | 54.4 | 49.4 | 58.8 | 56.2 |
102
+ | **MATH (4 shot CoT, Flex)** | 31.5 | 42.0 | **43.7** | 42.5 | 14.8 | 5.1 | 29.8 | 40.0 |
103
+ | **GSM8K (8 shot, CoT)** | 76.2 | 84.3 | **87.6** | 83.4 | 83.8 | 61.2 | 79.7 | 80.0 |
104
+ | **HumanEval (pass@10)** | 86.2 | 83.9 | 83.9 | 86.3 | **93.1** | 75.4 | 71.7 | 91.0 |
105
+ | **HumanEval+ (pass@10)** | 81.4 | 78.6 | 79.2 | 82.9 | **89.7** | 69.1 | 67.0 | 88.5 |
106
+ | **IFEval (prompt loose)** | 72.8 | 81.1 | **82.4** | 80.6 | 74.7 | 38.8 | 69.9 | 56.4 |
107
+ | **AlpacaEval 2 (LC % win)** | 12.4 | 33.5 | 34.5 | 24.2 | 29.0 | **49.0** | 43.7 | 31.4 |
108
+ | **Safety (6 task avg.)** | **93.1** | 87.2 | 85.5 | 75.2 | 75.0 | 46.4 | 75.5 | 56.2 |
109
+
110
+ | Benchmark (eval)                | Tülu 3 SFT 70B  | Tülu 3 DPO 70B | Tülu 3 70B | Llama 3.1 70B Instruct | Qwen 2.5 72B Instruct | Hermes 3 Llama 3.1 70B | Nemotron Llama 3.1 70B |
111
+ |---------------------------------|-----------------|-----------------|-------------|-------------------------|-----------------------|------------------------|-------------------------|
112
+ | **Avg.** | 72.6 | 75.9 | **76.0** | 73.4 | 71.5 | 68.3 | 65.5 |
113
+ | **MMLU (0 shot, CoT)** | 78.9 | 83.3 | 83.1 | 85.3 | **85.5** | 80.4 | 83.8 |
114
+ | **PopQA (15 shot)** | **48.6** | 46.3 | 46.5 | 46.4 | 30.6 | 48.1 | 36.4 |
115
+ | **TruthfulQA (6 shot)** | 55.7 | 67.9 | 67.6 | 66.8 | **69.9** | 66.5 | 62.6 |
116
+ | **BigBenchHard (3 shot, CoT)** | **82.7** | 81.8 | 82.0 | 73.8 | 67.2 | 82.1 | 0.7 |
117
+ | **DROP (3 shot)** | **77.2** | 74.1 | 74.3 | 77.0 | 34.2 | 73.2 | 68.8 |
118
+ | **MATH (4 shot CoT, Flex)** | 53.7 | 62.3 | 63.0 | 56.4 | **74.3** | 41.9 | 55.0 |
119
+ | **GSM8K (8 shot, CoT)** | 91.1 | 93.5 | 93.5 | **93.7** | 89.5 | 90.0 | 84.7 |
120
+ | **HumanEval (pass@10)** | 92.9 | 92.4 | 92.4 | 93.6 | 94.0 | 89.6 | **94.1** |
121
+ | **HumanEval+ (pass@10)** | 87.3 | 88.4 | 88.0 | 89.5 | **90.8** | 85.9 | 85.5 |
122
+ | **IFEval (prompt loose)** | 82.1 | 82.6 | 83.2 | **88.0** | 87.6 | 76.0 | 79.9 |
123
+ | **AlpacaEval 2 (LC % win)** | 26.3 | 49.6 | 49.8 | 33.4 | 47.7 | 28.4 | **66.1** |
124
+ | **Safety (6 task avg.)** | **94.4** | 89.0 | 88.3 | 76.5 | 87.0 | 57.9 | 69.0 |
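The card does not say how the **Avg.** row is computed. As a quick check, the sketch below takes the twelve Tülu 3 DPO 70B scores from the table above and computes their unweighted mean, which agrees with the reported value for that column; treating Avg. as a plain mean is an inference from the numbers, not something the source states.

```python
from statistics import mean

# Tülu 3 DPO 70B scores, copied from the table above.
dpo_70b_scores = {
    "MMLU": 83.3,
    "PopQA": 46.3,
    "TruthfulQA": 67.9,
    "BigBenchHard": 81.8,
    "DROP": 74.1,
    "MATH": 62.3,
    "GSM8K": 93.5,
    "HumanEval": 92.4,
    "HumanEval+": 88.4,
    "IFEval": 82.6,
    "AlpacaEval 2": 49.6,
    "Safety": 89.0,
}

average = round(mean(dpo_70b_scores.values()), 1)
print(average)  # 75.9, the Avg. reported for this column
```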
125
+
126
+
127
+ ## Hyperparameters
128
+
129
+ DPO:
130
+ - **Learning Rate**: 5.0e-7 (8B), 2.0e-7 (70B)
131
+ - **Learning Rate Schedule**: Linear
132
+ - **Batch Size (effective)**: 32 (8B), 128 (70B)
133
+ - **Max Sequence Length**: 2,048
134
+ - **Epochs**: 1
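The settings above can be written down as a small config, together with what a linear schedule and an effective batch size imply. This is only a sketch: the class and function names are hypothetical, the schedule assumes no warmup and decay to zero (the card only says "Linear"), and the device count in the usage note is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOHyperparameters:
    peak_learning_rate: float
    effective_batch_size: int
    max_sequence_length: int = 2048
    epochs: int = 1


# Values taken from the list above.
DPO_8B = DPOHyperparameters(peak_learning_rate=5e-7, effective_batch_size=32)
DPO_70B = DPOHyperparameters(peak_learning_rate=2e-7, effective_batch_size=128)


def linear_decay_lr(step, total_steps, peak_lr):
    """Linear decay from peak_lr to 0. Assumption: no warmup phase."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining


def gradient_accumulation_steps(effective_batch_size, per_device_batch_size, num_devices):
    """Accumulation steps needed to reach the effective batch size."""
    per_step = per_device_batch_size * num_devices
    if effective_batch_size % per_step != 0:
        raise ValueError("effective batch size must be a multiple of per-step batch")
    return effective_batch_size // per_step
```

For example, with a hypothetical 64 devices at one sequence each, `gradient_accumulation_steps(DPO_70B.effective_batch_size, 1, 64)` gives 2.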
135
+
136
+ ## License and use
137
+
138
+ All Llama 3.1 Tülu 3 models are released under Meta's [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/).
139
+ Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc.
140
+ Tülu 3 is intended for research and educational use.
141
+ For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).
142
+
143
+ The models have been fine-tuned using a dataset mix with outputs generated from third party models and are subject to additional terms:
144
+ [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and [Qwen License Agreement](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE) (models were improved using Qwen 2.5).
145
+
146
+
147
+ ## Citation
148
+
149
+ If Tülu 3 or any of the related materials were helpful to your work, please cite:
150
+ ```
151
+ @article{lambert2024tulu3,
152
+ title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
153
+ author = {
154
+ Nathan Lambert and
155
+ Jacob Morrison and
156
+ Valentina Pyatkin and
157
+ Shengyi Huang and
158
+ Hamish Ivison and
159
+ Faeze Brahman and
160
+ Lester James V. Miranda and
161
+ Alisa Liu and
162
+ Nouha Dziri and
163
+ Shane Lyu and
164
+ Yuling Gu and
165
+ Saumya Malik and
166
+ Victoria Graf and
167
+ Jena D. Hwang and
168
+ Jiangjiang Yang and
169
+ Ronan Le Bras and
170
+ Oyvind Tafjord and
171
+ Chris Wilhelm and
172
+ Luca Soldaini and
173
+ Noah A. Smith and
174
+ Yizhong Wang and
175
+ Pradeep Dasigi and
176
+ Hannaneh Hajishirzi
177
+ },
178
+ year = {2024},
179
+ email = {tulu@allenai.org}
180
+ }
181
+ ```