Jeronymous committed
Commit 2c5a377 · verified · 1 Parent(s): 8fea09c

Update README.md with info about context extension pretraining phase

Files changed (1):
  1. README.md +62 -22
README.md CHANGED
@@ -36,11 +36,17 @@ https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/tem
  * [Model Description](#model-description)
  <!-- * [Uses](#uses) -->
  * [Example code in python](#example-code-in-python)
+   * [Load the model](#load-the-model)
    * [Sentence completion](#sentence-completion)
    * [Load a checkpoint](#load-a-checkpoint)
  * [Training Details](#training-details)
    * [Training Data](#training-data)
    * [Training Procedure](#training-procedure)
+     * [Neural Network Architecture](#neural-network-architecture)
+     * [Training Hyperparameters](#training-hyperparameters)
+       1. [Main pre-training](#1-main-pre-training)
+       2. [Context Extension](#2-context-extension)
+       3. [Annealing](#3-annealing)
  <!-- * [Evaluation](#evaluation) -->
  * [Acknowledgements](#acknowledgements)
  * [Contact](#contact)
@@ -61,7 +67,7 @@ as well as several programming languages (14.7%).

  ## Example code in python

- ### Sentence completion
+ ### Load the model

  Load the model (quantized version on GPU if possible, for efficient inference):
  ```python
@@ -75,6 +81,7 @@ model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
      load_in_4bit=True # For efficient inference, if quantization is supported by the GPU card
  )
  ```
+ ### Sentence completion

  Wrap the model in a text generation pipeline, and prepare some generation parameters:
  ```
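# A hedged sketch of the pipeline wrapping described above (the actual code is not
# shown in this hunk): the tokenizer line and the generation-parameter values are
# illustrative assumptions, and `model` / `model_name` come from the loading snippet.
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
generation_kwargs = dict(
    max_new_tokens=200,  # assumed cap on generated tokens
    do_sample=True,      # assumed sampling-based decoding
    temperature=0.7,     # assumed
    top_p=0.9,           # assumed
)

# Example sentence completion:
print(pipeline("La capitale de la France est", **generation_kwargs)[0]["generated_text"])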
@@ -121,14 +128,15 @@ every 5000 steps during the first 25000 steps, and then every 25000 steps.
  Intermediate checkpoints can be loaded using the `revision` parameter:
  ```python
  model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
-     revision="step0400000",
+     revision="step0753851",
      ...
  )
  ```
  where `revision` can be one of:
- * ["`step0005000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000), ["`step0010000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000), ["`step0015000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000), ["`step0020000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000): each 5000 steps for the first pre-training steps.
- * ["`step0025000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000), ["`step0050000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000), ["`step0075000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000), ["`step0100000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000), ..., ["`step0750000`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000): each 25000 steps from 25k to 750k steps.
- * ["`step0753851`"](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851): last pre-training step before context extension and annealing.
+ * "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps for the first pre-training steps (with a context length of 4096).
+ * "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
+ * "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
+ * "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).

  ## Training Details

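Combining the quantized-loading call from the earlier hunks with the `revision` parameter, loading one of the new context-extension checkpoints might look like the sketch below; the `device_map` argument is an assumption, and the 4-bit flag is carried over from the loading example rather than prescribed here.

```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"

# Load an intermediate checkpoint from the context-extension phase;
# any `revision` value from the list above can be used the same way,
# e.g. "step0753851" for the last main pre-training step.
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    revision="extension_step0001220",
    device_map="auto",   # assumption: dispatch to GPU when available
    load_in_4bit=True,   # for efficient inference, if quantization is supported by the GPU card
)
```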
@@ -159,43 +167,75 @@ Optimizer checkpoints are available at [OpenLLM-France/Lucie-7B-optimizer-states

  #### Neural Network Architecture

- Lucie-7B has the same neural network architecture as Llama3.
+ Lucie-7B has the same neural network architecture as [Llama3.1](https://huggingface.co/meta-llama/Llama-3.1-8B).
  It has exactly 6 706 958 336 free parameters,
  with the following hyperparameters:
  | **Hyperparameter**          | **Value** |
  |-----------------------------|-----------|
- | Vocabulary size (\# tokens) | 65 024 |
- | ROPE theta                  | 500 000 |
- | \# transformer blocks       | 32 |
- | \# attention heads          | 32 |
- | \# key-value heads          | 8 |
- | Hidden size                 | 4 096 |
- | Feed-Forward hidden size    | 12 288 |
- | Activation                  | `silu` |
- | RMS norm epsilon            | 1e-5 |
+ | Vocabulary size (\# tokens) | 65 024 |
+ | \# transformer blocks       | 32 |
+ | \# attention heads          | 32 |
+ | \# key-value heads          | 8 |
+ | Hidden size                 | 4 096 |
+ | Feed-Forward hidden size    | 12 288 |
+ | Activation                  | `silu` |
+ | RMS norm epsilon            | 1e-5 |
+
+ The parameter "theta" of Rotary Positional Embedding (RoPE) varied during the training process
+ and is indicated in the tables of training hyperparameters below.

  #### Training Hyperparameters

+ The training consisted of three main phases:
+ 1. Main pre-training on 3.1T tokens, with a context length of 4096,
+ 2. Context extension on 5B tokens, with a context length of 32000,
+ 3. Annealing on a selected subset of the training data of especially high quality.
+
+ The details of each phase are given below.
+
+ ##### 1. Main pre-training
+
  Training hyperparameters in torch/Megatron-DeepSpeed were the following:
  | **Hyperparameter**     | **Value**  |
  |------------------------|------------|
- | Optimizer              | `AdamW`    |
- | Precision              | `bfloat16` |
- | Initial batch size     | 256        |
- | Final batch size       | 1024       |
+ | Total \# samples       | 762 144 586 (3.1T tokens) |
+ | Total \# steps         | 753 851    |
+ | RoPE theta             | 500 000    |
+ | Context length         | 4 096      |
+ | Initial batch size     | 256        |
+ | Final batch size       | 1 024      |
  | Batch size rampup      | by steps of 64 over 10M samples |
- | Context length         | 4096       |
- | Learning rate schedule | warmup + cosine annealing |
+ | Learning rate schedule | warmup (2M samples) + cosine annealing |
  | Maximum Learning rate  | 3e-4       |
  | Final Learning rate    | 3e-5       |
  | Weight decay           | 0.1        |
  | Dropout                | _          |
  | Gradient clipping      | 1          |
- | Initializer range      | 0.2        |
+ | Initializer range      | 0.009      |
+ | Optimizer              | `AdamW` (β₁=0.9, β₂=0.95, ε=1e-5) |
+ | Precision              | `bfloat16` |
  | Tensor Parallelism (with 512 GPUs)   | 4 |
  | Pipeline Parallelism (with 512 GPUs) | 4 |
  | Data Parallelism (with 512 GPUs)     | 32 |

+ ##### 2. Context Extension
+
+ Training hyperparameters are the same as above, with the following changes:
+ | **Hyperparameter**     | **Value**  |
+ |------------------------|------------|
+ | Total \# samples       | 156 250 (5B tokens) |
+ | Total \# steps         | 1 220      |
+ | RoPE theta             | 20 000 000 |
+ | Context length         | 32 000     |
+ | Batch size             | 128        |
+ | Learning rate          | 2e-5       |
+ | Tensor Parallelism (with 128 GPUs)   | 4 |
+ | Pipeline Parallelism (with 128 GPUs) | 4 |
+ | Data Parallelism (with 128 GPUs)     | 8 |
+
+ ##### 3. Annealing
+
+ TODO

  ## Acknowledgements

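Because RoPE theta and the context length differ between the main pre-training and the context-extension phases (500 000 / 4 096 versus 20 000 000 / 32 000 in the tables above), it can be useful to check which positional-embedding settings a given checkpoint ships with. A minimal sketch, assuming the published configs expose the Llama-style `rope_theta` and `max_position_embeddings` fields:

```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"

# Compare the positional-embedding settings of the last main pre-training
# checkpoint and the last context-extension checkpoint (revision names taken
# from the checkpoint list above).
for revision in ("step0753851", "extension_step0001220"):
    config = transformers.AutoConfig.from_pretrained(model_name, revision=revision)
    print(
        revision,
        getattr(config, "rope_theta", "n/a"),
        getattr(config, "max_position_embeddings", "n/a"),
    )
```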