Jeronymous committed
Update README.md with info about context extension pretraining phase

README.md (CHANGED)

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Example code in python](#example-code-in-python)
  * [Load the model](#load-the-model)
  * [Sentence completion](#sentence-completion)
  * [Load a checkpoint](#load-a-checkpoint)
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Training Procedure](#training-procedure)
    * [Neural Network Architecture](#neural-network-architecture)
    * [Training Hyperparameters](#training-hyperparameters)
      1. [Main pre-training](#1-main-pre-training)
      2. [Context Extension](#2-context-extension)
      3. [Annealing](#3-annealing)
<!-- * [Evaluation](#evaluation) -->
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Example code in python

### Load the model

Load the model (quantized version on GPU if possible, for efficient inference):
```python
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    ...
    load_in_4bit=True # For efficient inference, if quantization is supported by the GPU card
)
```
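
For a self-contained version of this loading step, here is a minimal sketch; the repository name matches the checkpoint links elsewhere in this card, while the tokenizer call and `device_map="auto"` are illustrative additions rather than lines from the card:
```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"  # repository name, as in the checkpoint links of this card

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # illustrative: place the weights automatically on the available GPU(s)
    load_in_4bit=True,  # as in the card: efficient inference, if quantization is supported by the GPU
)
```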

### Sentence completion

Wrap the model in a text generation pipeline, and prepare some generation parameters:
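
A minimal sketch of such a wrapper, assuming the `model` and `tokenizer` objects from the loading step above; the generation parameter values are placeholders rather than the card's own settings:
```python
import transformers

pipeline = transformers.pipeline(
    "text-generation",
    model=model,          # model loaded above
    tokenizer=tokenizer,  # assumption: tokenizer loaded from the same repository
)

generation_kwargs = dict(
    max_new_tokens=60,       # placeholder value
    do_sample=True,          # placeholder value
    temperature=0.7,         # placeholder value
    return_full_text=False,  # only return the completion, not the prompt
)

prompt = "Quelle est la capitale de la France ?"  # illustrative prompt
print(pipeline(prompt, **generation_kwargs)[0]["generated_text"])
```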

### Load a checkpoint

Intermediate checkpoints can be loaded using the `revision` parameter:
```python
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    revision="step0753851",
    ...
)
```
where `revision` can be one of:
* "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps during the first pre-training steps (with a context length of 4096).
* "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
* "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
* "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).
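
For instance, a minimal sketch of loading the last context-extension checkpoint listed above (the revision name comes from the list; the tokenizer call and `device_map="auto"` are illustrative additions):
```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"
revision = "extension_step0001220"  # last context-extension checkpoint listed above

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, revision=revision)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=revision,
    device_map="auto",  # illustrative: not part of the card's snippet
)
```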

## Training Details

#### Neural Network Architecture

Lucie-7B has the same neural network architecture as [Llama3.1](https://huggingface.co/meta-llama/Llama-3.1-8B).
It has exactly 6 706 958 336 free parameters,
with the following hyperparameters:

| **Hyperparameter**          | **Value** |
|-----------------------------|-----------|
| Vocabulary size (\# tokens) | 65 024    |
| \# transformer blocks       | 32        |
| \# attention heads          | 32        |
| \# key-value heads          | 8         |
| Hidden size                 | 4 096     |
| Feed-Forward hidden size    | 12 288    |
| Activation                  | `silu`    |
| RMS norm epsilon            | 1e-5      |

The parameter "theta" of Rotary Positional Embedding (RoPE) varied during the training process
and is indicated in the training hyperparameter tables below.
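
As a cross-check, the table above can be written down as a Llama-style `transformers` configuration. This is an illustrative sketch based only on the values listed here (plus the RoPE theta and context length of the main pre-training phase below), not a configuration file shipped with the model:
```python
from transformers import LlamaConfig

# Illustrative configuration mirroring the architecture table above.
config = LlamaConfig(
    vocab_size=65_024,             # Vocabulary size (# tokens)
    num_hidden_layers=32,          # transformer blocks
    num_attention_heads=32,        # attention heads
    num_key_value_heads=8,         # key-value heads (grouped-query attention)
    hidden_size=4_096,             # Hidden size
    intermediate_size=12_288,      # Feed-Forward hidden size
    hidden_act="silu",             # Activation
    rms_norm_eps=1e-5,             # RMS norm epsilon
    rope_theta=500_000,            # RoPE theta during main pre-training (see below)
    max_position_embeddings=4_096, # context length during main pre-training (see below)
)
```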

#### Training Hyperparameters

The training consisted of three main phases:
1. Main pre-training on 3.1T tokens, with a context length of 4096;
2. Context extension on 5B tokens, with a context length of 32000;
3. Annealing on a selected subset of the training data of especially high quality.

The details of each phase are given below.

##### 1. Main pre-training

Training hyperparameters in torch/Megatron-DeepSpeed were the following:

| **Hyperparameter**      | **Value**  |
|-------------------------|------------|
| Total \# samples        | 762 144 586 (3.1T tokens) |
| Total \# steps          | 753 851    |
| RoPE theta              | 500 000    |
| Context length          | 4 096      |
| Initial Batch size      | 256        |
| Final Batch size        | 1 024      |
| Batch size rampup       | by steps of 64 over 10M samples |
| Learning rate schedule  | warmup (2M samples) + cosine annealing |
| Maximum Learning rate   | 3e-4       |
| Final Learning rate     | 3e-5       |
| Weight decay            | 0.1        |
| Dropout                 | _          |
| Gradient clipping       | 1          |
| Initializer range       | 0.009      |
| Optimizer               | `AdamW` (β₁=0.9, β₂=0.95, ε=1e-5) |
| Precision               | `bfloat16` |
| Tensor Parallelism (with 512 GPUs)   | 4  |
| Pipeline Parallelism (with 512 GPUs) | 4  |
| Data Parallelism (with 512 GPUs)     | 32 |
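
The parallelism degrees are consistent with the GPU count: 4 (tensor) × 4 (pipeline) × 32 (data) = 512 GPUs. To make the optimizer and learning rate rows concrete, here is a small PyTorch sketch of AdamW with a warmup-plus-cosine schedule matching the values above; the warmup length in steps is an assumption (the card states the warmup in samples), and the tiny parameter list merely stands in for the real model:
```python
import math
import torch

# Illustrative only: a tiny parameter list stands in for the 6.7B-parameter model.
params = [torch.nn.Parameter(torch.zeros(10))]

# "Optimizer" row: AdamW with β₁=0.9, β₂=0.95, ε=1e-5, and weight decay 0.1.
optimizer = torch.optim.AdamW(params, lr=3e-4, betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)

warmup_steps = 2_000   # assumption: the card gives the warmup length in samples (2M), not steps
total_steps = 753_851  # "Total # steps" row of the table above

def lr_factor(step: int) -> float:
    """Warmup followed by cosine annealing from the maximum LR (3e-4) to the final LR (3e-5)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_ratio = 3e-5 / 3e-4  # final LR as a fraction of the maximum LR
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```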

##### 2. Context Extension

Training hyperparameters are the same as above, with the following changes:

| **Hyperparameter**      | **Value**  |
|-------------------------|------------|
| Total \# samples        | 156 250 (5B tokens) |
| Total \# steps          | 1 220      |
| RoPE theta              | 20 000 000 |
| Context length          | 32 000     |
| Batch size              | 128        |
| Learning rate           | 2e-5       |
| Tensor Parallelism (with 128 GPUs)   | 4 |
| Pipeline Parallelism (with 128 GPUs) | 4 |
| Data Parallelism (with 128 GPUs)     | 8 |
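
As a quick sanity check on these figures (illustrative arithmetic only), the sample count times the context length recovers the 5B-token budget, and the parallelism degrees multiply out to the 128 GPUs:
```python
# Cross-check of the context extension table above.
samples, context_length = 156_250, 32_000
print(f"{samples * context_length / 1e9:.1f}B tokens")  # -> 5.0B tokens

tensor_parallel, pipeline_parallel, data_parallel = 4, 4, 8
print(tensor_parallel * pipeline_parallel * data_parallel, "GPUs")  # -> 128 GPUs
```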

##### 3. Annealing

TODO

## Acknowledgements