Jeronymous committed
Commit 29f25c8 · verified · 1 Parent(s): b94006b

Update README

Files changed (1)
  1. README.md +55 -23
README.md CHANGED
@@ -19,10 +19,12 @@ widget:
19
  Quelle est la capitale de la France ?
20
  example_title: Capital cities in French
21
  group: 1-shot Question Answering
22
- training_progress:
23
- num_steps: 756291
24
- num_tokens: 3131736326144
25
- context_length: 32000
26
  ---
27
 
28
  # Model Card for Lucie-7B
@@ -33,7 +35,7 @@ https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/tem
33
 
34
  * [Model Description](#model-description)
35
  <!-- * [Uses](#uses) -->
36
- * [Example code in python](#example-code-in-python)
37
  * [Load the model](#load-the-model)
38
  * [Sentence completion](#sentence-completion)
39
  * [Load a checkpoint](#load-a-checkpoint)
@@ -42,11 +44,13 @@ https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/tem
42
  * [Training Procedure](#training-procedure)
43
  * [Neural Network Architecture](#neural-network-architecture)
44
  * [Training Hyperparameters](#training-hyperparameters)
45
- 1. [Main pre-training](#1-main-pre-training)
46
  2. [Context Extension](#2-context-extension)
47
  3. [Annealing](#3-annealing)
48
- * [Training logs and learning curves](#training-logs-and-learning-curves)
49
  <!-- * [Evaluation](#evaluation) -->
50
  * [Acknowledgements](#acknowledgements)
51
  * [Contact](#contact)
52
 
@@ -64,7 +68,7 @@ Italian (3.8%),
64
  and parallel data from those languages (2.5%),
65
  as well as several programming languages (14.7%).
66
 
67
- ## Example code in python
68
 
69
  ### Load the model
70
 
@@ -82,7 +86,7 @@ model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
82
  ```
83
  ### Sentence completion
84
 
85
- Wrap the model in a text generation pipeline, and prepare some generation parameters:
86
  ```
87
  pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
88
 
@@ -132,8 +136,8 @@ model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
132
  )
133
  ```
134
  where `revision` can be one of:
135
- * "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": each 5000 steps for the first pre-training steps (with a context length of 4096).
136
- * "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": each 25000 steps from 25k to 750k steps.
137
  * "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
138
  * "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).
139
 
@@ -149,7 +153,7 @@ The initial composition of the training data is as follows:
149
 
150
  ![Initial Data Composition](figures/fig_dataset_composition.png)
151
 
152
- Some of the data was upsampled to balance the training data distribution, and the final composition is as follows:
153
 
154
  ![Training Data Composition](figures/fig_dataset_composition_training.png)
155
 
@@ -157,7 +161,7 @@ Some of the data was upsampled to balance the training data distribution, and th
157
 
158
  Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
159
 
160
- It was pre-trained on 512 H100 80GB GPUs for about 550\,000 GPU hours on [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).
161
 
162
  The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
163
  It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).
@@ -180,21 +184,21 @@ with the following hyperparameters:
180
  | Activation | `silu` |
181
  | RMS norm epsilon | 1e-5 |
182
 
183
- The parameter "theta" of Rotary Positional Embedding (RoPE) varied during the training process
184
- and is indicated in the tables with training hyperparameters below.
185
 
186
  #### Training Hyperparameters
187
 
188
  The training consisted of three main phases:
189
  1. Main pre-training on 3.1T tokens, with a context length of 4096,
190
  2. Context extension on 5B tokens, with a context length of 32000,
191
- 3. Annealing, with a selected subset of the training data with especially high quality.
 
192
 
193
  The details of each phase are given below.
194
 
195
- ##### 1. Main pre-training
196
 
197
- Training hyperparameters in torch/Megatron-DeepSpeed were the following:
198
  | **Hyperparameter** | **Value** |
199
  |------------------------|------------|
200
  | Total \# samples| 762 144 586 (3.1T tokens) |
@@ -236,9 +240,7 @@ Training hyperparameters are the same as above, with the following changes:
236
 
237
  TODO
238
 
239
- ### Training logs and learning curves
240
-
241
- 🚧 work in progress 🚧
242
 
243
  Training logs can be found in Tensorboard format in:
244
  * [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
@@ -246,6 +248,24 @@ Training logs can be found in Tensorboard format in:
246
  in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
247
  <br> └── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension) folder containing the training log for the context extension phase, which was done in a single job of around 13H of training (parallelized over 128 GPUs).
248
 
249
  ## Acknowledgements
250
 
251
  This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).
@@ -257,10 +277,22 @@ Julie Hunter (LINAGORA),
257
  Jean-Pierre Lorré (LINAGORA),
258
  Jérôme Louradour (LINAGORA),
259
  Michel-Marie Maudet (LINAGORA),
260
- Olivier Gouvert (LINAGORA),
261
- Pierre-Carl Langlais (OpSci),
262
  Yaya Sy (LORIA).
263
 
 
264
  ## Contact
265
 
266
 
19
  Quelle est la capitale de la France ?
20
  example_title: Capital cities in French
21
  group: 1-shot Question Answering
22
+ # inference:
23
+ # parameters:
24
+ # temperature: 1.0
25
+ # top_p: 1.0
26
+ # top_k: null
27
+ # max_new_tokens: null
28
  ---
29
 
30
  # Model Card for Lucie-7B
 
35
 
36
  * [Model Description](#model-description)
37
  <!-- * [Uses](#uses) -->
38
+ * [Example Code in Python](#example-code-in-python)
39
  * [Load the model](#load-the-model)
40
  * [Sentence completion](#sentence-completion)
41
  * [Load a checkpoint](#load-a-checkpoint)
 
44
  * [Training Procedure](#training-procedure)
45
  * [Neural Network Architecture](#neural-network-architecture)
46
  * [Training Hyperparameters](#training-hyperparameters)
47
+ 1. [Main Pre-training](#1-main-pre-training)
48
  2. [Context Extension](#2-context-extension)
49
  3. [Annealing](#3-annealing)
50
+ * [Training Logs and Learning Curves](#training-logs-and-learning-curves)
51
  <!-- * [Evaluation](#evaluation) -->
52
+ * [Disclaimer](#disclaimer)
53
+ * [Citation](#citation)
54
  * [Acknowledgements](#acknowledgements)
55
  * [Contact](#contact)
56
 
 
68
  and parallel data from those languages (2.5%),
69
  as well as several programming languages (14.7%).
70
 
71
+ ## Example Code in Python
72
 
73
  ### Load the model
74
 
 
86
  ```
87
  ### Sentence completion
88
 
89
+ Wrap the model in a text generation pipeline, and specify some generation parameters:
90
  ```
91
  pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
92
 
 
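The hunk above stops at the pipeline construction. As a rough sketch of how this section's pipeline might then be called (the generation parameter values and the dtype/device handling below are illustrative assumptions, not recommendations from the model card):

```
# Hedged sketch: exercise the text-generation pipeline built above.
# The generation settings are placeholder values chosen for illustration.
import transformers

model_name = "OpenLLM-France/Lucie-7B"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

outputs = pipeline(
    "Quelle est la capitale de la France ?",  # example prompt from the model card widget
    max_new_tokens=100,      # placeholder value
    temperature=0.7,         # placeholder value
    top_p=0.9,               # placeholder value
    repetition_penalty=1.1,  # placeholder value
)
print(outputs[0]["generated_text"])
```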
136
  )
137
  ```
138
  where `revision` can be one of:
139
+ * "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps for the first pre-training steps (with a context length of 4096).
140
+ * "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
141
  * "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
142
  * "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).
143
 
 
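For instance, a minimal sketch of loading one of the revisions listed above (the `device_map` setting is an illustrative assumption, not a recommendation from the card):

```
# Hedged sketch: load an intermediate checkpoint by passing its name as `revision`.
import transformers

model_name = "OpenLLM-France/Lucie-7B"
checkpoint = "step0753851"  # last pre-training step before context extension and annealing

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, revision=checkpoint)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=checkpoint,
    device_map="auto",  # illustrative; adjust to your hardware
)
```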
153
 
154
  ![Initial Data Composition](figures/fig_dataset_composition.png)
155
 
156
+ Some of the data was upsampled to balance the training data distribution, yielding the following composition for training:
157
 
158
  ![Training Data Composition](figures/fig_dataset_composition_training.png)
159
 
 
161
 
162
  Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
163
 
164
+ It was pre-trained on 512 H100 80GB GPUs for about 550,000 GPU hours on the [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).
165
 
166
  The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
167
  It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).
 
184
  | Activation | `silu` |
185
  | RMS norm epsilon | 1e-5 |
186
 
187
+ The "theta" parameter of Rotary Positional Embedding (RoPE) was increased during the training process. Its values are given in the training hyperparameter tables below.
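For background, a small illustrative sketch (not from the model card; the head dimension and the two theta magnitudes are assumed example values, not necessarily those used for Lucie-7B) of how the RoPE base theta sets the rotary frequencies, and why raising it helps at longer contexts:

```
# Illustrative sketch: RoPE uses frequencies theta ** (-2i / d) for i in [0, d/2).
# A larger theta gives slower-rotating components, the usual lever for context extension.
import torch

def rope_inverse_frequencies(head_dim: int, theta: float) -> torch.Tensor:
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

print(rope_inverse_frequencies(128, 10_000.0)[:4])   # head_dim=128 is an assumed example value
print(rope_inverse_frequencies(128, 500_000.0)[:4])  # larger theta -> lower frequencies
```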
 
188
 
189
  #### Training Hyperparameters
190
 
191
  The training consisted of three main phases:
192
  1. Main pre-training on 3.1T tokens, with a context length of 4096,
193
  2. Context extension on 5B tokens, with a context length of 32000,
194
+ 3. Annealing on 5B tokens of high-quality data composed of a mixture of new data and data seen during training.
195
+ <!-- perhaps cite the dataset for annealing -->
196
 
197
  The details of each phase are given below.
198
 
199
+ ##### 1. Main Pre-training
200
 
201
+ Training hyperparameters in torch/Megatron-DeepSpeed were as follows:
202
  | **Hyperparameter** | **Value** |
203
  |------------------------|------------|
204
  | Total \# samples| 762 144 586 (3.1T tokens) |
 
240
 
241
  TODO
242
 
243
+ ### Training Logs and Learning Curves
 
 
244
 
245
  Training logs can be found in Tensorboard format in:
246
  * [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
 
248
  in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
249
  <br> └── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension) folder containing the training log for the context extension phase, which was done in a single job of around 13H of training (parallelized over 128 GPUs).
250
 
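To browse these logs locally, one possible approach (a sketch under assumptions: the exact layout of `metadata/training_logs/` and its zip archives is only partially described above) is to fetch them with `huggingface_hub` and point TensorBoard at the unzipped directories:

```
# Hedged sketch: download the training logs from the model repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OpenLLM-France/Lucie-7B",
    allow_patterns=["metadata/training_logs/**"],  # assumed glob; adjust if needed
)
print(local_dir)
# Unzip any archives, then from a shell: tensorboard --logdir <local_dir>/metadata/training_logs
```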
251
+ 🚧 TODO: Plot convergence curve (and link CSV?) 🚧
252
+
253
+ Evaluation results on benchmark datasets for checkpoints of Lucie-7B taken throughout the training process are available at
254
+ [metadata/evaluation_learning_curve_lucie.csv](metadata/evaluation_learning_curve_lucie.csv).
255
+ Evaluation results for baseline models on the same benchmark datasets are available at
256
+ [metadata/evaluation_baselines.csv](metadata/evaluation_baselines.csv).
257
+
258
+ 🚧 TODO: Plot learning curves 🚧
259
+
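A possible starting point for the plotting TODO above; the CSV column names used below are assumptions, so inspect the header first:

```
# Hedged sketch: plot a learning curve from the evaluation CSV.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("metadata/evaluation_learning_curve_lucie.csv")
print(df.columns)  # check the real column names before plotting

# Hypothetical column names for illustration:
# df.plot(x="training_steps", y="arc_challenge_acc", marker="o")
# plt.xlabel("Training steps")
# plt.ylabel("Benchmark score")
# plt.show()
```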
260
+ ## Disclaimer
261
+
262
+ Lucie-7B is a language model trained solely to predict the most probable next word in a sequence. Despite efforts to filter the [Lucie Training Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset), it is possible that Lucie-7B encountered strings containing toxic or offensive language during its training and, as a result, it may generate text of similar quality. To limit such behavior, it is advised to fine-tune Lucie-7B through instruction and/or preference tuning (DPO, RLHF, etc.).
263
+
264
+ ## Citation
265
+
266
+ TODO
267
+
268
+
269
  ## Acknowledgements
270
 
271
  This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).
 
277
  Jean-Pierre Lorré (LINAGORA),
278
  Jérôme Louradour (LINAGORA),
279
  Michel-Marie Maudet (LINAGORA),
280
+ Olivier Gouvert (LINAGORA), and
 
281
  Yaya Sy (LORIA).
282
 
283
+ We thank
284
+ Anastasia Stasenko (OpSci/Pleias),
285
+ Clément Bénesse (Opsci),
286
+ Guokan Shang (MBZUAI),
287
+ Ismaïl Harrando (LINAGORA),
288
+ Joël Gombin (Opsci),
289
+ Jordan Ricker (Opsci),
290
+ Olivier Ferret (CEA),
291
+ Pierre-Carl Langlais (OpSci/Pleias),
292
+ and
293
+ Rachel Bawden (INRIA)
294
+ for their helpful input.
295
+
296
  ## Contact
297
 
298