OpenLLM-France/Lucie-7B

@@ -19,10 +19,12 @@ widget:
       Quelle est la capitale de la France ?
     example_title: Capital cities in French
     group: 1-shot Question Answering
-training_progress:
-  num_steps: 756291
-  num_tokens: 3131736326144
-  context_length: 32000
 ---
 # Model Card for Lucie-7B
@@ -33,7 +35,7 @@ https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/tem
 * [Model Description](#model-description)
 <!-- * [Uses](#uses) -->
-* [Example code in python](#example-code-in-python)
   * [Load the model](#load-the-model)
   * [Sentence completion](#sentence-completion)
   * [Load a checkpoint](#load-a-checkpoint)
@@ -42,11 +44,13 @@ https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/tem
   * [Training Procedure](#training-procedure)
     * [Neural Network Architecture](#neural-network-architecture)
     * [Training Hyperparameters](#training-hyperparameters)
-      1. [Main pre-training](#1-main-pre-training)
       2. [Context Extension](#2-context-extension)
       3. [Annealing](#3-annealing)
-  * [Training logs and learning curves](#training-logs-and-learning-curves)
 <!-- * [Evaluation](#evaluation) -->
 * [Acknowledgements](#acknowledgements)
 * [Contact](#contact)
@@ -64,7 +68,7 @@ Italian (3.8%),
 and parallel data from those languages (2.5%),
 as well as several programming languages (14.7%).
-## Example code in python
 ### Load the model
@@ -82,7 +86,7 @@ model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
 ```
 ### Sentence completion
-Wrap the model in a text generation pipeline, and prepare some generation parameters:
 ```
 pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
@@ -132,8 +136,8 @@ model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
 )
 ```
 where `revision` can be one of:
-* "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": each 5000 steps for the first pre-training steps (with a context length of 4096).
-* "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": each 25000 steps from 25k to 750k steps.
 * "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
 * "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).
@@ -149,7 +153,7 @@ The initial composition of the training data is as follows:
 ![Initial Data Composition](figures/fig_dataset_composition.png)
-Some of the data was upsampled to balance the training data distribution, and the final composition is as follows:
 ![Training Data Composition](figures/fig_dataset_composition_training.png)
@@ -157,7 +161,7 @@ Some of the data was upsampled to balance the training data distribution, and th
 Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
-It was pre-trained on 512 H100 80GB GPUs for about 550\,000 GPU hours on [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).
 The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
 It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).
@@ -180,21 +184,21 @@ with the following hyperparameters:
 | Activation                |  `silu` |
 | RMS norm epsilon          |    1e-5 |
-The parameter "theta" of Rotary Positional Embedding (RoPE) varied during the training process
-and is indicated in the tables with training hyperparameters below.
 #### Training Hyperparameters
 The training consisted of three main phases:
 1. Main pre-training on 3.1T tokens, with a context length of 4096,
 2. Context extension on 5B tokens, with a context length of 32000,
-3. Annealing, with a selected subset of the training data with especially high quality.
 The details of each phase are given below.
-##### 1. Main pre-training
-Training hyperparameters in torch/Megatron-DeepSpeed were the following:
 | **Hyperparameter**     | **Value**  |
 |------------------------|------------|
 | Total \# samples| 762 144 586 (3.1T tokens) |
@@ -236,9 +240,7 @@ Training hyperparameters are the same as above, with the following changes:
 TODO
-### Training logs and learning curves
-🚧 work in progress 🚧
 Training logs can be found in Tensorboard format in:
 * [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
@@ -246,6 +248,24 @@ Training logs can be found in Tensorboard format in:
 in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
 <br> └── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension) folder containing the training log for the context extension phase, which was done in a single job of around 13H of training (parallelized over 128 GPUs).
 ## Acknowledgements
 This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).
@@ -257,10 +277,22 @@ Julie Hunter (LINAGORA),
 Jean-Pierre Lorré (LINAGORA),
 Jérôme Louradour (LINAGORA),
 Michel-Marie Maudet (LINAGORA),
-Olivier Gouvert (LINAGORA),
-Pierre-Carl Langlais (OpSci),
 Yaya Sy (LORIA).
 ## Contact
 [email protected]

       Quelle est la capitale de la France ?
     example_title: Capital cities in French
     group: 1-shot Question Answering
+# inference:
+#     parameters:
+#         temperature: 1.0
+#         top_p: 1.0
+#         top_k: null
+#         max_new_tokens: null
 ---
 # Model Card for Lucie-7B
 * [Model Description](#model-description)
 <!-- * [Uses](#uses) -->
+* [Example Code in Python](#example-code-in-python)
   * [Load the model](#load-the-model)
   * [Sentence completion](#sentence-completion)
   * [Load a checkpoint](#load-a-checkpoint)
   * [Training Procedure](#training-procedure)
     * [Neural Network Architecture](#neural-network-architecture)
     * [Training Hyperparameters](#training-hyperparameters)
+      1. [Main Pre-training](#1-main-pre-training)
       2. [Context Extension](#2-context-extension)
       3. [Annealing](#3-annealing)
+  * [Training Logs and Learning Curves](#training-logs-and-learning-curves)
 <!-- * [Evaluation](#evaluation) -->
+* [Disclaimer](#disclaimer)
+* [Citation](#citation)
 * [Acknowledgements](#acknowledgements)
 * [Contact](#contact)
 and parallel data from those languages (2.5%),
 as well as several programming languages (14.7%).
+## Example Code in Python
 ### Load the model
 ```
 ### Sentence completion
+Wrap the model in a text generation pipeline, and specify some generation parameters:
 ```
 pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
 )
 ```
 where `revision` can be one of:
+* "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps for the first pre-training steps (with a context length of 4096).
+* "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
 * "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context extension and annealing.
 * "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context extension (with a context length of 32000).
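The revision names listed above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch, not part of the original model card: the helper names `list_revisions` and `load_checkpoint` are hypothetical, and the sketch assumes that every checkpoint implied by the "every 5000 steps" and "every 25000 steps" descriptions is actually published as a branch. Other revisions not covered by this list may also exist.

```python
MODEL_NAME = "OpenLLM-France/Lucie-7B"


def list_revisions():
    """Build the revision names described in the list above (hypothetical helper)."""
    # Every 5000 steps for the first pre-training steps (context length 4096).
    early = [f"step{n:07d}" for n in range(5000, 20001, 5000)]
    # Every 25000 steps from 25k to 750k steps.
    main = [f"step{n:07d}" for n in range(25000, 750001, 25000)]
    # Last pre-training step before context extension and annealing.
    last = ["step0753851"]
    # Checkpoints saved during context extension (context length 32000).
    extension = [f"extension_step{n:07d}" for n in (250, 500, 750, 1000, 1220)]
    return early + main + last + extension


def load_checkpoint(revision):
    """Load the model weights stored at a given revision (downloads from the Hub)."""
    # Imported lazily so that listing revisions does not require transformers.
    import transformers

    return transformers.AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, revision=revision
    )


if __name__ == "__main__":
    for name in list_revisions():
        print(name)
```

Calling `load_checkpoint("step0400000")`, for example, would fetch the weights saved at step 400000, provided that branch exists.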
 ![Initial Data Composition](figures/fig_dataset_composition.png)
+Some of the data was upsampled to balance the training data distribution, yielding the following composition for training:
 ![Training Data Composition](figures/fig_dataset_composition_training.png)
 Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
+It was pre-trained on 512 H100 80GB GPUs for about 550\,000 GPU hours on the [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).
 The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
 It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).
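To give a sense of scale, the GPU-hour figure above can be converted into approximate wall-clock time. This is a rough sketch, not a figure from the model card: it assumes all 512 GPUs were in use for the whole budget, which ignores queueing, restarts, and phases run on fewer GPUs. The function name `wall_clock_hours` is hypothetical.

```python
def wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Convert a GPU-hour budget into wall-clock hours, assuming full parallel use."""
    return gpu_hours / num_gpus


if __name__ == "__main__":
    hours = wall_clock_hours(550_000, 512)
    days = hours / 24
    print(f"{hours:.0f} hours of wall-clock time, about {days:.0f} days")
```

Under that assumption, the budget corresponds to roughly 1,074 hours of continuous training, or about 45 days.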
 | Activation                |  `silu` |
 | RMS norm epsilon          |    1e-5 |
+The "theta" parameter of Rotary Positional Embedding (RoPE) was increased during the training process. Its values are indicated in the tables with training hyperparameters below.
 #### Training Hyperparameters
 The training consisted of three main phases:
 1. Main pre-training on 3.1T tokens, with a context length of 4096,
 2. Context extension on 5B tokens, with a context length of 32000,
+3. Annealing on 5B tokens of high-quality data composed of a mixture of new data and data seen during training.
+<!-- perhaps cite the dataset for annealing  -->
 The details of each phase are given below.
+##### 1. Main Pre-training
+Training hyperparameters in torch/Megatron-DeepSpeed were as follows:
 | **Hyperparameter**     | **Value**  |
 |------------------------|------------|
 | Total \# samples| 762 144 586 (3.1T tokens) |
 TODO
+### Training Logs and Learning Curves
 Training logs can be found in Tensorboard format in:
 * [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
 in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
 <br> └── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension) folder containing the training log for the context extension phase, which was done in a single job of around 13H of training (parallelized over 128 GPUs).
+🚧 TODO: Plot convergence curve (and link CSV?) 🚧
+Evaluation results on benchmark datasets of checkpoints of Lucie-7B throughout the training process are available at
+[metadata/evaluation_learning_curve_lucie.csv](metadata/evaluation_learning_curve_lucie.csv).
+Evaluation results of baseline models on the same benchmark datasets are available at
+[metadata/evaluation_baselines.csv](metadata/evaluation_baselines.csv).
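Since the column layout of these CSV files is not described here, a first step is to inspect them before plotting anything. The sketch below is an illustration only: `summarize_csv` is a hypothetical helper that makes no assumption about column names and simply reports the header and the number of data rows.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # first row is assumed to be the header
        num_rows = sum(1 for _ in reader)
    return header, num_rows


if __name__ == "__main__":
    # Assumes the file has already been downloaded next to this script.
    columns, rows = summarize_csv("evaluation_learning_curve_lucie.csv")
    print(f"{rows} rows, columns: {columns}")
```

One way to obtain a local copy is `huggingface_hub.hf_hub_download(repo_id="OpenLLM-France/Lucie-7B", filename="metadata/evaluation_learning_curve_lucie.csv")`, which returns the path of the downloaded file.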
+🚧 TODO: Plot learning curves 🚧
+## Disclaimer
+Lucie-7B is a language model trained solely to predict the most probable next word in a sequence. Despite efforts to filter the [Lucie Training Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset), it is possible that Lucie-7B encountered strings containing toxic or offensive language during its training and, as a result, it may generate similarly toxic or offensive text. To limit such behavior, we advise fine-tuning Lucie-7B through instruction and/or preference tuning (DPO, RLHF, etc.).
+## Citation
+TODO
 ## Acknowledgements
 This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).
 Jean-Pierre Lorré (LINAGORA),
 Jérôme Louradour (LINAGORA),
 Michel-Marie Maudet (LINAGORA),
+Olivier Gouvert (LINAGORA), and
 Yaya Sy (LORIA).
+We thank
+Anastasia Stasenko (OpSci/Pleias),
+Clément Bénesse (OpSci),
+Guokan Shang (MBZUAI),
+Ismaïl Harrando (LINAGORA),
+Joël Gombin (OpSci),
+Jordan Ricker (OpSci),
+Olivier Ferret (CEA),
+Pierre-Carl Langlais (OpSci/Pleias),
+and
+Rachel Bawden (INRIA),
+for their helpful input.
 ## Contact
 [email protected]