---
license: apache-2.0
pipeline_tag: text-generation
language:
- fr
- en
- it
- de
- es
tags:
- pretrained
- llama-3
- openllm-france
datasets:
  - OpenLLM-France/Lucie-Training-Dataset
widget:
  - text: |-
      Quelle est la capitale de l'Espagne ? Madrid.
      Quelle est la capitale de la France ?
    example_title: Capital cities in French
    group: 1-shot Question Answering
training_progress:
  num_steps: 756291
  num_tokens: 3131736326144
  context_length: 32000
---

# Model Card for Lucie-7B

<!-- inspired from the following template:
https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1
-->

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Example Code in Python](#example-code-in-python)
  * [Load the model](#load-the-model)
  * [Sentence completion](#sentence-completion)
  * [Load a checkpoint](#load-a-checkpoint)
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Training Procedure](#training-procedure)
    * [Neural Network Architecture](#neural-network-architecture)
    * [Training Hyperparameters](#training-hyperparameters)
      1. [Main Pre-training](#1-main-pre-training)
      2. [Context Length Extension](#2-context-length-extension)
      3. [Annealing](#3-annealing)
  * [Training Logs and Learning Curves](#training-logs-and-learning-curves)
<!-- * [Evaluation](#evaluation) -->
* [Disclaimer](#disclaimer)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B is a pretrained 7B parameter causal language model built by [LINAGORA](https://labs.linagora.com/) and [OpenLLM-France](https://github.com/OpenLLM-France). While Lucie-7B is released under an [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), we are still in discussion with the copyright holder of one subcorpus of the Lucie-Training-Dataset and so ask that usage of the model be limited to non-commercial purposes until the details of this discussion are finalized.

Lucie-7B was trained on 3 trillion tokens of multilingual data, including
English (33.2%),
French (32.4%),
German (6.9%),
Spanish (6.6%),
Italian (3.8%),
and parallel data from those languages (2.5%),
as well as several programming languages (14.7%).

## Example Code in Python

### Load the model

Load the model (quantized version on GPU if possible, for efficient inference):
```python
import transformers

model_name = "OpenLLM-France/Lucie-7B"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    device_map="auto",
    load_in_4bit=True       # For efficient inference, if quantization is supported by the GPU card
)
```
### Sentence completion

Wrap the model in a text generation pipeline, and specify some generation parameters:
```python
pipeline = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

generation_kwargs = dict(
    num_return_sequences=1,               # Number of variants to generate.
    return_full_text=False,               # Do not include the prompt in the generated text.
    do_sample=True,
    temperature=1.0, top_p=1, top_k=None, # Sampling parameters.
    max_new_tokens=200,                   # Maximum length for the output text (in number of tokens).
)
```

Try 1-shot question answering:
```python
prompt = """\
Quelle est la capitale de l'Espagne ? Madrid\n\
Quelle est la capitale de la France ?\
"""
completions = pipeline(prompt, **generation_kwargs)
for completion in completions:
    print(prompt + " […]" + completion['generated_text'])
```
This will print something like:
```
Quelle est la capitale de l'Espagne ? Madrid
Quelle est la capitale de la France ? […] Paris
Quelle est la capitale de l'Italie? Rome
Quelle est la capitale de la Grande-Bretagne? Londres
Quelle est la capitale de la Suisse? Berne
Quelle est la capitale du Portugal? Lisbonne
Quelle est la capitale de l'Algérie? Alger
...
```

If running on GPU (`cuda` device), you will need at least 6GB of VRAM to run inference using 4-bit quantization (16GB of VRAM without quantization).
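
As a rough cross-check of these figures, the sketch below estimates the memory taken by the weights alone, using the parameter count given in the Neural Network Architecture section. It assumes 16-bit weights for the unquantized case and ignores activations, the KV cache and quantization metadata, so actual usage will be higher than these numbers.

```python
# Weights-only memory estimate (assumption: nothing but the weights is counted).
NUM_PARAMS = 6_706_958_336  # parameter count stated in this model card


def weights_gigabytes(num_params: int, bits_per_param: int) -> float:
    """Return the size of the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


four_bit = weights_gigabytes(NUM_PARAMS, 4)      # 4-bit quantized weights
sixteen_bit = weights_gigabytes(NUM_PARAMS, 16)  # 16-bit (e.g. bfloat16) weights
print(f"4-bit: {four_bit:.1f} GB, 16-bit: {sixteen_bit:.1f} GB")
```

The gap between these estimates and the 6GB / 16GB recommendations leaves room for the runtime overheads that the sketch leaves out.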

### Load a checkpoint

Checkpoints at several training steps are available under revision tags,
every 5000 steps during the first 25000 steps, and then every 25000 steps.

Intermediate checkpoints can be loaded using the `revision` parameter:
```python
model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
    revision="step0753851",
    # ... plus any other arguments used when loading the main model
)
```
where `revision` can be one of:
* "[`step0005000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0005000)", "[`step0010000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0010000)", "[`step0015000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0015000)", "[`step0020000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0020000)": every 5000 steps for the first pre-training steps (with a context length of 4096).
* "[`step0025000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0025000)", "[`step0050000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0050000)", "[`step0075000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0075000)", "[`step0100000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0100000)", ..., "[`step0750000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0750000)": every 25000 steps from 25k to 750k steps.
* "[`step0753851`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/step0753851)": last pre-training step before context length extension and annealing.
* "[`extension_step0000250`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000250)", "[`extension_step0000500`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000500)", "[`extension_step0000750`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0000750)", "[`extension_step0001000`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001000)", "[`extension_step0001220`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/extension_step0001220)": several checkpoints during context length extension (with a context length of 32000).
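
The naming scheme above can be enumerated programmatically. The helper below is a hypothetical convenience function, not part of any library; it assumes that every multiple of 25000 between 25k and 750k steps has a tag, which the "..." in the list suggests but does not spell out.

```python
def lucie_revision_tags() -> list[str]:
    """Build the checkpoint revision names described in this model card."""
    # Every 5000 steps before step 25000.
    tags = [f"step{s:07d}" for s in range(5_000, 25_000, 5_000)]
    # Every 25000 steps from 25k to 750k (assumed to be complete).
    tags += [f"step{s:07d}" for s in range(25_000, 750_001, 25_000)]
    # Last main pre-training step.
    tags.append("step0753851")
    # Checkpoints saved during context length extension.
    tags += [f"extension_step{s:07d}" for s in (250, 500, 750, 1000, 1220)]
    return tags


tags = lucie_revision_tags()
print(len(tags), tags[0], tags[-1])
```

Any of the returned strings can then be passed as the `revision` argument shown above.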

## Training Details

### Training Data

The training dataset used for the pretraining of Lucie-7B is available
at [OpenLLM-France/Lucie-Training-Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset).
<!-- and described in ["The Lucie Training Dataset" (2024/12)](https://arxiv.org/abs/xxxx.xxxxx). -->

The initial composition of the training data is as follows:

![Initial Data Composition](figures/pie_dataset_composition.png)

Some of the data was upsampled to balance the training data distribution, yielding the following composition for training:

![Training Data Composition](figures/pie_dataset_composition_training.png)

### Training Procedure 

Lucie-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).

It was pre-trained on 512 H100 80GB GPUs for about 550 000 GPU hours on the [Jean Zay supercomputer](http://www.idris.fr/eng/jean-zay/jean-zay-presentation-eng.html).

The training code is available at [https://github.com/OpenLLM-France/Lucie-Training](https://github.com/OpenLLM-France/Lucie-Training).
It is based on [this fork of Megatron-DeepSpeed](https://github.com/OpenLLM-France/Megatron-DeepSpeed).

Optimizer checkpoints are available at [OpenLLM-France/Lucie-7B-optimizer-states](https://huggingface.co/OpenLLM-France/Lucie-7B-optimizer-states).

#### Neural Network Architecture

Lucie-7B has the same neural network architecture as [Llama3.1](https://huggingface.co/meta-llama/Llama-3.1-8B).
It has exactly 6 706 958 336 free parameters,
with the following hyperparameters:

| **Hyperparameter**        | **Value** |
|---------------------------|---------|
| Vocabulary size (\# tokens)| 65 024 |
| \# transformer blocks     |      32 |
| \# attention heads        |      32 |
| \# key-value heads        |       8 |
| Hidden size               |   4 096 |
| Feed-Forward hidden size  |  12 288 |
| Activation                |  `silu` |
| RMS norm epsilon          |    1e-5 |
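
The parameter count can be reproduced from the table above. The sketch below assumes the usual Llama-style layout: no bias terms, a gated feed-forward block with three weight matrices, two RMS norms per transformer block plus a final one, and separate (untied) input embedding and output projection matrices. The function name is illustrative only.

```python
def llama_style_param_count(vocab_size: int, num_layers: int, num_heads: int,
                            num_kv_heads: int, hidden_size: int, ffn_size: int) -> int:
    """Count parameters of a Llama-style decoder (assumed layout, no biases)."""
    head_dim = hidden_size // num_heads
    kv_size = num_kv_heads * head_dim
    # Query and output projections are hidden x hidden; key and value
    # projections are hidden x kv_size (grouped-query attention).
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_size
    # Gated feed-forward: gate, up and down projections.
    mlp = 3 * hidden_size * ffn_size
    # Two RMS norm weight vectors per transformer block.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms
    # Input embeddings and output projection, assumed untied.
    embeddings = 2 * vocab_size * hidden_size
    final_norm = hidden_size
    return num_layers * per_layer + embeddings + final_norm


total = llama_style_param_count(65_024, 32, 32, 8, 4_096, 12_288)
print(f"{total:,}")  # 6,706,958,336
```

Under these assumptions the result matches the parameter count stated above.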

The "theta" parameter of Rotary Positional Embedding (RoPE) was increased during the training process. Its values are indicated in the tables with training hyperparameters below.
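
To give a feel for what raising theta does, the sketch below computes the wavelength of each rotary frequency pair under the standard RoPE formulation, where pair `i` rotates at frequency `theta ** (-2 * i / head_dim)`. It uses the two theta values listed in the training hyperparameter tables of this card and assumes that no additional frequency scaling is applied, which this card does not specify.

```python
import math

HEAD_DIM = 4096 // 32  # hidden size / number of attention heads


def rope_wavelengths(theta: float, head_dim: int = HEAD_DIM) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


for theta in (500_000, 20_000_000):
    longest = rope_wavelengths(theta)[-1]
    print(f"theta={theta:>10,}: longest wavelength ~ {longest:,.0f} positions")
```

A larger theta stretches the slowest rotations over more positions, which is the usual motivation for increasing it before training on longer contexts.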

#### Training Hyperparameters

The training consisted of three main phases:
1. Main pre-training on 3.1T tokens, with a context length of 4096,
2. Context extension on 5B tokens, with a context length of 32000,
3. Annealing on 5B tokens of high quality data composed of a mixture of new data and data seen during training.
<!-- perhaps cite the dataset for annealing  -->

The details of each phase are given below.

##### 1. Main Pre-training

Training hyperparameters in torch/Megatron-DeepSpeed were as follows:

| **Hyperparameter**     | **Value**  |
|------------------------|------------|
| Total \# samples| 762 144 586 (3.1T tokens) |
| Total \# steps  | 753 851    |
| RoPE theta             | 500 000    |
| Context length         | 4 096      |
| Initial Batch size     | 256        |
| Final Batch size       | 1 024      |
| Batch size rampup      | by steps of 64 over 10M samples |
| Learning rate schedule | warmup (2M samples) + cosine annealing |
| Maximum Learning rate  | 3e-4       |
| Final Learning rate    | 3e-5       |
| Weight decay           | 0.1        |
| Dropout                | _          |
| Gradient clipping      | 1          |
| Initializer range      | 0.009        |
| Optimizer              | `AdamW` (β₁=0.9, β₂=0.95, ε=1e-5)    |
| Precision              | `bfloat16` |
| Tensor Parallelism (with 512 GPUs)   | 4           |
| Pipeline Parallelism (with 512 GPUs) | 4           |
| Data Parallelism (with 512 GPUs)     | 32          |
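
The batch size ramp-up row can be read as follows. This is a minimal sketch of one plausible interpretation, in which the batch size grows from 256 to 1024 in twelve equal increments of 64 spread evenly over the first 10M samples; the exact rule used by the training code may differ, and the function name is illustrative.

```python
def batch_size_at(consumed_samples: int, start: int = 256, final: int = 1024,
                  increment: int = 64, rampup_samples: int = 10_000_000) -> int:
    """Batch size after `consumed_samples`, assuming evenly spaced increments."""
    num_increments = (final - start) // increment  # 12 increments of 64
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    increments_done = consumed_samples * num_increments // rampup_samples
    return min(final, start + increments_done * increment)


for samples in (0, 5_000_000, 10_000_000):
    print(samples, batch_size_at(samples))
```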

##### 2. Context Length Extension

Training hyperparameters are the same as above, with the following changes:

| **Hyperparameter**     | **Value**  |
|------------------------|------------|
| Total \# samples| 156 250 (5B tokens) |
| Total \# steps  | 1 220      |
| RoPE theta             | 20 000 000 |
| Context length         | 32 000     |
| Batch size             | 128        |
| Learning rate          | 2e-5       |
| Learning rate schedule | constant   |
| Tensor Parallelism (with 128 GPUs)   | 4     |
| Pipeline Parallelism (with 128 GPUs) | 4     |
| Data Parallelism (with 128 GPUs)     | 8     |
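
The figures in the tables for the first two phases can be cross-checked with a little arithmetic. The sketch below assumes that each sample is one full sequence of the stated context length, and that the GPU count is the product of the tensor, pipeline and data parallelism degrees.

```python
# Values copied from the hyperparameter tables of the first two phases.
phases = {
    "main pre-training": dict(samples=762_144_586, context=4_096, tp=4, pp=4, dp=32),
    "context extension": dict(samples=156_250, context=32_000, tp=4, pp=4, dp=8),
}


def summarize(phase: dict) -> tuple[int, int]:
    """Return (number of tokens, number of GPUs) implied by a phase's settings."""
    tokens = phase["samples"] * phase["context"]
    gpus = phase["tp"] * phase["pp"] * phase["dp"]
    return tokens, gpus


for name, phase in phases.items():
    tokens, gpus = summarize(phase)
    print(f"{name}: {tokens / 1e12:.3f}T tokens on {gpus} GPUs")
```

Both results agree with the rounded token counts (3.1T and 5B) and the GPU counts (512 and 128) quoted in the tables.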

##### 3. Annealing

Training hyperparameters are the same as for context length extension, with the following changes:

| **Hyperparameter**     | **Value**  |
|------------------------|------------|
| Total \# samples| 156 250 (5B tokens) |
| Total \# steps  | 1 220      |
| Learning rate schedule | linear annealing |
| Maximum Learning rate  | 3e-5       |
| Final Learning rate    | 0          |
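
The two non-constant learning rate schedules named in the tables can be sketched as below. This is an illustration under stated assumptions, not the training code: the warmup is taken to be linear from zero, the cosine decay is taken to span all samples remaining after warmup, and the function names are made up for this example.

```python
import math


def main_phase_lr(consumed_samples: int, total_samples: int = 762_144_586,
                  warmup_samples: int = 2_000_000,
                  max_lr: float = 3e-4, min_lr: float = 3e-5) -> float:
    """Linear warmup (assumed to start at 0), then cosine decay to min_lr."""
    if consumed_samples < warmup_samples:
        return max_lr * consumed_samples / warmup_samples
    progress = min(1.0, (consumed_samples - warmup_samples)
                   / (total_samples - warmup_samples))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def annealing_phase_lr(step: int, total_steps: int = 1_220,
                       max_lr: float = 3e-5, final_lr: float = 0.0) -> float:
    """Linear decay from max_lr to final_lr over the annealing phase."""
    progress = min(1.0, step / total_steps)
    return max_lr + (final_lr - max_lr) * progress


print(main_phase_lr(2_000_000), main_phase_lr(762_144_586))
print(annealing_phase_lr(0), annealing_phase_lr(1_220))
```

Note that the final learning rate of the main phase (3e-5) is also the starting point of the annealing phase, with the constant 2e-5 of context length extension in between.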

### Training Logs and Learning Curves

#### Training loss

Training logs can be found in Tensorboard format in:
* [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
<br> ├── [`1_pretraining.zip`](metadata/training_logs/1_pretraining.zip) training logs for the main pre-training phase,
in a zip file. Each file in the zip corresponds to a job of at most 20H of training (parallelized over 512 GPUs).
<br> ├── [`2_extension/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/2_extension) folder containing the training log for the context length extension phase,
<br> └── [`3_annealing/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs/3_annealing) folder containing the training log for the annealing phase, which also took around 13H of training (parallelized over 128 GPUs).

The convergence curves of the three pre-training phases are the following:

![figures/convergence-curve-pretraining.png](figures/convergence-curve-pretraining.png)

Data corresponding to these plots were extracted from tensorboard logs and are available in the following CSV files:
* [`metadata/training_logs/`](https://huggingface.co/OpenLLM-France/Lucie-7B/tree/main/metadata/training_logs)
<br> ├── [`1_pretraining.csv`](metadata/training_logs/1_pretraining.csv)
<br> ├── [`2_extension.csv`](metadata/training_logs/2_extension.csv)
<br> └── [`3_annealing.csv`](metadata/training_logs/3_annealing.csv)

#### Evaluations

Multiple evaluations were conducted during Lucie-7B's training to assess its performance on standard benchmarks,
primarily in French and English, as well as in Spanish, German, and Italian.

Evaluation results on benchmark datasets of checkpoints of Lucie-7B throughout the training process are available at
[metadata/evaluation_learning_curve_lucie.csv](metadata/evaluation_learning_curve_lucie.csv).
Evaluation results of baseline models on the same benchmark datasets are available at
[metadata/evaluation_baselines.csv](metadata/evaluation_baselines.csv).

Main results are summarized in the following figures:

##### French
![figures/learning-curve-evaluation-french-bench.png](figures/learning-curve-evaluation-french-bench.png)

##### English
![figures/learning-curve-evaluation-benchmarks-in-english.png](figures/learning-curve-evaluation-benchmarks-in-english.png)

##### Other
![figures/learning-curve-evaluation-multilingual-arc-benchmark.png](figures/learning-curve-evaluation-multilingual-arc-benchmark.png)

#### Needle in a Haystack

##### Pretraining
![figures/needle-in-a-haystack/Lucie-7B-main.png](figures/needle-in-a-haystack/Lucie-7B-main.png)

##### Context Length Extension
![figures/needle-in-a-haystack/Lucie-7B-extension.png](figures/needle-in-a-haystack/Lucie-7B-extension.png)

##### Annealing
![figures/needle-in-a-haystack/Lucie-7B-annealing.png](figures/needle-in-a-haystack/Lucie-7B-annealing.png)


## Disclaimer

Lucie-7B is a language model trained solely to predict the most probable next word in a sequence. Despite efforts to filter the [Lucie Training Dataset](https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset), Lucie-7B may have encountered toxic or offensive language during training and, as a result, may generate similar text. To limit such behavior, we advise fine-tuning Lucie-7B through instruction and/or preference tuning (DPO, RLHF, etc.).

## Citation

When using the Lucie-7B model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour,
Evan Dufraisse, Yaya Sy, Pierre-Carl Langlais, Anastasia Stasenko,
Laura Rivière, Christophe Cerisara, Jean-Pierre Lorré (2025)
Lucie-7B LLM and its training dataset
```bibtex
@misc{openllm2023claire,
      title={The Lucie-7B LLM and the Lucie Training Dataset:
      open resources for multilingual language generation}, 
      author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Evan Dufraisse and Yaya Sy and Pierre-Carl Langlais and Anastasia Stasenko and Laura Rivière and Christophe Cerisara and Jean-Pierre Lorré},
      year={2025},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI) in particular.

Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including in alphabetical order:
Agustin Martin Picard (IRT),
Thibaut Boissin (IRT),
Christophe Cerisara (LORIA),
Evan Dufraisse (CEA),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Lucas Hervier (IRT),
Michel-Marie Maudet (LINAGORA),
Olivier Gouvert (LINAGORA), and
Yaya Sy (LORIA).

We thank 
Anastasia Stasenko (OpSci/Pleias), 
Clément Bénesse (Opsci), 
Guokan Shang (MBZUAI), 
Ismaïl Harrando (LINAGORA), 
Joël Gombin (Opsci), 
Jordan Ricker (Opsci), 
Julien Tourille (EDF),
Manuel Faysse (ILLUIN Technology),
Olivier Ferret (CEA), 
Pierre-Carl Langlais (OpSci/Pleias), 
and 
Rachel Bawden (INRIA),
for their helpful input.

We also thank the support teams from IDRIS, in particular Myriam Peyrounette and Hatim Bourfoune, and from Hugging Face, in particular Thomas Wolf, Guilherme Penedo, Elie Bakouch, Haojun Zhao, and Lucain Pouget for their technical guidance.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways. 

## Contact

[email protected]