---
license: apache-2.0
datasets:
- CohereForAI/aya_dataset
- argilla/databricks-dolly-15k-curated-multilingual
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- OpenAssistant/oasst1
language:
- fr
- en
- de
- it
- es
base_model:
- OpenLLM-France/Lucie-7B
pipeline_tag: text-generation
---

# Model Card for Lucie-7B-Instruct-human-data

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
    * [Training Data](#training-data)
    * [Preprocessing](#preprocessing)
    * [Instruction template](#instruction-template)
    * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model](#testing-the-model)
    * [Test with ollama](#test-with-ollama)
    * [Test with vLLM](#test-with-vllm)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-human-data is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct-human-data is fine-tuned on human-produced instructions, collected either from open annotation campaigns or by applying templates to existing datasets. Its performance falls below that of [Lucie-7B-Instruct](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct); the purpose of the model is to show what can be achieved when fine-tuning an LLM to follow instructions without relying on third-party LLMs.

While Lucie-7B-Instruct-human-data is trained on sequences of 4096 tokens, its base model, Lucie-7B, has a context size of 32K tokens. Based on needle-in-a-haystack evaluations, Lucie-7B-Instruct-human-data retains the base model's capacity to handle context windows of up to 32K tokens.

## Training details

### Training data

Lucie-7B-Instruct-human-data is trained on the following datasets published by third parties:
* [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) (English, 3944 samples; French, 1422; German, 241; Italian, 738; Spanish, 3854)
* [Dolly](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) (English, French, German, Spanish; 15015 x 4 samples)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French, 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English, 78580 samples)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (English, 21151 samples; French, 1223; German, 1515; Italian, 370; Spanish, 14078)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French, 4613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French, 1849 samples)

It is also trained on the following datasets developed for the Lucie instruct models:
* [Croissant Aligned Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (French-English, 20K examples sampled randomly from 80K total)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
    * French: openllm_french.jsonl (24x10 samples)
    * English: openllm_english.jsonl (24x10 samples)

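
To get a sense of the language balance, the per-language totals of the third-party datasets can be tallied from the counts listed above. This is only a sketch that sums those figures: it leaves out Croissant Aligned Instruct and the hard-coded prompts, and it assumes the listed counts are the numbers of samples actually used in training. The language codes and the `COUNTS` dictionary are introduced here for illustration.

```python
from collections import Counter

# Sample counts copied from the third-party dataset list above.
# Dolly is listed as 15015 samples in each of four languages.
COUNTS = {
    "Aya Dataset": {"en": 3944, "fr": 1422, "de": 241, "it": 738, "es": 3854},
    "Dolly": {"en": 15015, "fr": 15015, "de": 15015, "es": 15015},
    "ENS": {"fr": 394},
    "FLAN v2 Converted": {"en": 78580},
    "Open Assistant 1": {"en": 21151, "fr": 1223, "de": 1515, "it": 370, "es": 14078},
    "Oracle": {"fr": 4613},
    "PIAF": {"fr": 1849},
}

# Add up the counts language by language across all datasets.
per_language = Counter()
for dataset_counts in COUNTS.values():
    per_language.update(dataset_counts)

total = sum(per_language.values())
for lang, n in per_language.most_common():
    print(f"{lang}: {n} samples ({n / total:.1%})")
```
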
### Preprocessing
* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only the languages on which Lucie-7B was trained.
* Filtering by keyword: Open Assistant examples were removed if any assistant response contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama).

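
The two filters above could be sketched roughly as follows. This is not the code used to build the training set: the example schema (`lang`, `messages`, `role`, `content`), the function name `keep_example`, the short keyword list and the case-insensitive matching are all assumptions made for illustration. The language set is taken from this card's metadata, and the real keyword list is the `filter_strings` list linked above.

```python
# Hypothetical stand-ins, for illustration only.
ALLOWED_LANGS = {"fr", "en", "de", "it", "es"}  # languages from the card metadata
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]  # NOT the real filter_strings list


def keep_example(example: dict) -> bool:
    """Return True if the example passes the language and keyword filters."""
    # Filtering by language.
    if example["lang"] not in ALLOWED_LANGS:
        return False
    # Filtering by keyword: only assistant responses are inspected.
    for turn in example["messages"]:
        if turn["role"] != "assistant":
            continue
        text = turn["content"].lower()  # case-insensitive match (assumption)
        if any(keyword.lower() in text for keyword in FILTER_STRINGS):
            return False
    return True


examples = [
    {"lang": "fr", "messages": [
        {"role": "user", "content": "Qui es-tu ?"},
        {"role": "assistant", "content": "Je suis un assistant."},
    ]},
    {"lang": "en", "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I am ChatGPT."},
    ]},
    {"lang": "pt", "messages": [
        {"role": "user", "content": "Quem é você?"},
        {"role": "assistant", "content": "Sou um assistente."},
    ]},
]

kept = [ex for ex in examples if keep_example(ex)]
print([ex["lang"] for ex in kept])
```

Here the English example is dropped by the keyword filter and the Portuguese one by the language filter.
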
### Instruction template
Lucie-7B-Instruct-human-data was trained on the chat template from Llama 3.1, with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```

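
To make the template concrete, here is a minimal sketch of a function that renders a list of messages into this format. The function name `format_chat` and its `add_generation_prompt` argument are hypothetical, and the code assumes each header is followed by exactly two newlines, as in the Llama 3.1 template. In practice a tokenizer's chat template would normally do this work.

```python
def format_chat(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts with the template above.

    Assumption: each header is followed by exactly two newlines, as in the
    Llama 3.1 template this one is derived from.
    """
    parts = ["<s>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so that the model generates the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_chat(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me three tips for staying in shape."},
    ],
    add_generation_prompt=True,
)
print(prompt)
```
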
### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions:
* context length: 4096<sup>*</sup>
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

<sup>*</sup>As noted above, while Lucie-7B-Instruct-human-data is trained on sequences of 4096 tokens, it maintains the capacity of the base model, Lucie-7B, to handle context sizes of up to 32K tokens.

## Testing the model

### Test with ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-human-data/resolve/main/Lucie-7B-q4_k_m.gguf)
* Copy the [`Modelfile`](Modelfile), adapting if necessary the path to the GGUF file (line starting with `FROM`).
* Run in a shell:
    * `ollama create -f Modelfile Lucie`
    * `ollama run Lucie`
* Once ">>>" appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing "`/clear`"
* End the session by typing "`/bye`".

Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
    * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

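
Besides the interactive prompt, Ollama also serves a local REST API, which can be handy for scripted tests. The following is a minimal sketch, assuming the Ollama server is running at its default address `http://localhost:11434` and that the model was created under the name `Lucie` as in the steps above. The helper names `build_chat_request` and `ask_lucie` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt, model="Lucie"):
    """Build a non-streaming chat request for the local Ollama server."""
    payload = {
        "model": model,  # name given to `ollama create`
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_lucie(prompt):
    """Send the prompt to Ollama and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, `print(ask_lucie("Bonjour Lucie"))` should print the model's reply.
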
### Test with vLLM

#### 1. Run vLLM Docker Container

Use the following command to deploy the model,
replacing `INSERT_YOUR_HF_TOKEN` with your Hugging Face Hub token.

```bash
docker run --runtime nvidia --gpus=all \
    --env "HUGGING_FACE_HUB_TOKEN=INSERT_YOUR_HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model OpenLLM-France/Lucie-7B-Instruct-human-data
```

#### 2. Test using OpenAI Client in Python

To test the deployed model, use the OpenAI Python client as follows:

```python
from openai import OpenAI

# Initialize the client
client = OpenAI(base_url='http://localhost:8000/v1', api_key='empty')

# Define the input content
content = "Hello Lucie"

# Generate a response
chat_response = client.chat.completions.create(
    model="OpenLLM-France/Lucie-7B-Instruct-human-data",
    messages=[
        {"role": "user", "content": content}
    ],
)
print(chat_response.choices[0].message.content)
```

## Citation

When using the Lucie-7B-Instruct-human-data model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour,
Evan Dufraisse, Yaya Sy, Pierre-Carl Langlais, Anastasia Stasenko,
Laura Rivière, Christophe Cerisara, Jean-Pierre Lorré (2025)
Lucie-7B LLM and its training dataset
```bibtex
@misc{openllm2023claire,
      title={The Lucie-7B LLM and the Lucie Training Dataset: open resources for multilingual language generation},
      author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Evan Dufraisse and Yaya Sy and Pierre-Carl Langlais and Anastasia Stasenko and Laura Rivière and Christophe Cerisara and Jean-Pierre Lorré},
      year={2025},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI) in particular.

Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cerisara (LORIA),
Evan Dufraisse (CEA),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci),
and
Olivier Ferret (CEA)
for their helpful input.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways.

## Contact

[email protected]