---
license: apache-2.0
datasets:
- CohereForAI/aya_dataset
- argilla/databricks-dolly-15k-curated-multilingual
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- OpenAssistant/oasst1
language:
- fr
- en
- de
- it
- es
base_model:
- OpenLLM-France/Lucie-7B
pipeline_tag: text-generation
---

# Model Card for Lucie-7B-Instruct-human-data

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model](#testing-the-model)
  * [Test in python](#test-in-python)
  * [Test with ollama](#test-with-ollama)
  * [Test with vLLM](#test-with-vllm)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-human-data is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct-human-data is fine-tuned on human-produced instructions, collected either through open annotation campaigns or by applying templates to existing datasets.

## Training details

### Training data

Lucie-7B-Instruct-human-data is trained on the following datasets published by third parties:

* [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) (English, 3944 samples; French, 1422; German, 241; Italian, 738; Spanish, 3854)
* [Dolly](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) (English, French, German, Spanish; 15015 x 4 samples)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French, 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English, 78580 samples)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (English, 21151 samples; French, 1223; German, 1515; Italian, 370; Spanish, 14078)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French, 4613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French, 1849 samples)

And the following datasets developed for the Lucie instruct models:

* Croissant Aligned Instruct (French-English, 20K examples sampled randomly from 80K total)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
  * French: openllm_french.jsonl (24x10 samples)
  * English: openllm_english.jsonl (24x10 samples)
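The 20K-from-80K random draw for Croissant Aligned Instruct can be sketched as follows; the actual sampling code and seed are not published, so the seed and variable names here are purely illustrative:

```python
import random

# Illustrative only: the sampling code and seed used for Croissant
# Aligned Instruct are not published in this model card.
rng = random.Random(42)            # hypothetical seed
pool = list(range(80_000))         # stand-in for the 80K aligned examples
subset = rng.sample(pool, 20_000)  # draw 20K without replacement
```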
### Preprocessing

* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only their English and French samples.
* Filtering by keyword: Open Assistant examples were filtered out if their assistant responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
* Deduplication: duplicate examples were removed from Open Assistant.
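The three preprocessing steps above can be sketched on toy data as follows; the field names and the two keywords are illustrative (the real keyword list is the linked `filter_strings`):

```python
# Toy sketch of the three preprocessing steps; field names and the two
# keywords are illustrative (the real list is the linked `filter_strings`).
examples = [
    {"lang": "fr", "response": "Je suis Lucie."},
    {"lang": "de", "response": "Ich bin ein Assistent."},  # dropped: language
    {"lang": "en", "response": "I am ChatGPT."},           # dropped: keyword
    {"lang": "en", "response": "I am Lucie."},
    {"lang": "en", "response": "I am Lucie."},             # dropped: duplicate
]

KEPT_LANGS = {"en", "fr"}
FILTER_STRINGS = ["ChatGPT", "Gemma"]  # illustrative subset of the real list

# Steps 1 and 2: keep only English/French, drop keyword matches.
filtered = [
    ex for ex in examples
    if ex["lang"] in KEPT_LANGS
    and not any(kw in ex["response"] for kw in FILTER_STRINGS)
]

# Step 3: deduplicate while preserving order.
seen, deduped = set(), []
for ex in filtered:
    key = (ex["lang"], ex["response"])
    if key not in seen:
        seen.add(key)
        deduped.append(ex)
```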
### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions:

* context length: 4096
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6
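The two learning-rate values above imply a decay from 3e-5 down to 3e-6 over training; the card does not state the shape of the schedule, so the cosine decay below is only an assumption for illustration:

```python
import math

MAX_LR, MIN_LR = 3e-5, 3e-6  # endpoints from the list above

def lr_at_step(step: int, total_steps: int) -> float:
    """Cosine decay from MAX_LR to MIN_LR. The decay shape is an
    assumption; the card only specifies the two endpoints."""
    progress = step / total_steps
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```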
## Testing the model

### Test in python

* [test_transformers_gguf.py](test_transformers_gguf.py): Test the GGUF model with the `transformers` package (WARNING: loading the model takes a long time)

### Test with ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1/resolve/main/Lucie-7B-q4_k_m.gguf)
* Copy the [`Modelfile`](Modelfile), adapting the path to the GGUF file if necessary (line starting with `FROM`).
* Run in a shell:
  * `ollama create -f Modelfile Lucie`
  * `ollama run Lucie`
* Once ">>>" appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing "`/clear`"
* End the session by typing "`/bye`".

Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
* Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

### Test with vLLM

#### 1. Run vLLM Docker Container

Use the following command to deploy the model,
replacing `INSERT_YOUR_HF_TOKEN` with your Hugging Face Hub token.

```bash
docker run --runtime nvidia --gpus=all \
    --env "HUGGING_FACE_HUB_TOKEN=INSERT_YOUR_HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model OpenLLM-France/Lucie-7B-Instruct-v1
```

#### 2. Test using OpenAI Client in Python

To test the deployed model, use the OpenAI Python client as follows:

```python
from openai import OpenAI

# Initialize the client
client = OpenAI(base_url='http://localhost:8000/v1', api_key='empty')

# Define the input content
content = "Hello Lucie"

# Generate a response
chat_response = client.chat.completions.create(
    model="OpenLLM-France/Lucie-7B-Instruct-v1",
    messages=[
        {"role": "user", "content": content}
    ],
)
print(chat_response.choices[0].message.content)
```

## Citation

TODO

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).

Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cerisara (LORIA),
Evan Dufraisse (CEA),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci),
and
Olivier Ferret (CEA)
for their helpful input.

## Contact