---
license: apache-2.0
datasets:
- CohereForAI/aya_dataset
- argilla/databricks-dolly-15k-curated-multilingual
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- OpenAssistant/oasst1
language:
- fr
- en
- de
- it
- es
base_model:
- OpenLLM-France/Lucie-7B
pipeline_tag: text-generation
---

# Model Card for Lucie-7B-Instruct-human-data

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model](#testing-the-model)
  * [Test in python](#test-in-python)
  * [Test with ollama](#test-with-ollama)
  * [Test with vLLM](#test-with-vllm)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-human-data is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct-human-data is fine-tuned on human-produced instructions, collected either through open annotation campaigns or by applying templates to existing datasets.

## Training details

### Training data

Lucie-7B-Instruct-human-data was trained on the following datasets published by third parties:
* [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) (English, 3944 samples; French, 1422; German, 241; Italian, 738; Spanish, 3854)
* [Dolly](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) (English, French, German, Spanish; 15015 x 4 samples)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French, 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English, 78580 samples)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (English, 21151 samples; French, 1223; German, 1515; Italian, 370; Spanish, 14078)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French, 4613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French, 1849 samples)

It was also trained on the following datasets developed specifically for the Lucie instruct models:
* Croissant Aligned Instruct (French-English, 20K examples sampled randomly from 80K total)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
  * French: openllm_french.jsonl (24x10 samples)
  * English: openllm_english.jsonl (24x10 samples)

### Preprocessing

* Filtering by language: The Aya Dataset, Dolly, and Open Assistant were filtered to keep only English and French samples.
* Filtering by keyword: Examples from Open Assistant were removed if the assistant response contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
* Deduplication: Duplicate examples were removed from Open Assistant.

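The keyword filter can be illustrated with a short sketch. This is hypothetical code, not the project's actual implementation, and `FILTER_STRINGS` below is a truncated stand-in for the linked `filter_strings` list:

```python
# Illustrative sketch of the keyword filter: drop any conversation whose
# assistant turns mention a model name other than Lucie.
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]  # abbreviated stand-in

def keep_example(messages):
    """Return True if no assistant response contains a banned keyword."""
    for msg in messages:
        if msg["role"] == "assistant":
            text = msg["content"].lower()
            if any(kw.lower() in text for kw in FILTER_STRINGS):
                return False
    return True

examples = [
    [{"role": "user", "content": "Who are you?"},
     {"role": "assistant", "content": "I am Lucie, a French language model."}],
    [{"role": "user", "content": "Who are you?"},
     {"role": "assistant", "content": "I am ChatGPT."}],
]
kept = [ex for ex in examples if keep_example(ex)]
print(len(kept))  # 1: the second conversation is filtered out
```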
### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase, with the following exceptions:
* context length: 4096
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

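For intuition, these settings imply the following number of tokens per optimizer step (assuming sequences packed to the full context length):

```python
# Tokens seen per optimizer step with the settings above,
# assuming every sequence fills the full context window.
context_length = 4096
batch_size = 1024

tokens_per_step = context_length * batch_size
print(tokens_per_step)  # 4194304 (~4.2M tokens)
```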
## Testing the model

### Test in python

* [test_transformers_gguf.py](test_transformers_gguf.py): Test the GGUF model with the `transformers` package (warning: loading the model is slow)

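If you do not have the script at hand, loading the GGUF checkpoint via `transformers` can be sketched as follows. This is an illustrative sketch only: it assumes a `transformers` version with GGUF support (`gguf_file` argument), and uses the quantized file linked in the ollama section.

```python
# Illustrative sketch of loading the quantized model with transformers.
# Repo and file names are those referenced elsewhere in this card.

def load_model(repo_id="OpenLLM-France/Lucie-7B-Instruct-v1",
               gguf_file="Lucie-7B-q4_k_m.gguf"):
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
    model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
    return tokenizer, model

def generate(prompt, max_new_tokens=200):
    # Slow: the GGUF weights are downloaded and dequantized on load.
    tokenizer, model = load_model()
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example (downloads several GB on first use):
# print(generate("Quelle est la capitale de la France ?"))
```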
### Test with ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1/resolve/main/Lucie-7B-q4_k_m.gguf)
* Copy the [`Modelfile`](Modelfile), adapting the path to the GGUF file if necessary (the line starting with `FROM`).
* Run in a shell:
  * `ollama create -f Modelfile Lucie`
  * `ollama run Lucie`
* Once ">>>" appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing "`/clear`".
* End the session by typing "`/bye`".

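If you prefer to write the `Modelfile` yourself, a minimal sketch looks like the following. This is illustrative only; the repository's own `Modelfile` may set additional parameters (such as the chat template) and should be preferred:

```
FROM ./Lucie-7B-q4_k_m.gguf
PARAMETER temperature 0.7
```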
Useful for debugging:
* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
  * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

### Test with vLLM

#### 1. Run vLLM Docker Container

Use the following command to deploy the model, replacing `INSERT_YOUR_HF_TOKEN` with your Hugging Face Hub token.

```bash
docker run --runtime nvidia --gpus=all \
    --env "HUGGING_FACE_HUB_TOKEN=INSERT_YOUR_HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model OpenLLM-France/Lucie-7B-Instruct-v1
```

#### 2. Test using OpenAI Client in Python

To test the deployed model, use the OpenAI Python client as follows:

```python
from openai import OpenAI

# Initialize the client
client = OpenAI(base_url='http://localhost:8000/v1', api_key='empty')

# Define the input content
content = "Hello Lucie"

# Generate a response
chat_response = client.chat.completions.create(
    model="OpenLLM-France/Lucie-7B-Instruct-v1",
    messages=[
        {"role": "user", "content": content}
    ],
)
print(chat_response.choices[0].message.content)
```

## Citation

TODO

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444).

Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cerisara (LORIA),
Evan Dufraisse (CEA),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci), and
Olivier Ferret (CEA)
for their helpful input.

## Contact
