juliehunter committed on
Commit
cc725bd
·
verified ·
1 Parent(s): 065ee53

Update README.md

  1. README.md +28 -0
README.md CHANGED
@@ -24,6 +24,7 @@ pipeline_tag: text-generation
24
  * [Training Details](#training-details)
25
  * [Training Data](#training-data)
26
  * [Preprocessing](#preprocessing)
27
  * [Training Procedure](#training-procedure)
28
  <!-- * [Evaluation](#evaluation) -->
29
  * [Testing the model](#testing-the-model)
@@ -64,6 +65,33 @@ And the following datasets developed for the Lucie instruct models:
64
  * Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
65
  * Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
66
 
67
  ### Training procedure
68
 
69
  The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase with the following exceptions:
 
24
  * [Training Details](#training-details)
25
  * [Training Data](#training-data)
26
  * [Preprocessing](#preprocessing)
27
+ * [Instruction template](#instruction-template)
28
  * [Training Procedure](#training-procedure)
29
  <!-- * [Evaluation](#evaluation) -->
30
  * [Testing the model](#testing-the-model)
 
65
  * Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
66
  * Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
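
The two filters described above can be sketched roughly as follows. This is an illustration only, not the project's actual code: the record layout (`lang`, `turns`, `role`, `content`), the `KEPT_LANGUAGES` set, the case-insensitive matching, and the three-item keyword list are all assumptions. The real keyword list is `filter_strings` in the linked `data.py`.

```python
# Hypothetical stand-ins. The real keyword list is `filter_strings` in the
# linked data.py; the language set here is an assumed placeholder.
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]
KEPT_LANGUAGES = {"fr", "en"}


def keep_example(example):
    """Return True if a conversation passes both filters.

    `example` is assumed to look like:
    {"lang": "en", "turns": [{"role": "user", "content": "..."}, ...]}
    """
    # Filtering by language: keep only languages the base model was trained on.
    if example["lang"] not in KEPT_LANGUAGES:
        return False
    # Filtering by keyword: drop the example if any assistant response
    # mentions another model's name (case-insensitive match is an assumption).
    for turn in example["turns"]:
        if turn["role"] != "assistant":
            continue
        text = turn["content"].lower()
        if any(keyword.lower() in text for keyword in FILTER_STRINGS):
            return False
    return True
```

Only assistant turns are checked here, so a user question that mentions another model does not by itself cause the example to be dropped.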
67
 
68
+ ### Instruction template
69
+ Lucie-7B-Instruct-human-data was trained with the Llama 3.1 chat template, the sole difference being that `<|begin_of_text|>` is replaced with `<s>`. The resulting template is:
70
+
71
+ ```
72
+ <s><|start_header_id|>system<|end_header_id|>
73
+
74
+ {SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>
75
+
76
+ {INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
77
+
78
+ {OUTPUT}<|eot_id|>
79
+ ```
80
+
81
+
82
+ An example:
83
+
84
+
85
+ ```
86
+ <s><|start_header_id|>system<|end_header_id|>
87
+
88
+ You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
89
+
90
+ Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
91
+
92
+ 1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
93
+ ```
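
As a minimal sketch, a prompt in the template above could be assembled by hand as shown below. The helper name `format_lucie_prompt` and its parameters are hypothetical, not part of the model's tooling. The sketch also assumes the string is passed to a tokenizer that does not add its own `<s>` token; if it does, the leading `<s>` would need to be dropped.

```python
def format_lucie_prompt(user_input, system="You are a helpful assistant.", output=None):
    """Render one system/user/assistant exchange in the instruction template.

    Hypothetical helper: if `output` is None, the string ends after the
    assistant header so the model can generate the response.
    """
    def header(role):
        # Each turn starts with a role header followed by a blank line.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n"

    prompt = "<s>" + header("system") + system + "<|eot_id|>"
    prompt += header("user") + user_input + "<|eot_id|>"
    prompt += header("assistant")
    if output is not None:
        # A complete training-style example closes the assistant turn.
        prompt += output + "<|eot_id|>"
    return prompt


print(format_lucie_prompt("Give me three tips for staying in shape."))
```

Leaving `output` unset gives a generation prompt, while passing it reproduces the full layout of a training example.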
94
+
95
  ### Training procedure
96
 
97
  The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase with the following exceptions: