OpenLLM-France/Lucie-7B-Instruct-human-data


juliehunter committed about 11 hours ago

Commit cc725bd · verified · 1 Parent(s): 065ee53

Update README.md

Files changed (1)

README.md +28 -0


@@ -24,6 +24,7 @@ pipeline_tag: text-generation
 * [Training Details](#training-details)
   * [Training Data](#training-data)
   * [Preprocessing](#preprocessing)
+  * [Instruction template](#instruction-template)
   * [Training Procedure](#training-procedure)
 <!-- * [Evaluation](#evaluation) -->
 * [Testing the model](#testing-the-model)
@@ -64,6 +65,33 @@ And the following datasets developed for the Lucie instruct models:
 * Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
 * Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
+### Instruction template:
+Lucie-7B-Instruct-human-data was trained on the chat template from Llama 3.1 with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:
+```
+<s><|start_header_id|>system<|end_header_id|>
+{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>
+{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
+{OUTPUT}<|eot_id|>
+```
+An example:
+```
+<s><|start_header_id|>system<|end_header_id|>
+You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
+Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
+1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
+```
 ### Training procedure
 The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase with the following exceptions:
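
For readers who want to see the instruction template added in this commit as a string operation, here is a minimal sketch in plain Python. The helper names `format_turn` and `build_prompt` are hypothetical and not part of the Lucie codebase. The single newline after each `<|end_header_id|>` follows the template as rendered in the README; the exact whitespace is an assumption, and the chat template shipped with the model's tokenizer should be treated as authoritative.

```python
from typing import Optional

# Special tokens as they appear in the README's instruction template.
BOS = "<s>"  # replaces Llama 3.1's <|begin_of_text|>
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"


def format_turn(role: str, content: str) -> str:
    """Render one complete turn (hypothetical helper).

    Assumption: one newline separates the header from the content,
    as shown in the README.
    """
    return f"{START_HEADER}{role}{END_HEADER}\n{content}{EOT}"


def build_prompt(system: str, user: str, output: Optional[str] = None) -> str:
    """Assemble a single-exchange prompt in the README's template.

    With `output` set, the result is a full training-style example.
    With `output=None`, the assistant header is left open so that a
    model could continue from it.
    """
    prompt = BOS + format_turn("system", system) + format_turn("user", user)
    if output is None:
        return prompt + f"{START_HEADER}assistant{END_HEADER}\n"
    return prompt + format_turn("assistant", output)


if __name__ == "__main__":
    print(build_prompt(
        "You are a helpful assistant.",
        "Give me three tips for staying in shape.",
    ))
```

Calling `build_prompt` with all three arguments reproduces the layout of the README's example; omitting `output` gives the prefix one would pass to the model at inference time, under the whitespace assumption stated above.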