OpenLLM-France/Lucie-7B-Instruct-human-data

juliehunter committed 4 days ago

Commit 1923fd2 · verified · 1 parent: f278bdf

Update README.md

Files changed (1)

README.md +1 -2

@@ -60,9 +60,8 @@ And the following datasets developed for the Lucie instruct models:
     * English: openllm_english.jsonl (24x10 samples)
 ### Preprocessing
-* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only English and French samples, respectively.
+* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
 * Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
-* Duplicate examples were removed from Open Assistant.
 ### Training procedure
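
For readers who want a concrete picture of the two preprocessing filters named in this diff, here is a minimal sketch in Python. It is not the code from the Lucie-Training repository: the record layout (`lang`, `messages`, `role`, `content`), the `KEPT_LANGUAGES` set, the short `FILTER_STRINGS` list, the case-insensitive matching, and the function names are all assumptions made for illustration. The actual keyword list is the `filter_strings` list linked above.

```python
from typing import Dict, List

# Hypothetical placeholders. The real keyword list is `filter_strings` in the
# linked data.py, and the real language set is whatever Lucie-7B was trained on.
FILTER_STRINGS = ["ChatGPT", "Gemma", "Llama"]
KEPT_LANGUAGES = {"en", "fr"}


def keep_example(example: Dict) -> bool:
    """Return True if the example passes both filters (assumed record layout)."""
    # Filter by language: drop examples whose language tag is not in the kept set.
    if example.get("lang") not in KEPT_LANGUAGES:
        return False
    # Filter by keyword: drop the example if any assistant turn mentions a keyword.
    # Case-insensitive matching is an assumption of this sketch.
    for turn in example.get("messages", []):
        if turn.get("role") != "assistant":
            continue
        text = turn.get("content", "").lower()
        if any(keyword.lower() in text for keyword in FILTER_STRINGS):
            return False
    return True


def filter_examples(examples: List[Dict]) -> List[Dict]:
    """Keep only the examples that pass both filters, preserving order."""
    return [example for example in examples if keep_example(example)]
```

Note that only assistant turns are checked for keywords, so a user message that mentions another model does not by itself remove the example.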