juliehunter commited on
Commit
1923fd2
·
verified ·
1 Parent(s): f278bdf

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -2
README.md CHANGED
@@ -60,9 +60,8 @@ And the following datasets developed for the Lucie instruct models:
60
  * English: openllm_english.jsonl (24x10 samples)
61
 
62
  ### Preprocessing
63
- * Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only English and French samples, respectively.
64
  * Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
65
- * Duplicate examples were removed from Open Assistant.
66
 
67
  ### Training procedure
68
 
 
60
  * English: openllm_english.jsonl (24x10 samples)
61
 
62
  ### Preprocessing
63
+ * Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
64
  * Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
 
65
 
66
  ### Training procedure
67