juliehunter
commited on
Update README.md
Browse files
README.md
CHANGED
@@ -60,9 +60,8 @@ And the following datasets developed for the Lucie instruct models:
|
|
60 |
* English: openllm_english.jsonl (24x10 samples)
|
61 |
|
62 |
### Preprocessing
|
63 |
-
* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only
|
64 |
* Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
|
65 |
-
* Duplicate examples were removed from Open Assistant.
|
66 |
|
67 |
### Training procedure
|
68 |
|
|
|
60 |
* English: openllm_english.jsonl (24x10 samples)
|
61 |
|
62 |
### Preprocessing
|
63 |
+
* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
|
64 |
* Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).
|
|
|
65 |
|
66 |
### Training procedure
|
67 |
|