andreaskoepf committed
Commit · baeb4d9
Parent(s): 8c88680

add credits and pretokenizer configuration

README.md CHANGED
@@ -57,4 +57,53 @@ You are a helpful, respectful and honest assistant. Always answer as helpfully a

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<|im_end|>
```
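
The excerpt above ends the chat prompt template. As a rough illustration, the sketch below assembles a prompt in the same <|im_start|>/<|im_end|> (ChatML-style) format; `build_prompt` and the truncated system message are illustrative assumptions rather than part of the model's tooling, so the full template earlier in this README remains the reference.

```
# Minimal sketch (assumption): build a prompt in the <|im_start|>/<|im_end|>
# chat format shown above. `build_prompt` is an illustrative helper only.
SYSTEM_MESSAGE = (
    "You are a helpful, respectful and honest assistant. "
    "If you don't know the answer to a question, please don't share false information."
)

def build_prompt(user_message: str, system_message: str = SYSTEM_MESSAGE) -> str:
    """Return a single prompt string ending with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

if __name__ == "__main__":
    print(build_prompt("Write a Python function that reverses a string."))
```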

### Credits & Special Thanks

- Thanks to [Meta AI](https://ai.meta.com/) for training and releasing the CodeLlama model.
- Distributed training support was provided by EPFL's [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/) and [Natural Language Processing Lab](https://nlp.epfl.ch/).
- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.
- [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) dataset.
- [shahules786](https://github.com/shahules786) de-duplicated and filtered the Dolphin and Megacode datasets with a clustering/centroid approach and generated the orca-best and bestofmegacode datasets.
- [andreaskoepf](https://github.com/andreaskoepf/) prepared & orchestrated the training.

## Ethical Considerations and Limitations

Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios.
For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
perform safety testing and tuning tailored to their specific applications of the model.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Configuration Details

The "pretokenizer" utility used to tokenize the datamix is part of the Open-Assistant GitHub repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer).

### Pretokenizer Configuration

```
orca_megacode_oasst_best:
  datasets:
    - orca-chat:
        val_split: 0.01
        max_val_set: 1000
    - bestofmegacode:
        val_split: 0.01
        max_val_set: 1000
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        #hf_dataset_name: OpenAssistant/oasst1
        input_file_path: 2023-08-25_oasst_ready.jsonl.gz
        top_k: 1
        val_split: 0.025
  output_dir: "output/orca_megacode_oasst_best"
  filename_prefix: "orca_megacode_oasst_best"
  min_assistant_tokens: 1
```
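
For a quick sanity check of a datamix file like the one above, it can be read with any YAML parser. The snippet below is a hypothetical helper (not part of the Open-Assistant pretokenizer) and assumes the configuration was saved as `orca_megacode_oasst_best.yaml`; it simply lists each dataset with its validation-split settings.

```
# Hypothetical helper: parse the datamix config shown above and print each
# dataset's validation-split settings. Assumes the YAML was saved as
# "orca_megacode_oasst_best.yaml"; not part of the Open-Assistant tooling.
import yaml  # pip install pyyaml

with open("orca_megacode_oasst_best.yaml") as f:
    config = yaml.safe_load(f)["orca_megacode_oasst_best"]

print("output_dir:", config["output_dir"])
for entry in config["datasets"]:
    # Each list entry is a single-key mapping: {dataset_name: {options...}}
    (name, options), = entry.items()
    print(f"- {name}: val_split={options.get('val_split')}, max_val_set={options.get('max_val_set')}")
```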