andreaskoepf committed
Commit baeb4d9 · 1 Parent(s): 8c88680

add credits and pretokenizer configuration

Files changed (1)
  1. README.md +50 -1
README.md CHANGED
@@ -57,4 +57,53 @@ You are a helpful, respectful and honest assistant. Always answer as helpfully a
 
 If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
 <|im_end|>
- ```
+ ```
+ 
+ ### Credits & Special Thanks
+ 
+ - Thanks to [Meta AI](https://ai.meta.com/) for training and releasing the CodeLlama model.
+ - Distributed training support was provided by EPFL's [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/) and [Natural Language Processing Lab](https://nlp.epfl.ch/).
+ - The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
+ - [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.
+ - [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) dataset.
+ - [shahules786](https://github.com/shahules786) de-duplicated and filtered the Dolphin and MegaCode datasets with a clustering/centroid approach and generated the orca-best and bestofmegacode datasets.
+ - [andreaskoepf](https://github.com/andreaskoepf/) prepared and orchestrated the training.
+ 
+ ## Ethical Considerations and Limitations
+ 
+ Testing conducted to date has been in English and has not covered, nor could it cover, all scenarios.
+ For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
+ in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses
+ to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
+ perform safety testing and tuning tailored to their specific applications of the model.
+ 
+ Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).
+ 
+ ## Configuration Details
+ 
+ The "pretokenizer" utility used to tokenize the datamix is part of the Open-Assistant GitHub repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer).
+ 
+ ### Pretokenizer Configuration
+ 
+ ```
+ orca_megacode_oasst_best:
+   datasets:
+     - orca-chat:
+         val_split: 0.01
+         max_val_set: 1000
+     - bestofmegacode:
+         val_split: 0.01
+         max_val_set: 1000
+     - oasst_export:
+         lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
+         #hf_dataset_name: OpenAssistant/oasst1
+         input_file_path: 2023-08-25_oasst_ready.jsonl.gz
+         top_k: 1
+         val_split: 0.025
+   output_dir: "output/orca_megacode_oasst_best"
+   filename_prefix: "orca_megacode_oasst_best"
+   min_assistant_tokens: 1
+ ```
+ 
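As a quick illustration, a minimal Python sketch (assuming PyYAML and a hypothetical local file name; this is not the Open-Assistant pretokenizer API) for inspecting a datamix entry like the one above:

```
# Minimal sketch: read a datamix entry like the one above and list each
# dataset with its validation-split settings. Illustrative only; this is
# not the Open-Assistant pretokenizer, and the file name is an assumption.
import yaml

with open("orca_megacode_oasst_best.yaml") as f:
    cfg = yaml.safe_load(f)["orca_megacode_oasst_best"]

for entry in cfg["datasets"]:
    # Each list item is a single-key mapping: {dataset_name: options}.
    (name, opts), = entry.items()
    opts = opts or {}
    print(f"{name}: val_split={opts.get('val_split')}, max_val_set={opts.get('max_val_set')}")

print("output dir:", cfg["output_dir"])
print("filename prefix:", cfg["filename_prefix"])
```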