andreaskoepf committed
Commit · baeb4d9
Parent(s): 8c88680

add credits and pretokenizer configuration

README.md CHANGED
@@ -57,4 +57,53 @@ You are a helpful, respectful and honest assistant. Always answer as helpfully a

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<|im_end|>
```
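
The excerpt above ends the chat prompt template. As a rough illustration, the sketch below assembles a prompt in the same <|im_start|>/<|im_end|> (ChatML-style) format; `build_prompt` and the truncated system message are illustrative assumptions rather than part of the model's tooling, so the full template earlier in this README remains the reference.

```
# Minimal sketch (assumption): build a prompt in the <|im_start|>/<|im_end|>
# chat format shown above. `build_prompt` is an illustrative helper only.
SYSTEM_MESSAGE = (
    "You are a helpful, respectful and honest assistant. "
    "If you don't know the answer to a question, please don't share false information."
)

def build_prompt(user_message: str, system_message: str = SYSTEM_MESSAGE) -> str:
    """Return a single prompt string ending with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

if __name__ == "__main__":
    print(build_prompt("Write a Python function that reverses a string."))
```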

### Credits & Special Thanks

- Thanks to [Meta AI](https://ai.meta.com/) for training and releasing the CodeLlama model.
- Distributed training support was provided by EPFL's [Machine Learning and Optimization Laboratory](https://www.epfl.ch/labs/mlo/) and [Natural Language Processing Lab](https://nlp.epfl.ch/).
- The open-source [epfLLM/Megatron-LLM](https://github.com/epfLLM/Megatron-LLM) trainer was used for fine-tuning.
- [rombodawg](https://huggingface.co/rombodawg) curated the [LosslessMegaCodeTrainingV2_1m_Evol_Uncensored](https://huggingface.co/datasets/rombodawg/LosslessMegaCodeTrainingV2_1m_Evol_Uncensored) dataset.
- [ehartford](https://huggingface.co/ehartford) generated and published the [ehartford/dolphin](https://huggingface.co/datasets/ehartford/dolphin) dataset.
- [shahules786](https://github.com/shahules786) de-duplicated and filtered the Dolphin and Megacode datasets with a clustering/centroid approach and generated the orca-best and bestofmegacode datasets.
- [andreaskoepf](https://github.com/andreaskoepf/) prepared & orchestrated the training.

## Ethical Considerations and Limitations

Testing conducted to date has been in English, and has not covered, nor could it cover, all scenarios.
For these reasons, as with all LLMs, the potential outputs of llama2-70b-oasst-sft-v10 cannot be predicted
in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses
to user prompts. Therefore, before deploying any applications of llama2-70b-oasst-sft-v10, developers should
perform safety testing and tuning tailored to their specific applications of the model.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Configuration Details

The "pretokenizer" utility used to tokenize the datamix is part of the Open-Assistant GitHub repository and can be found here: [model/pretokenizer](https://github.com/LAION-AI/Open-Assistant/tree/main/model/pretokenizer).

### Pretokenizer Configuration

```
orca_megacode_oasst_best:
  datasets:
    - orca-chat:
        val_split: 0.01
        max_val_set: 1000
    - bestofmegacode:
        val_split: 0.01
        max_val_set: 1000
    - oasst_export:
        lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
        #hf_dataset_name: OpenAssistant/oasst1
        input_file_path: 2023-08-25_oasst_ready.jsonl.gz
        top_k: 1
        val_split: 0.025
  output_dir: "output/orca_megacode_oasst_best"
  filename_prefix: "orca_megacode_oasst_best"
  min_assistant_tokens: 1
```
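
For a quick sanity check of a datamix file like the one above, it can be read with any YAML parser. The snippet below is a hypothetical helper (not part of the Open-Assistant pretokenizer) and assumes the configuration was saved as `orca_megacode_oasst_best.yaml`; it simply lists each dataset with its validation-split settings.

```
# Hypothetical helper: parse the datamix config shown above and print each
# dataset's validation-split settings. Assumes the YAML was saved as
# "orca_megacode_oasst_best.yaml"; not part of the Open-Assistant tooling.
import yaml  # pip install pyyaml

with open("orca_megacode_oasst_best.yaml") as f:
    config = yaml.safe_load(f)["orca_megacode_oasst_best"]

print("output_dir:", config["output_dir"])
for entry in config["datasets"]:
    # Each list entry is a single-key mapping: {dataset_name: {options...}}
    (name, options), = entry.items()
    print(f"- {name}: val_split={options.get('val_split')}, max_val_set={options.get('max_val_set')}")
```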