---
license: other
license_name: databricks-open-model-license
license_link: https://www.databricks.com/legal/open-model-license
base_model: databricks/dbrx-base
tags:
- generated_from_trainer
- axolotl
datasets:
- cognitivecomputations/Dolphin-2.9
- teknium/OpenHermes-2.5
- m-a-p/CodeFeedback-Filtered-Instruction
- cognitivecomputations/dolphin-coder
- cognitivecomputations/samantha-data
- microsoft/orca-math-word-problems-200k
- Locutusque/function-calling-chatml
- internlm/Agent-FLAN
---

# GGUF fix

This is the same model as [cognitivecomputations/dolphin-2.9.1-dbrx](https://huggingface.co/cognitivecomputations/dolphin-2.9.1-dbrx), but with [gguf fixes made by Kenjoyer](https://huggingface.co/Kenjoyer/dolphin-2.9.1-dbrx-llamacppfix) applied (thanks a lot!). This model can be converted into GGUF using llama.cpp.

# Benchmarks and personal opinion

### NeoEvalPlusN_benchmark

[My meme benchmark.](https://huggingface.co/datasets/ChuckMcSneed/NeoEvalPlusN_benchmark)

|Name                                         |Quant|Size  |B  |C  |D  |S   |P   |total|BCD|SP  |
|---------------------------------------------|-----|------|---|---|---|----|----|-----|---|----|
|cognitivecomputations/dolphin-2.9.1-dbrx     |Q6_K |16x12B|3  |1  |3  |4   |6   |17   |7  |10  |
|cognitivecomputations/dolphin-2.9.1-qwen-110b|Q6_K |110B  |0  |1  |3  |3.75|4.25|12   |4  |8   |
|databricks/dbrx-instruct                     |Q6_K |16x12B|0  |0  |0  |6.5 |4.5 |11   |0  |11  |
|cognitivecomputations/dolphin-2.2-70b        |Q6_K |70B   |0  |1  |1  |4.5 |4.5 |11   |2  |9   |
|Maximum                                      |n/a  |n/a   |3  |2  |3  |8   |6   |22   |8  |14  |

More compliant than the official instruct tune (BCD). To my surprise, performed much better overall than qwen-110b tuned on the same dataset. Wrote 6 perfect poems (P column), which is **very** unusual. Only models from the goliath family and the more recent llama-3-70b-instruct could do that. Stylized writing tests (S column) were a bit disappointing; Dolphin is not famous for that.

In practical use, did perform better than the official tune. Still knows a lot, just like the official tune.
Writing is not great; wouldn't use it over Command-r+ unless I need to know some obscure facts. Feels like quantization hurts it a lot more than it hurts dense models.

Verdict: Meh, just like the other dolphins. Eric, no disrespect, but you need to get better datasets. GPTslop really hurts practical performance of the model.

# Original model card below

# Dolphin 2.9.1 DBRX 🐬

Curated and trained by Eric Hartford, Lucas Atkins, Fernando Fernandes, and Cognitive Computations

[![Discord](https://img.shields.io/discord/1156064224225808488?logo=Discord&logoColor=%23ffffff&label=Discord&link=https%3A%2F%2Fdiscord.gg%2FtCMkMDDHwm)](https://discord.gg/cognitivecomputations)
Discord: https://discord.gg/cognitivecomputations

Our appreciation for the sponsors of Dolphin 2.9.1:
- [Crusoe Cloud](https://crusoe.ai/) - provided excellent on-demand 8xH100 node

This model is based on [databricks/dbrx-base](https://huggingface.co/databricks/dbrx-base), and is governed by [databricks-open-model-license](https://www.databricks.com/legal/open-model-license)

The base model has 32k context, and the full-weight fine-tuning was with 4k sequence length.

This model was trained FFT on parameters selected by [Laser Scanner](https://github.com/cognitivecomputations/laserRMT/blob/main/laser_scanner.py), using ChatML prompt template format.

example:

```
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

Dolphin-2.9.1 has a variety of instruction, conversational, and coding skills. It also has initial agentic abilities and supports function calling.

Dolphin is uncensored. We have filtered the dataset to remove alignment and bias. This makes the model more compliant. You are advised to implement your own alignment layer before exposing the model as a service. It will be highly compliant with any requests, even unethical ones. Please read my blog post about uncensored models.
https://erichartford.com/uncensored-models
You are responsible for any content you create using this model. Enjoy responsibly.

Dolphin is licensed according to Meta's Llama license. We grant permission for any use, including commercial, that falls within accordance with Meta's Llama-3 license. Dolphin was trained on data generated from GPT4, among other models.

## Evals

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/tVh5xVCGvjPyLgMCqp-IY.png)

## Training

[Built with Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
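
The axolotl config below lists the trained weights under `unfrozen_parameters`, and its first entry (`^lm_head.weight$`) is written as a regular expression. As a rough illustration of how such a list can pick out trainable parameters, here is a minimal sketch. It assumes each entry is a regex tested from the start of the parameter name with `re.match`; axolotl's actual matching logic may differ. The helper `select_unfrozen` and the sample parameter names are hypothetical and only shaped like the entries in the config.

```python
import re


def select_unfrozen(param_names, patterns):
    """Return the parameter names matched by at least one pattern.

    Assumption: patterns are regexes anchored at the start of the name
    (re.match semantics). This is an illustration, not axolotl's code.
    """
    compiled = [re.compile(p) for p in patterns]
    return [name for name in param_names if any(rx.match(name) for rx in compiled)]


# Hypothetical parameter names, shaped like the entries in the config.
param_names = [
    "lm_head.weight",
    "transformer.blocks.30.ffn.experts.mlp_experts.0.v1",
    "transformer.blocks.31.ffn.experts.mlp_experts.0.v1",
    "transformer.blocks.12.ffn.experts.mlp_experts.0.v1",
    "transformer.wte.weight",
]

# A small subset of patterns in the style of `unfrozen_parameters`.
patterns = [
    r"^lm_head.weight$",
    r"transformer.blocks.30.ffn.experts.mlp_experts.0.v1",
    r"transformer.blocks.31.ffn.experts.mlp_experts.0.v1",
]

trainable = select_unfrozen(param_names, patterns)
frozen = [name for name in param_names if name not in trainable]

print("trainable:", trainable)
print("frozen:", frozen)
```

In a real training loop, the same selection would typically drive each parameter's `requires_grad` flag. Note that the unescaped dots in these patterns match any character, so the matching is slightly looser than a literal name comparison.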
See axolotl config

axolotl version: `0.4.0`
```yaml
base_model: /workspace/axolotl/dbrx-checkpoint
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true

load_in_8bit: false
# load_in_4bit: true
strict: false

# adapter: qlora
# lora_modules_to_save: [embed_tokens, lm_head]

# lora_r: 32
# lora_alpha: 16
# lora_dropout: 0.05
# lora_target_linear: false
# lora_fan_in_fan_out:

datasets:
  - path: /workspace/datasets/dolphin-2.9/dolphin201-sharegpt2.jsonl
    type: sharegpt
    conversation: chatml
  # - path: /workspace/datasets/dolphin-2.9/Ultrachat200kunfiltered.jsonl
  #   type: sharegpt
  #   conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/dolphin-coder-translate-sharegpt2.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/dolphin-coder-codegen-sharegpt2.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/m-a-p_Code-Feedback-sharegpt-unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/m-a-p_CodeFeedback-Filtered-Instruction-sharegpt-unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/not_samantha_norefusals.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/Orca-Math-resort-unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/agent_instruct_react_unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/toolbench_instruct_j1s1_3k_unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/toolbench_negative_unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/toolbench_react_10p_unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/toolbench_tflan_cot_30p_unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  - path: /workspace/datasets/dolphin-2.9/openhermes200k_unfiltered.jsonl
    type: sharegpt
    conversation: chatml
  # - path: /workspace/datasets/dolphin-2.9/SystemConversations.jsonl
  #   type: sharegpt
  #   conversation: chatml

chat_template: chatml

unfrozen_parameters:
- ^lm_head.weight$
# ffn.experts.mlp_experts.0.v1 layers
- transformer.blocks.30.ffn.experts.mlp_experts.0.v1
- transformer.blocks.32.ffn.experts.mlp_experts.0.v1
- transformer.blocks.25.ffn.experts.mlp_experts.0.v1
- transformer.blocks.15.ffn.experts.mlp_experts.0.v1
- transformer.blocks.22.ffn.experts.mlp_experts.0.v1
- transformer.blocks.31.ffn.experts.mlp_experts.0.v1
- transformer.blocks.7.ffn.experts.mlp_experts.0.v1
- transformer.blocks.21.ffn.experts.mlp_experts.0.v1
- transformer.blocks.8.ffn.experts.mlp_experts.0.v1
- transformer.blocks.23.ffn.experts.mlp_experts.0.v1
# ffn.experts.mlp_experts.0.w1 layers
- transformer.blocks.7.ffn.experts.mlp_experts.0.w1
- transformer.blocks.8.ffn.experts.mlp_experts.0.w1
- transformer.blocks.30.ffn.experts.mlp_experts.0.w1
- transformer.blocks.4.ffn.experts.mlp_experts.0.w1
- transformer.blocks.0.ffn.experts.mlp_experts.0.w1
- transformer.blocks.32.ffn.experts.mlp_experts.0.w1
- transformer.blocks.6.ffn.experts.mlp_experts.0.w1
- transformer.blocks.3.ffn.experts.mlp_experts.0.w1
- transformer.blocks.25.ffn.experts.mlp_experts.0.w1
- transformer.blocks.5.ffn.experts.mlp_experts.0.w1
# ffn.experts.mlp_experts.0.w2 layers
- transformer.blocks.25.ffn.experts.mlp_experts.0.w2
- transformer.blocks.22.ffn.experts.mlp_experts.0.w2
- transformer.blocks.27.ffn.experts.mlp_experts.0.w2
- transformer.blocks.26.ffn.experts.mlp_experts.0.w2
- transformer.blocks.4.ffn.experts.mlp_experts.0.w2
- transformer.blocks.29.ffn.experts.mlp_experts.0.w2
- transformer.blocks.32.ffn.experts.mlp_experts.0.w2
- transformer.blocks.5.ffn.experts.mlp_experts.0.w2
- transformer.blocks.7.ffn.experts.mlp_experts.0.w2
- transformer.blocks.3.ffn.experts.mlp_experts.0.w2
# ffn.experts.mlp_experts.1.v1 layers
- transformer.blocks.27.ffn.experts.mlp_experts.1.v1
- transformer.blocks.25.ffn.experts.mlp_experts.1.v1
- transformer.blocks.29.ffn.experts.mlp_experts.1.v1
- transformer.blocks.33.ffn.experts.mlp_experts.1.v1
- transformer.blocks.23.ffn.experts.mlp_experts.1.v1
- transformer.blocks.30.ffn.experts.mlp_experts.1.v1
- transformer.blocks.6.ffn.experts.mlp_experts.1.v1
- transformer.blocks.21.ffn.experts.mlp_experts.1.v1
- transformer.blocks.15.ffn.experts.mlp_experts.1.v1
- transformer.blocks.7.ffn.experts.mlp_experts.1.v1
# ffn.experts.mlp_experts.1.w1 layers
- transformer.blocks.0.ffn.experts.mlp_experts.1.w1
- transformer.blocks.6.ffn.experts.mlp_experts.1.w1
- transformer.blocks.7.ffn.experts.mlp_experts.1.w1
- transformer.blocks.4.ffn.experts.mlp_experts.1.w1
- transformer.blocks.8.ffn.experts.mlp_experts.1.w1
- transformer.blocks.29.ffn.experts.mlp_experts.1.w1
- transformer.blocks.33.ffn.experts.mlp_experts.1.w1
- transformer.blocks.27.ffn.experts.mlp_experts.1.w1
- transformer.blocks.1.ffn.experts.mlp_experts.1.w1
- transformer.blocks.10.ffn.experts.mlp_experts.1.w1
# ffn.experts.mlp_experts.1.w2 layers
- transformer.blocks.25.ffn.experts.mlp_experts.1.w2
- transformer.blocks.23.ffn.experts.mlp_experts.1.w2
- transformer.blocks.27.ffn.experts.mlp_experts.1.w2
- transformer.blocks.29.ffn.experts.mlp_experts.1.w2
- transformer.blocks.31.ffn.experts.mlp_experts.1.w2
- transformer.blocks.4.ffn.experts.mlp_experts.1.w2
- transformer.blocks.32.ffn.experts.mlp_experts.1.w2
- transformer.blocks.30.ffn.experts.mlp_experts.1.w2
- transformer.blocks.21.ffn.experts.mlp_experts.1.w2
- transformer.blocks.33.ffn.experts.mlp_experts.1.w2
# ffn.experts.mlp_experts.10.v1 layers
- transformer.blocks.28.ffn.experts.mlp_experts.10.v1
- transformer.blocks.34.ffn.experts.mlp_experts.10.v1
- transformer.blocks.33.ffn.experts.mlp_experts.10.v1
- transformer.blocks.26.ffn.experts.mlp_experts.10.v1
- transformer.blocks.32.ffn.experts.mlp_experts.10.v1
- transformer.blocks.30.ffn.experts.mlp_experts.10.v1
- transformer.blocks.36.ffn.experts.mlp_experts.10.v1
- transformer.blocks.24.ffn.experts.mlp_experts.10.v1
- transformer.blocks.20.ffn.experts.mlp_experts.10.v1
- transformer.blocks.35.ffn.experts.mlp_experts.10.v1
# ffn.experts.mlp_experts.10.w1 layers
- transformer.blocks.24.ffn.experts.mlp_experts.10.w1
- transformer.blocks.33.ffn.experts.mlp_experts.10.w1
- transformer.blocks.8.ffn.experts.mlp_experts.10.w1
- transformer.blocks.7.ffn.experts.mlp_experts.10.w1
- transformer.blocks.34.ffn.experts.mlp_experts.10.w1
- transformer.blocks.28.ffn.experts.mlp_experts.10.w1
- transformer.blocks.30.ffn.experts.mlp_experts.10.w1
- transformer.blocks.1.ffn.experts.mlp_experts.10.w1
- transformer.blocks.3.ffn.experts.mlp_experts.10.w1
- transformer.blocks.5.ffn.experts.mlp_experts.10.w1
# ffn.experts.mlp_experts.10.w2 layers
- transformer.blocks.24.ffn.experts.mlp_experts.10.w2
- transformer.blocks.28.ffn.experts.mlp_experts.10.w2
- transformer.blocks.23.ffn.experts.mlp_experts.10.w2
- transformer.blocks.30.ffn.experts.mlp_experts.10.w2
- transformer.blocks.32.ffn.experts.mlp_experts.10.w2
- transformer.blocks.3.ffn.experts.mlp_experts.10.w2
- transformer.blocks.33.ffn.experts.mlp_experts.10.w2
- transformer.blocks.26.ffn.experts.mlp_experts.10.w2
- transformer.blocks.2.ffn.experts.mlp_experts.10.w2
- transformer.blocks.20.ffn.experts.mlp_experts.10.w2
# ffn.experts.mlp_experts.11.w1 layers
- transformer.blocks.6.ffn.experts.mlp_experts.11.w1
- transformer.blocks.8.ffn.experts.mlp_experts.11.w1
- transformer.blocks.9.ffn.experts.mlp_experts.11.w1
- transformer.blocks.0.ffn.experts.mlp_experts.11.w1
- transformer.blocks.10.ffn.experts.mlp_experts.11.w1
- transformer.blocks.28.ffn.experts.mlp_experts.11.w1
- transformer.blocks.3.ffn.experts.mlp_experts.11.w1
- transformer.blocks.5.ffn.experts.mlp_experts.11.w1
- transformer.blocks.33.ffn.experts.mlp_experts.11.w1
- transformer.blocks.13.ffn.experts.mlp_experts.11.w1
# ffn.experts.mlp_experts.11.w2 layers
- transformer.blocks.27.ffn.experts.mlp_experts.11.w2
- transformer.blocks.24.ffn.experts.mlp_experts.11.w2
- transformer.blocks.29.ffn.experts.mlp_experts.11.w2
- transformer.blocks.30.ffn.experts.mlp_experts.11.w2
- transformer.blocks.22.ffn.experts.mlp_experts.11.w2
- transformer.blocks.6.ffn.experts.mlp_experts.11.w2
- transformer.blocks.25.ffn.experts.mlp_experts.11.w2
- transformer.blocks.7.ffn.experts.mlp_experts.11.w2
- transformer.blocks.28.ffn.experts.mlp_experts.11.w2
- transformer.blocks.5.ffn.experts.mlp_experts.11.w2
# ffn.experts.mlp_experts.12.v1 layers
- transformer.blocks.30.ffn.experts.mlp_experts.12.v1
- transformer.blocks.21.ffn.experts.mlp_experts.12.v1
- transformer.blocks.27.ffn.experts.mlp_experts.12.v1
- transformer.blocks.28.ffn.experts.mlp_experts.12.v1
- transformer.blocks.29.ffn.experts.mlp_experts.12.v1
- transformer.blocks.8.ffn.experts.mlp_experts.12.v1
- transformer.blocks.10.ffn.experts.mlp_experts.12.v1
- transformer.blocks.23.ffn.experts.mlp_experts.12.v1
- transformer.blocks.6.ffn.experts.mlp_experts.12.v1
- transformer.blocks.20.ffn.experts.mlp_experts.12.v1
# ffn.experts.mlp_experts.12.w1 layers
- transformer.blocks.8.ffn.experts.mlp_experts.12.w1
- transformer.blocks.1.ffn.experts.mlp_experts.12.w1
- transformer.blocks.0.ffn.experts.mlp_experts.12.w1
- transformer.blocks.6.ffn.experts.mlp_experts.12.w1
- transformer.blocks.9.ffn.experts.mlp_experts.12.w1
- transformer.blocks.2.ffn.experts.mlp_experts.12.w1
- transformer.blocks.10.ffn.experts.mlp_experts.12.w1
- transformer.blocks.17.ffn.experts.mlp_experts.12.w1
- transformer.blocks.29.ffn.experts.mlp_experts.12.w1
- transformer.blocks.21.ffn.experts.mlp_experts.12.w1
# ffn.experts.mlp_experts.12.w2 layers
- transformer.blocks.6.ffn.experts.mlp_experts.12.w2
- transformer.blocks.25.ffn.experts.mlp_experts.12.w2
- transformer.blocks.27.ffn.experts.mlp_experts.12.w2
- transformer.blocks.8.ffn.experts.mlp_experts.12.w2
- transformer.blocks.31.ffn.experts.mlp_experts.12.w2
- transformer.blocks.21.ffn.experts.mlp_experts.12.w2
- transformer.blocks.2.ffn.experts.mlp_experts.12.w2
- transformer.blocks.29.ffn.experts.mlp_experts.12.w2
- transformer.blocks.32.ffn.experts.mlp_experts.12.w2
- transformer.blocks.30.ffn.experts.mlp_experts.12.w2
# ffn.experts.mlp_experts.13.v1 layers
- transformer.blocks.31.ffn.experts.mlp_experts.13.v1
- transformer.blocks.24.ffn.experts.mlp_experts.13.v1
- transformer.blocks.30.ffn.experts.mlp_experts.13.v1
- transformer.blocks.29.ffn.experts.mlp_experts.13.v1
- transformer.blocks.8.ffn.experts.mlp_experts.13.v1
- transformer.blocks.10.ffn.experts.mlp_experts.13.v1
- transformer.blocks.11.ffn.experts.mlp_experts.13.v1
- transformer.blocks.27.ffn.experts.mlp_experts.13.v1
- transformer.blocks.25.ffn.experts.mlp_experts.13.v1
- transformer.blocks.36.ffn.experts.mlp_experts.13.v1
# ffn.experts.mlp_experts.13.w1 layers
- transformer.blocks.4.ffn.experts.mlp_experts.13.w1
- transformer.blocks.10.ffn.experts.mlp_experts.13.w1
- transformer.blocks.6.ffn.experts.mlp_experts.13.w1
- transformer.blocks.0.ffn.experts.mlp_experts.13.w1
- transformer.blocks.3.ffn.experts.mlp_experts.13.w1
- transformer.blocks.24.ffn.experts.mlp_experts.13.w1
- transformer.blocks.8.ffn.experts.mlp_experts.13.w1
- transformer.blocks.1.ffn.experts.mlp_experts.13.w1
- transformer.blocks.30.ffn.experts.mlp_experts.13.w1
- transformer.blocks.11.ffn.experts.mlp_experts.13.w1
# ffn.experts.mlp_experts.13.w2 layers
- transformer.blocks.24.ffn.experts.mlp_experts.13.w2
- transformer.blocks.20.ffn.experts.mlp_experts.13.w2
- transformer.blocks.25.ffn.experts.mlp_experts.13.w2
- transformer.blocks.27.ffn.experts.mlp_experts.13.w2
- transformer.blocks.3.ffn.experts.mlp_experts.13.w2
- transformer.blocks.4.ffn.experts.mlp_experts.13.w2
- transformer.blocks.29.ffn.experts.mlp_experts.13.w2
- transformer.blocks.6.ffn.experts.mlp_experts.13.w2
- transformer.blocks.30.ffn.experts.mlp_experts.13.w2
- transformer.blocks.31.ffn.experts.mlp_experts.13.w2
# ffn.experts.mlp_experts.14.v1 layers
- transformer.blocks.28.ffn.experts.mlp_experts.14.v1
- transformer.blocks.26.ffn.experts.mlp_experts.14.v1
- transformer.blocks.29.ffn.experts.mlp_experts.14.v1
- transformer.blocks.35.ffn.experts.mlp_experts.14.v1
- transformer.blocks.24.ffn.experts.mlp_experts.14.v1
- transformer.blocks.8.ffn.experts.mlp_experts.14.v1
- transformer.blocks.32.ffn.experts.mlp_experts.14.v1
- transformer.blocks.15.ffn.experts.mlp_experts.14.v1
- transformer.blocks.11.ffn.experts.mlp_experts.14.v1
- transformer.blocks.22.ffn.experts.mlp_experts.14.v1
# ffn.experts.mlp_experts.14.w1 layers
- transformer.blocks.8.ffn.experts.mlp_experts.14.w1
- transformer.blocks.4.ffn.experts.mlp_experts.14.w1
- transformer.blocks.5.ffn.experts.mlp_experts.14.w1
- transformer.blocks.7.ffn.experts.mlp_experts.14.w1
- transformer.blocks.3.ffn.experts.mlp_experts.14.w1
- transformer.blocks.13.ffn.experts.mlp_experts.14.w1
- transformer.blocks.29.ffn.experts.mlp_experts.14.w1
- transformer.blocks.6.ffn.experts.mlp_experts.14.w1
- transformer.blocks.28.ffn.experts.mlp_experts.14.w1
- transformer.blocks.9.ffn.experts.mlp_experts.14.w1
# ffn.experts.mlp_experts.14.w2 layers
- transformer.blocks.26.ffn.experts.mlp_experts.14.w2
- transformer.blocks.24.ffn.experts.mlp_experts.14.w2
- transformer.blocks.29.ffn.experts.mlp_experts.14.w2
- transformer.blocks.28.ffn.experts.mlp_experts.14.w2
- transformer.blocks.31.ffn.experts.mlp_experts.14.w2
- transformer.blocks.5.ffn.experts.mlp_experts.14.w2
- transformer.blocks.4.ffn.experts.mlp_experts.14.w2
- transformer.blocks.32.ffn.experts.mlp_experts.14.w2
- transformer.blocks.6.ffn.experts.mlp_experts.14.w2
- transformer.blocks.22.ffn.experts.mlp_experts.14.w2
# ffn.experts.mlp_experts.15.v1 layers
- transformer.blocks.33.ffn.experts.mlp_experts.15.v1
- transformer.blocks.26.ffn.experts.mlp_experts.15.v1
- transformer.blocks.31.ffn.experts.mlp_experts.15.v1
- transformer.blocks.28.ffn.experts.mlp_experts.15.v1
- transformer.blocks.9.ffn.experts.mlp_experts.15.v1
- transformer.blocks.34.ffn.experts.mlp_experts.15.v1
- transformer.blocks.29.ffn.experts.mlp_experts.15.v1
- transformer.blocks.7.ffn.experts.mlp_experts.15.v1
- transformer.blocks.17.ffn.experts.mlp_experts.15.v1
- transformer.blocks.15.ffn.experts.mlp_experts.15.v1
# ffn.experts.mlp_experts.15.w1 layers
- transformer.blocks.6.ffn.experts.mlp_experts.15.w1
- transformer.blocks.9.ffn.experts.mlp_experts.15.w1
- transformer.blocks.0.ffn.experts.mlp_experts.15.w1
- transformer.blocks.7.ffn.experts.mlp_experts.15.w1
- transformer.blocks.14.ffn.experts.mlp_experts.15.w1
- transformer.blocks.33.ffn.experts.mlp_experts.15.w1
- transformer.blocks.34.ffn.experts.mlp_experts.15.w1
- transformer.blocks.10.ffn.experts.mlp_experts.15.w1
- transformer.blocks.5.ffn.experts.mlp_experts.15.w1
- transformer.blocks.29.ffn.experts.mlp_experts.15.w1
# ffn.experts.mlp_experts.15.w2 layers
- transformer.blocks.28.ffn.experts.mlp_experts.15.w2
- transformer.blocks.26.ffn.experts.mlp_experts.15.w2
- transformer.blocks.27.ffn.experts.mlp_experts.15.w2
- transformer.blocks.29.ffn.experts.mlp_experts.15.w2
- transformer.blocks.6.ffn.experts.mlp_experts.15.w2
- transformer.blocks.31.ffn.experts.mlp_experts.15.w2
- transformer.blocks.7.ffn.experts.mlp_experts.15.w2
- transformer.blocks.33.ffn.experts.mlp_experts.15.w2
- transformer.blocks.32.ffn.experts.mlp_experts.15.w2
- transformer.blocks.25.ffn.experts.mlp_experts.15.w2
# ffn.experts.mlp_experts.2.v1 layers
- transformer.blocks.31.ffn.experts.mlp_experts.2.v1
- transformer.blocks.27.ffn.experts.mlp_experts.2.v1
- transformer.blocks.28.ffn.experts.mlp_experts.2.v1
- transformer.blocks.30.ffn.experts.mlp_experts.2.v1
- transformer.blocks.23.ffn.experts.mlp_experts.2.v1
- transformer.blocks.32.ffn.experts.mlp_experts.2.v1
- transformer.blocks.35.ffn.experts.mlp_experts.2.v1
- transformer.blocks.7.ffn.experts.mlp_experts.2.v1
- transformer.blocks.21.ffn.experts.mlp_experts.2.v1
- transformer.blocks.15.ffn.experts.mlp_experts.2.v1
# ffn.experts.mlp_experts.2.w1 layers
- transformer.blocks.7.ffn.experts.mlp_experts.2.w1
- transformer.blocks.6.ffn.experts.mlp_experts.2.w1
- transformer.blocks.1.ffn.experts.mlp_experts.2.w1
- transformer.blocks.4.ffn.experts.mlp_experts.2.w1
- transformer.blocks.5.ffn.experts.mlp_experts.2.w1
- transformer.blocks.29.ffn.experts.mlp_experts.2.w1
- transformer.blocks.0.ffn.experts.mlp_experts.2.w1
- transformer.blocks.9.ffn.experts.mlp_experts.2.w1
- transformer.blocks.31.ffn.experts.mlp_experts.2.w1
- transformer.blocks.30.ffn.experts.mlp_experts.2.w1
# ffn.experts.mlp_experts.2.w2 layers
- transformer.blocks.26.ffn.experts.mlp_experts.2.w2
- transformer.blocks.27.ffn.experts.mlp_experts.2.w2
- transformer.blocks.33.ffn.experts.mlp_experts.2.w2
- transformer.blocks.5.ffn.experts.mlp_experts.2.w2
- transformer.blocks.23.ffn.experts.mlp_experts.2.w2
- transformer.blocks.32.ffn.experts.mlp_experts.2.w2
- transformer.blocks.28.ffn.experts.mlp_experts.2.w2
- transformer.blocks.4.ffn.experts.mlp_experts.2.w2
- transformer.blocks.29.ffn.experts.mlp_experts.2.w2
- transformer.blocks.30.ffn.experts.mlp_experts.2.w2
# ffn.experts.mlp_experts.3.v1 layers
- transformer.blocks.28.ffn.experts.mlp_experts.3.v1
- transformer.blocks.33.ffn.experts.mlp_experts.3.v1
- transformer.blocks.36.ffn.experts.mlp_experts.3.v1
- transformer.blocks.29.ffn.experts.mlp_experts.3.v1
- transformer.blocks.30.ffn.experts.mlp_experts.3.v1
- transformer.blocks.7.ffn.experts.mlp_experts.3.v1
- transformer.blocks.14.ffn.experts.mlp_experts.3.v1
- transformer.blocks.10.ffn.experts.mlp_experts.3.v1
- transformer.blocks.31.ffn.experts.mlp_experts.3.v1
- transformer.blocks.21.ffn.experts.mlp_experts.3.v1
# ffn.experts.mlp_experts.3.w1 layers
- transformer.blocks.7.ffn.experts.mlp_experts.3.w1
- transformer.blocks.0.ffn.experts.mlp_experts.3.w1
- transformer.blocks.10.ffn.experts.mlp_experts.3.w1
- transformer.blocks.9.ffn.experts.mlp_experts.3.w1
- transformer.blocks.29.ffn.experts.mlp_experts.3.w1
- transformer.blocks.5.ffn.experts.mlp_experts.3.w1
- transformer.blocks.30.ffn.experts.mlp_experts.3.w1
- transformer.blocks.4.ffn.experts.mlp_experts.3.w1
- transformer.blocks.33.ffn.experts.mlp_experts.3.w1
- transformer.blocks.1.ffn.experts.mlp_experts.3.w1
# ffn.experts.mlp_experts.3.w2 layers
- transformer.blocks.28.ffn.experts.mlp_experts.3.w2
- transformer.blocks.5.ffn.experts.mlp_experts.3.w2
- transformer.blocks.24.ffn.experts.mlp_experts.3.w2
- transformer.blocks.31.ffn.experts.mlp_experts.3.w2
- transformer.blocks.30.ffn.experts.mlp_experts.3.w2
- transformer.blocks.21.ffn.experts.mlp_experts.3.w2
- transformer.blocks.32.ffn.experts.mlp_experts.3.w2
- transformer.blocks.29.ffn.experts.mlp_experts.3.w2
- transformer.blocks.26.ffn.experts.mlp_experts.3.w2
- transformer.blocks.2.ffn.experts.mlp_experts.3.w2
# ffn.experts.mlp_experts.4.v1 layers
- transformer.blocks.34.ffn.experts.mlp_experts.4.v1
- transformer.blocks.31.ffn.experts.mlp_experts.4.v1
- transformer.blocks.26.ffn.experts.mlp_experts.4.v1
- transformer.blocks.24.ffn.experts.mlp_experts.4.v1
- transformer.blocks.14.ffn.experts.mlp_experts.4.v1
- transformer.blocks.32.ffn.experts.mlp_experts.4.v1
- transformer.blocks.7.ffn.experts.mlp_experts.4.v1
- transformer.blocks.6.ffn.experts.mlp_experts.4.v1
- transformer.blocks.20.ffn.experts.mlp_experts.4.v1
- transformer.blocks.9.ffn.experts.mlp_experts.4.v1
# ffn.experts.mlp_experts.4.w1 layers
- transformer.blocks.6.ffn.experts.mlp_experts.4.w1
- transformer.blocks.4.ffn.experts.mlp_experts.4.w1
- transformer.blocks.7.ffn.experts.mlp_experts.4.w1
- transformer.blocks.9.ffn.experts.mlp_experts.4.w1
- transformer.blocks.0.ffn.experts.mlp_experts.4.w1
- transformer.blocks.5.ffn.experts.mlp_experts.4.w1
- transformer.blocks.14.ffn.experts.mlp_experts.4.w1
- transformer.blocks.34.ffn.experts.mlp_experts.4.w1
- transformer.blocks.8.ffn.experts.mlp_experts.4.w1
- transformer.blocks.29.ffn.experts.mlp_experts.4.w1
# ffn.experts.mlp_experts.4.w2 layers
- transformer.blocks.25.ffn.experts.mlp_experts.4.w2
- transformer.blocks.24.ffn.experts.mlp_experts.4.w2
- transformer.blocks.26.ffn.experts.mlp_experts.4.w2
- transformer.blocks.5.ffn.experts.mlp_experts.4.w2
- transformer.blocks.6.ffn.experts.mlp_experts.4.w2
- transformer.blocks.32.ffn.experts.mlp_experts.4.w2
- transformer.blocks.4.ffn.experts.mlp_experts.4.w2
- transformer.blocks.36.ffn.experts.mlp_experts.4.w2
- transformer.blocks.29.ffn.experts.mlp_experts.4.w2
- transformer.blocks.27.ffn.experts.mlp_experts.4.w2
# ffn.experts.mlp_experts.5.v1 layers
- transformer.blocks.35.ffn.experts.mlp_experts.5.v1
- transformer.blocks.30.ffn.experts.mlp_experts.5.v1
- transformer.blocks.28.ffn.experts.mlp_experts.5.v1
- transformer.blocks.32.ffn.experts.mlp_experts.5.v1
- transformer.blocks.27.ffn.experts.mlp_experts.5.v1
- transformer.blocks.26.ffn.experts.mlp_experts.5.v1
- transformer.blocks.33.ffn.experts.mlp_experts.5.v1
- transformer.blocks.29.ffn.experts.mlp_experts.5.v1
- transformer.blocks.8.ffn.experts.mlp_experts.5.v1
- transformer.blocks.7.ffn.experts.mlp_experts.5.v1
# ffn.experts.mlp_experts.5.w1 layers
- transformer.blocks.0.ffn.experts.mlp_experts.5.w1
- transformer.blocks.6.ffn.experts.mlp_experts.5.w1
- transformer.blocks.7.ffn.experts.mlp_experts.5.w1
- transformer.blocks.9.ffn.experts.mlp_experts.5.w1
- transformer.blocks.8.ffn.experts.mlp_experts.5.w1
- transformer.blocks.12.ffn.experts.mlp_experts.5.w1
- transformer.blocks.3.ffn.experts.mlp_experts.5.w1
- transformer.blocks.5.ffn.experts.mlp_experts.5.w1
- transformer.blocks.4.ffn.experts.mlp_experts.5.w1
- transformer.blocks.33.ffn.experts.mlp_experts.5.w1
# ffn.experts.mlp_experts.5.w2 layers
- transformer.blocks.26.ffn.experts.mlp_experts.5.w2
- transformer.blocks.28.ffn.experts.mlp_experts.5.w2
- transformer.blocks.6.ffn.experts.mlp_experts.5.w2
- transformer.blocks.33.ffn.experts.mlp_experts.5.w2
- transformer.blocks.5.ffn.experts.mlp_experts.5.w2
- transformer.blocks.27.ffn.experts.mlp_experts.5.w2
- transformer.blocks.3.ffn.experts.mlp_experts.5.w2
- transformer.blocks.29.ffn.experts.mlp_experts.5.w2
- transformer.blocks.25.ffn.experts.mlp_experts.5.w2
- transformer.blocks.7.ffn.experts.mlp_experts.5.w2
# ffn.experts.mlp_experts.6.v1 layers
- transformer.blocks.34.ffn.experts.mlp_experts.6.v1
- transformer.blocks.31.ffn.experts.mlp_experts.6.v1
- transformer.blocks.30.ffn.experts.mlp_experts.6.v1
- transformer.blocks.26.ffn.experts.mlp_experts.6.v1
- transformer.blocks.35.ffn.experts.mlp_experts.6.v1
- transformer.blocks.20.ffn.experts.mlp_experts.6.v1
- transformer.blocks.15.ffn.experts.mlp_experts.6.v1
- transformer.blocks.29.ffn.experts.mlp_experts.6.v1
- transformer.blocks.10.ffn.experts.mlp_experts.6.v1
- transformer.blocks.24.ffn.experts.mlp_experts.6.v1
# ffn.experts.mlp_experts.6.w1 layers
- transformer.blocks.0.ffn.experts.mlp_experts.6.w1
- transformer.blocks.10.ffn.experts.mlp_experts.6.w1
- transformer.blocks.9.ffn.experts.mlp_experts.6.w1
- transformer.blocks.30.ffn.experts.mlp_experts.6.w1
- transformer.blocks.4.ffn.experts.mlp_experts.6.w1
- transformer.blocks.34.ffn.experts.mlp_experts.6.w1
- transformer.blocks.26.ffn.experts.mlp_experts.6.w1
- transformer.blocks.2.ffn.experts.mlp_experts.6.w1
- transformer.blocks.29.ffn.experts.mlp_experts.6.w1
- transformer.blocks.8.ffn.experts.mlp_experts.6.w1
# ffn.experts.mlp_experts.6.w2 layers
- transformer.blocks.24.ffn.experts.mlp_experts.6.w2
- transformer.blocks.26.ffn.experts.mlp_experts.6.w2
- transformer.blocks.32.ffn.experts.mlp_experts.6.w2
- transformer.blocks.30.ffn.experts.mlp_experts.6.w2
- transformer.blocks.25.ffn.experts.mlp_experts.6.w2
- transformer.blocks.31.ffn.experts.mlp_experts.6.w2
- transformer.blocks.20.ffn.experts.mlp_experts.6.w2
- transformer.blocks.4.ffn.experts.mlp_experts.6.w2
- transformer.blocks.2.ffn.experts.mlp_experts.6.w2
- transformer.blocks.9.ffn.experts.mlp_experts.6.w2
# ffn.experts.mlp_experts.7.v1 layers
- transformer.blocks.27.ffn.experts.mlp_experts.7.v1
- transformer.blocks.28.ffn.experts.mlp_experts.7.v1
- transformer.blocks.33.ffn.experts.mlp_experts.7.v1
- transformer.blocks.29.ffn.experts.mlp_experts.7.v1
- transformer.blocks.24.ffn.experts.mlp_experts.7.v1
- transformer.blocks.11.ffn.experts.mlp_experts.7.v1
- transformer.blocks.12.ffn.experts.mlp_experts.7.v1
- transformer.blocks.10.ffn.experts.mlp_experts.7.v1
- transformer.blocks.23.ffn.experts.mlp_experts.7.v1
- transformer.blocks.34.ffn.experts.mlp_experts.7.v1
# ffn.experts.mlp_experts.7.w1 layers
- transformer.blocks.12.ffn.experts.mlp_experts.7.w1
- transformer.blocks.0.ffn.experts.mlp_experts.7.w1
- transformer.blocks.5.ffn.experts.mlp_experts.7.w1
- transformer.blocks.29.ffn.experts.mlp_experts.7.w1
- transformer.blocks.10.ffn.experts.mlp_experts.7.w1
- transformer.blocks.4.ffn.experts.mlp_experts.7.w1
- transformer.blocks.3.ffn.experts.mlp_experts.7.w1
- transformer.blocks.8.ffn.experts.mlp_experts.7.w1
- transformer.blocks.34.ffn.experts.mlp_experts.7.w1
- transformer.blocks.33.ffn.experts.mlp_experts.7.w1
# ffn.experts.mlp_experts.7.w2 layers
- transformer.blocks.23.ffn.experts.mlp_experts.7.w2
- transformer.blocks.24.ffn.experts.mlp_experts.7.w2
- transformer.blocks.31.ffn.experts.mlp_experts.7.w2
- transformer.blocks.28.ffn.experts.mlp_experts.7.w2
- transformer.blocks.27.ffn.experts.mlp_experts.7.w2
- transformer.blocks.5.ffn.experts.mlp_experts.7.w2
- transformer.blocks.25.ffn.experts.mlp_experts.7.w2
- transformer.blocks.29.ffn.experts.mlp_experts.7.w2
- transformer.blocks.3.ffn.experts.mlp_experts.7.w2
- transformer.blocks.33.ffn.experts.mlp_experts.7.w2
# ffn.experts.mlp_experts.8.v1 layers
- transformer.blocks.30.ffn.experts.mlp_experts.8.v1
- transformer.blocks.27.ffn.experts.mlp_experts.8.v1
- transformer.blocks.20.ffn.experts.mlp_experts.8.v1
- transformer.blocks.32.ffn.experts.mlp_experts.8.v1
- transformer.blocks.34.ffn.experts.mlp_experts.8.v1
- transformer.blocks.33.ffn.experts.mlp_experts.8.v1
- transformer.blocks.9.ffn.experts.mlp_experts.8.v1
- transformer.blocks.7.ffn.experts.mlp_experts.8.v1
- transformer.blocks.6.ffn.experts.mlp_experts.8.v1
- transformer.blocks.24.ffn.experts.mlp_experts.8.v1
# ffn.experts.mlp_experts.8.w1 layers
- transformer.blocks.7.ffn.experts.mlp_experts.8.w1
- transformer.blocks.6.ffn.experts.mlp_experts.8.w1
- transformer.blocks.0.ffn.experts.mlp_experts.8.w1
- transformer.blocks.9.ffn.experts.mlp_experts.8.w1
- transformer.blocks.3.ffn.experts.mlp_experts.8.w1
- transformer.blocks.2.ffn.experts.mlp_experts.8.w1
- transformer.blocks.8.ffn.experts.mlp_experts.8.w1
- transformer.blocks.30.ffn.experts.mlp_experts.8.w1
- transformer.blocks.24.ffn.experts.mlp_experts.8.w1
- transformer.blocks.1.ffn.experts.mlp_experts.8.w1
# ffn.experts.mlp_experts.8.w2 layers
- transformer.blocks.32.ffn.experts.mlp_experts.8.w2
- transformer.blocks.24.ffn.experts.mlp_experts.8.w2
- transformer.blocks.27.ffn.experts.mlp_experts.8.w2
- transformer.blocks.30.ffn.experts.mlp_experts.8.w2
- transformer.blocks.31.ffn.experts.mlp_experts.8.w2
- transformer.blocks.28.ffn.experts.mlp_experts.8.w2
- transformer.blocks.2.ffn.experts.mlp_experts.8.w2
- transformer.blocks.3.ffn.experts.mlp_experts.8.w2
- transformer.blocks.23.ffn.experts.mlp_experts.8.w2
- transformer.blocks.29.ffn.experts.mlp_experts.8.w2
# ffn.experts.mlp_experts.9.v1 layers
- transformer.blocks.31.ffn.experts.mlp_experts.9.v1
- transformer.blocks.27.ffn.experts.mlp_experts.9.v1
- transformer.blocks.29.ffn.experts.mlp_experts.9.v1
- transformer.blocks.33.ffn.experts.mlp_experts.9.v1
- transformer.blocks.25.ffn.experts.mlp_experts.9.v1
- transformer.blocks.14.ffn.experts.mlp_experts.9.v1
- transformer.blocks.32.ffn.experts.mlp_experts.9.v1
- transformer.blocks.7.ffn.experts.mlp_experts.9.v1
- transformer.blocks.9.ffn.experts.mlp_experts.9.v1
- transformer.blocks.34.ffn.experts.mlp_experts.9.v1
# ffn.experts.mlp_experts.9.w1 layers
- transformer.blocks.7.ffn.experts.mlp_experts.9.w1
- transformer.blocks.1.ffn.experts.mlp_experts.9.w1
- transformer.blocks.9.ffn.experts.mlp_experts.9.w1
- transformer.blocks.2.ffn.experts.mlp_experts.9.w1
- transformer.blocks.27.ffn.experts.mlp_experts.9.w1
- transformer.blocks.12.ffn.experts.mlp_experts.9.w1
- transformer.blocks.4.ffn.experts.mlp_experts.9.w1
- transformer.blocks.6.ffn.experts.mlp_experts.9.w1
- transformer.blocks.19.ffn.experts.mlp_experts.9.w1
- transformer.blocks.8.ffn.experts.mlp_experts.9.w1
# ffn.experts.mlp_experts.9.w2 layers
- transformer.blocks.26.ffn.experts.mlp_experts.9.w2
- transformer.blocks.25.ffn.experts.mlp_experts.9.w2
- transformer.blocks.28.ffn.experts.mlp_experts.9.w2
- transformer.blocks.27.ffn.experts.mlp_experts.9.w2
- transformer.blocks.31.ffn.experts.mlp_experts.9.w2
- transformer.blocks.29.ffn.experts.mlp_experts.9.w2
- transformer.blocks.7.ffn.experts.mlp_experts.9.w2
- transformer.blocks.34.ffn.experts.mlp_experts.9.w2
- transformer.blocks.2.ffn.experts.mlp_experts.9.w2
- transformer.blocks.33.ffn.experts.mlp_experts.9.w2
# ffn.router.layer layers
- transformer.blocks.2.ffn.router.layer
- transformer.blocks.3.ffn.router.layer
- transformer.blocks.4.ffn.router.layer
- transformer.blocks.5.ffn.router.layer
- transformer.blocks.6.ffn.router.layer
- transformer.blocks.7.ffn.router.layer
- transformer.blocks.8.ffn.router.layer
- transformer.blocks.9.ffn.router.layer
- transformer.blocks.10.ffn.router.layer
- transformer.blocks.11.ffn.router.layer
# norm_attn_norm.attn.Wqkv layers
- transformer.blocks.16.norm_attn_norm.attn.Wqkv
- transformer.blocks.15.norm_attn_norm.attn.Wqkv
- transformer.blocks.11.norm_attn_norm.attn.Wqkv
- transformer.blocks.14.norm_attn_norm.attn.Wqkv
- transformer.blocks.12.norm_attn_norm.attn.Wqkv
- transformer.blocks.20.norm_attn_norm.attn.Wqkv
- transformer.blocks.10.norm_attn_norm.attn.Wqkv
- transformer.blocks.9.norm_attn_norm.attn.Wqkv
- transformer.blocks.19.norm_attn_norm.attn.Wqkv
- transformer.blocks.18.norm_attn_norm.attn.Wqkv
# norm_attn_norm.attn.out_proj layers
- transformer.blocks.1.norm_attn_norm.attn.out_proj
- transformer.blocks.18.norm_attn_norm.attn.out_proj
- transformer.blocks.2.norm_attn_norm.attn.out_proj
- transformer.blocks.16.norm_attn_norm.attn.out_proj
- transformer.blocks.0.norm_attn_norm.attn.out_proj
- transformer.blocks.39.norm_attn_norm.attn.out_proj
- transformer.blocks.23.norm_attn_norm.attn.out_proj
- transformer.blocks.8.norm_attn_norm.attn.out_proj
- transformer.blocks.24.norm_attn_norm.attn.out_proj
- transformer.blocks.19.norm_attn_norm.attn.out_proj
# norm_attn_norm.norm_1 layers
- transformer.blocks.0.norm_attn_norm.norm_1
- transformer.blocks.1.norm_attn_norm.norm_1
- transformer.blocks.2.norm_attn_norm.norm_1
- transformer.blocks.3.norm_attn_norm.norm_1
- transformer.blocks.4.norm_attn_norm.norm_1
- transformer.blocks.5.norm_attn_norm.norm_1
- transformer.blocks.6.norm_attn_norm.norm_1
- transformer.blocks.7.norm_attn_norm.norm_1
- transformer.blocks.8.norm_attn_norm.norm_1
- transformer.blocks.9.norm_attn_norm.norm_1
# norm_attn_norm.norm_2 layers
- transformer.blocks.0.norm_attn_norm.norm_2
- transformer.blocks.1.norm_attn_norm.norm_2
- transformer.blocks.2.norm_attn_norm.norm_2
- transformer.blocks.3.norm_attn_norm.norm_2
- transformer.blocks.4.norm_attn_norm.norm_2
- transformer.blocks.5.norm_attn_norm.norm_2
- transformer.blocks.6.norm_attn_norm.norm_2
- transformer.blocks.7.norm_attn_norm.norm_2
- transformer.blocks.8.norm_attn_norm.norm_2
- transformer.blocks.9.norm_attn_norm.norm_2
# transformer.norm_f layers
# transformer.wte layers
# ffn.experts.mlp_experts.11.v1 layers
- transformer.blocks.29.ffn.experts.mlp_experts.11.v1
- transformer.blocks.27.ffn.experts.mlp_experts.11.v1
- transformer.blocks.30.ffn.experts.mlp_experts.11.v1
- transformer.blocks.28.ffn.experts.mlp_experts.11.v1
-
transformer.blocks.22.ffn.experts.mlp_experts.11.v1 - transformer.blocks.7.ffn.experts.mlp_experts.11.v1 - transformer.blocks.24.ffn.experts.mlp_experts.11.v1 - transformer.blocks.8.ffn.experts.mlp_experts.11.v1 - transformer.blocks.6.ffn.experts.mlp_experts.11.v1 - transformer.blocks.12.ffn.experts.mlp_experts.11.v1 dataset_prepared_path: dbrx2 val_set_size: 0.01 output_dir: ./out sequence_len: 4096 sample_packing: true pad_to_sequence_len: true wandb_project: dolphin-2.9-Dbrx wandb_watch: wandb_run_id: wandb_log_model: gradient_accumulation_steps: 8 micro_batch_size: 1 num_epochs: 1 optimizer: paged_adamw_8bit lr_scheduler: cosine learning_rate: 1e-5 train_on_inputs: false group_by_length: false bf16: auto fp16: tf32: true gradient_checkpointing: true gradient_checkpointing_kwargs: use_reentrant: false early_stopping_patience: # resume_from_checkpoint: /workspace/axolotl/dbrx-checkpoint logging_steps: 1 xformers_attention: flash_attention: true warmup_steps: 10 evals_per_epoch: 4 eval_table_size: saves_per_epoch: 4 save_total_limit: 2 save_steps: debug: deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_params.json weight_decay: 0.05 fsdp: fsdp_config: special_tokens: bos_token: "<|endoftext|>" eos_token: "<|im_end|>" pad_token: "<|pad|>" unk_token: "<|endoftext|>" tokens: - "<|im_start|>" - "<|im_end|>" ```
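
The list of layer names in the config above selects which weights are trained; all other parameters stay frozen. Below is a rough sketch of that selection logic, assuming each entry is applied as a regular expression matched from the start of the parameter name. The helper `select_trainable` and the sample parameter names are hypothetical and are not taken from axolotl or from the DBRX checkpoint.

```python
import re


def select_trainable(param_names, unfrozen_patterns):
    """Return the parameter names that match at least one unfrozen pattern.

    Assumption: patterns are regexes anchored at the start of the name,
    so an unescaped "." matches any single character.
    """
    compiled = [re.compile(pattern) for pattern in unfrozen_patterns]
    return [name for name in param_names if any(c.match(name) for c in compiled)]


# Hypothetical parameter names written in the style of the config entries.
param_names = [
    "transformer.blocks.2.ffn.router.layer.weight",
    "transformer.blocks.20.ffn.router.layer.weight",
    "transformer.blocks.0.norm_attn_norm.norm_1.weight",
    "transformer.blocks.39.norm_attn_norm.norm_1.weight",
]

# Two entries copied from the config above.
unfrozen = [
    "transformer.blocks.2.ffn.router.layer",
    "transformer.blocks.0.norm_attn_norm.norm_1",
]

trainable = select_trainable(param_names, unfrozen)
for name in param_names:
    status = "trainable" if name in trainable else "frozen"
    print(f"{status:9s} {name}")
```

In a real training loop the same test would decide `requires_grad` for each named parameter; here it only labels names, so it runs without loading the model.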

# out

This model is a fine-tuned version of [databricks/dbrx-base](https://huggingface.co/databricks/dbrx-base) on the datasets listed in the model card metadata.
It achieves the following results on the evaluation set:
- Loss: 0.4336

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- total_eval_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.4009        | 0.0   | 1    | 0.4328          |
| 0.413         | 0.25  | 587  | 0.4408          |
| 0.3626        | 0.5   | 1174 | 0.4368          |
| 0.3896        | 0.75  | 1761 | 0.4336          |

### Framework versions

- Transformers 4.40.0.dev0
- Pytorch 2.2.2+cu121
- Datasets 2.15.0
- Tokenizers 0.15.0
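
The `total_train_batch_size` of 64 reported under the training hyperparameters is the product of the per-device batch size, the gradient accumulation steps, and the number of devices. The short sketch below reproduces that arithmetic and derives a rough upper bound on tokens seen per optimizer step and per epoch. The token figures are estimates that assume every packed sequence is completely full at `sequence_len` 4096, and the steps-per-epoch value is extrapolated from the step count logged at epoch 0.25.

```python
# Values taken from the config and the training hyperparameters above.
micro_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 8
sequence_len = 4096

# Sequences consumed per optimizer step across all devices.
total_train_batch_size = (
    micro_batch_size * gradient_accumulation_steps * num_devices
)

# Upper bound on tokens per step (assumes fully packed sequences).
max_tokens_per_step = total_train_batch_size * sequence_len

# Rough extrapolation: step 587 was logged at epoch 0.25.
approx_steps_per_epoch = 587 * 4
approx_max_tokens_per_epoch = approx_steps_per_epoch * max_tokens_per_step

print("total_train_batch_size:", total_train_batch_size)
print("max tokens per step:", max_tokens_per_step)
print("approx steps per epoch:", approx_steps_per_epoch)
print("approx max tokens per epoch:", approx_max_tokens_per_epoch)
```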