File size: 3,462 Bytes
f8da8cb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
---
license: other
license_name: tongyi-qianwen-license-agreement
license_link: >-
https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
datasets:
- Crystalcareai/MoD
---
# Please note this is the model that accompanies the dataset; https://huggingface.co/datasets/Crystalcareai/MoD. The readme is the same for both, with more detail below
## Hey, I'm Lucas
I'm excited to share an early release of a project that has kept me busy for the last couple of weeks. Mixtral's release propelled me into a deep dive into MoEs. This led to my first experiments with post-training, starting with fine tuning using monsterapi around the middle of December, and later transitioning to axolotl as I got more comfortable with command lines and terminals.
With the release of Qwen1.5, I was curious to see how it would compare to Mixtral. Thanks to lazymergekit, which simplifies the process for newcomers, I was able to give Qwen1.5-7B a unique twist.
Coming from a background as an acting teacher and coach, I saw parallels between high-quality scripts' impact on performances and the importance of curating high-quality data for training models. This led me to explore data curation, especially for training Mixture of Experts (MoE) models. I looked into Teknium's OpenHermes dataset, Jon Durbin's collections on GitHub, and Eric Hartford's methods for achieving specific outcomes with models.
I curated a dataset, named Mixture of Data (MoD), from various sources, including Bagel, OpenHermes, and many more, totaling about 780,000 distinct ShareGPT conversations. This dataset aims to encourage MoE models to develop their own distinct experts.
After training Qwen1.5-7b on 100k random samples from MoD over four epochs and merging the fine-tuned model 8x, I used an approach utilizing a random gate, without specialized fine-tuning done to any of the 8 experts. The result was a model that initially made no sense, lacking a base model and clear guidance on expert usage.
Despite challenges, such as training interruptions via cuda errors with Runpod , the model showed promising adaptability to the rest of the MoD dataset, even with limited training (0.45/4 planned epochs were completed before my compute budget ran out). While I haven't been able to benchmark it fully (I will when I can get this runpod situation sorted) it appears to perform comparably to Mixtral in (admittedly naive) preliminary reasoning tests.
These weeks have been incredibly rewarding and educational, thanks to the contributions of Jon Durbin, Maxime Labonne, Teknium, Eric Hartford, and Charles Goddard. Their work has made these technologies accessible and inspired my project. A special thank you to Teknium and Eric Hartford, who have been generous with their time - answering my questions with kindness and humility.
I’m hoping to receive compensation from Runpod for the interruptions (and the resulting LARGE amount of wasted $$$), and will complete the full fine-tuning and report the results here. I hope the MoD dataset and Qwen1.5-8x7b model will be valuable to the community and encourage further exploration with these architectures.
I am fully committed to this field and plan to continue developing models (eventually as a career). ML is fascinating, and I look forward to contributing to its advancement, however big or small.
Thank you for your interest and support. Let's push the boundaries of what's possible together.
Lucas
|