Update README.md
Browse files
README.md
CHANGED
@@ -27,7 +27,46 @@ The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned
|
|
27 |
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.
|
28 |
|
29 |
### Technical Details:
|
30 |
-
The model leverages OpenVINO Ir format for inference, with a tokenizer optimized for seamless text input processing.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
31 |
|
32 |
## Prompt format
|
33 |
|
|
|
27 |
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.
|
28 |
|
29 |
### Technical Details:
|
30 |
+
The model leverages OpenVINO Ir format for inference, with a tokenizer optimized for seamless text input processing.
|
31 |
+
|
32 |
+
#### Dataset description
|
33 |
+
This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build [SmolLM2-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) family of models and contains 1M samples.
|
34 |
+
|
35 |
+
During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasets that improve instruction following while covering diverse tasks including text editing, rewriting, summarization, and reasoning.
|
36 |
+
Through a series of data ablations at 1.7B scale, we enhanced our SFT mix by incorporating public datasets to strengthen specific capabilities such as mathematics, coding, system prompt following and long-context understanding.
|
37 |
+
|
38 |
+
All the new datasets were generated with [distilabel](https://github.com/argilla-io/distilabel) and you can find the generation code here https://github.com/huggingface/smollm/tree/main/distilabel_pipelines.
|
39 |
+
|
40 |
+
#### Dataset composition
|
41 |
+
The mix consists of:
|
42 |
+
|
43 |
+
**New datasets**
|
44 |
+
- *Smol-Magpie-Ultra*: the core component of our mix, consisting of 400K samples generated using the Magpie pipeline with /Llama-3.1-405B-Instruct. We also heavily curate and filter this dataset compared to the original Magpie-Pro pipeline. SmolLM models trained on this dataset alone outperform those trained on popular public datasets like OpenHermes and Magpie Pro across key benchmarks including IFEval and MT-Bench.
|
45 |
+
- Smol-contraints: a 36K-sample dataset that trains models to follow specific constraints, such as generating responses with a fixed number of sentences or words, or incorporating specified words in the output. The dataset has been decontaminated against IFEval to prevent overlap.
|
46 |
+
- Smol-rewrite: an 50k-sample collection focused on text rewriting tasks, such as adjusting tone to be more friendly or professional. Note that Smol-Magpie-Ultra also includes some rewriting, editing, and summarization examples.
|
47 |
+
- Smol-summarize: an 100k-sample dataset specialized in email and news summarization.
|
48 |
+
|
49 |
+
**Existing public datasets**
|
50 |
+
To enhance capabilities in mathematics, coding, system prompts, and long-context understanding, we fine-tuned SmolLM2-1.7B on various public SFT datasets and included subsets of the best performing ones using tuned ratios. These include:
|
51 |
+
|
52 |
+
- OpenHermes2.5: we added 100k samples from [OpenHermes2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5), since we found that it helps preserve and boost benchmarks such as MMLU and WinoGrande, and BBH.
|
53 |
+
- MetaMathQA: we add this [dataset](https://huggingface.co/datasets/meta-math/MetaMathQA?) to improve the model on mathematics and reasoning, we include 50k random samples.
|
54 |
+
- NuminaMath-CoT: we find that this [dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) helps on mathematics, especially hard problems found in benchmarks such as MATH.
|
55 |
+
- Self-Oss-Starcoder2-Instruct: we use this [dataset](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) to improve coding capabilities.
|
56 |
+
- SystemChats2.0: to make the model support a variety of system prompt formats we add 30k samples from the [SystemChat-2.0](https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0) dataset. Note that Smol-rewrite and and Smol-summarize datasets also include system prompts.
|
57 |
+
- LongAlign: we find that finetuning the model on only short samples makes it loose long context abilities beyond 2048 tokens, so we add english samples (with less than 16k tokens) from the [LongAlign-10k](https://huggingface.co/datasets/THUDM/LongAlign-10k) dataset and train with a 8192 sequence.
|
58 |
+
- Everyday-conversations: this [dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k) includes multi-turn everyday conversations such as greeting and was used in SmolLM v1 post-training.
|
59 |
+
- APIGen-Function-Calling: we use 80k samples from [apigen-function-calling](https://huggingface.co/datasets/argilla/apigen-function-calling) which is a mix of [Synth-APIGen-v0.1](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1) and [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) datasets.
|
60 |
+
- Explore-Instruct-Rewriting: 30k samples from this rewriting [dataset](https://huggingface.co/datasets/Wanfq/Explore_Instruct_Rewriting_32k).
|
61 |
+
|
62 |
+
|
63 |
+
You can find the code for generating the new datasets with [distilabel](https://github.com/argilla-io/distilabel) here: https://github.com/huggingface/smollm. The ablation details will be included in an upcoming blog post.
|
64 |
+
|
65 |
+
#### License
|
66 |
+
|
67 |
+
All the new datasets (Smol-Magpie-Ultra, Smol-contraints, Smol-rewrite, Smol-summarize) are licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). For the existing public datasets, please refer to the original dataset for the license [Dataset composition](#dataset-composition)
|
68 |
+
|
69 |
+
---
|
70 |
|
71 |
## Prompt format
|
72 |
|