Loubna Ben Allal

loubnabnl

AI & ML interests

SmolLMs, ML for code, data

Organizations

Hugging Face, BigScience Workshop, BigScience Catalogue Data, BigScience Data, HuggingFaceBR4, Team 8, CodeParrot, BigCode, Hugging Face H4, CompVis Community, BigCode Data, LocalCodeLLMs, Need4Speed, Code Llama, Hugging Face TB Research, Hugging Face Smol Cluster, Nt3awnou, huggingPartyParis, Qwen, ZeroGPU Explorers, HF AFAIK, gg-hf, Nanotron Research, Women on Hugging Face, Hugging Face SMOL, HuggingFaceFW, bigcode nvidia, Social Post Explorers, Dev Mode Explorers, Cosmopedia Stories Collab, StarCoder2 Data, Data Agents, Argilla Warehouse, smol-explorers, swissai-hf-data, Hugging Face Science

Posts

Post
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk; a loading sketch follows this list)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
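
If you want a quick look at the SFT data, here's a minimal loading sketch; the "all" config name and the "messages" column are assumptions, so check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Minimal sketch: load the SmolTalk SFT dataset released with the toolkit.
# The "all" config name and "messages" column are assumptions; see the
# dataset card for the actual configs and schema.
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk[0]["messages"])  # chat turns: [{"role": ..., "content": ...}, ...]
```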

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
Post
🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu
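
You can explore it without downloading 1.3T tokens by streaming it from the Hub; a minimal sketch, where the "sample-10BT" subset name is an assumption (see the dataset card for available configs):

```python
from datasets import load_dataset

# Minimal sketch: stream FineWeb-Edu from the Hub instead of downloading it.
# The "sample-10BT" subset name is an assumption; check the dataset card.
fw_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)
print(next(iter(fw_edu))["text"][:200])
```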

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
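
To give a rough picture of the filtering step, here's a minimal sketch using the released classifier. The model ID, the 0-5 score scale, and the keep-threshold of 3 are assumptions, so double-check them against the classifier's model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Minimal sketch of classifier-based filtering. Assumptions: the released
# classifier lives at HuggingFaceFW/fineweb-edu-classifier, returns a single
# regression logit on a roughly 0-5 educational-quality scale, and documents
# scoring >= 3 are kept.
model_id = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def edu_score(text: str) -> float:
    # Score one document; truncation keeps inputs within the model's context.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "CLICK HERE for the best deals!!! Limited time offer!!!",
]
kept = [d for d in docs if edu_score(d) >= 3.0]
```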

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!