Loubna Ben Allal

loubnabnl

AI & ML interests

SmolLMs, ML for code, data

Organizations

Hugging Face, BigScience Workshop, BigScience Catalogue Data, BigScience Data, HuggingFaceBR4, Team 8, CodeParrot, BigCode, Hugging Face H4, CompVis Community, BigCode Data, LocalCodeLLMs, Need4Speed, Code Llama, Hugging Face TB Research, Hugging Face Smol Cluster, Nt3awnou, huggingPartyParis, Qwen, ZeroGPU Explorers, HF AFAIK, gg-hf, Nanotron Research, Women on Hugging Face, Hugging Face SMOL, HuggingFaceFW, bigcode nvidia, Social Post Explorers, Dev Mode Explorers, Cosmopedia Stories Collab, StarCoder2 Data, Data Agents, Argilla Warehouse, smol-explorers, swissai-hf-data, Hugging Face Science

Posts

Post
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk; a loading sketch follows this list)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
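
If you want a quick look at the SFT data, here's a minimal loading sketch; the "all" config name and the "messages" column are assumptions, so check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Minimal sketch: load the SmolTalk SFT dataset released with the toolkit.
# The "all" config name and "messages" column are assumptions; see the
# dataset card for the actual configs and schema.
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk[0]["messages"])  # chat turns: [{"role": ..., "content": ...}, ...]
```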

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
Post
🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu
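
You can explore it without downloading 1.3T tokens by streaming it from the Hub; a minimal sketch, where the "sample-10BT" subset name is an assumption (see the dataset card for available configs):

```python
from datasets import load_dataset

# Minimal sketch: stream FineWeb-Edu from the Hub instead of downloading it.
# The "sample-10BT" subset name is an assumption; check the dataset card.
fw_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    streaming=True,
)
print(next(iter(fw_edu))["text"][:200])
```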

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
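
To give a rough picture of the filtering step, here's a minimal sketch using the released classifier. The model ID, the 0-5 score scale, and the keep-threshold of 3 are assumptions, so double-check them against the classifier's model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Minimal sketch of classifier-based filtering. Assumptions: the released
# classifier lives at HuggingFaceFW/fineweb-edu-classifier, returns a single
# regression logit on a roughly 0-5 educational-quality scale, and documents
# scoring >= 3 are kept.
model_id = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def edu_score(text: str) -> float:
    # Score one document; truncation keeps inputs within the model's context.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "CLICK HERE for the best deals!!! Limited time offer!!!",
]
kept = [d for d in docs if edu_score(d) >= 3.0]
```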

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!