Anton Lozhkov

anton-l

AI & ML interests

Generative Models, Distributed Training, Photo and Video Enhancement

Articles

SmolLM - blazingly fast and remarkably powerful (Jul 16)

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models (Mar 20)

StarCoder2 and The Stack v2 (Feb 28)

Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and models trained on FineMath show considerable gains over those trained on other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
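
To make the filtering step concrete, here is a minimal sketch of score-and-threshold filtering. It is not the FineMath pipeline: `toy_math_score`, the keyword list, the 0-5 scale, and the threshold of 3 are all hypothetical stand-ins for the trained classifier described above, chosen only so the example runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical keyword list; the real pipeline uses a trained classifier instead.
MATH_HINTS = ("theorem", "proof", "equation", "solve", "=", "\\frac")


def toy_math_score(text: str) -> int:
    """Stand-in scorer: counts distinct math hints in the text, capped at 5."""
    lowered = text.lower()
    hits = sum(1 for hint in MATH_HINTS if hint in lowered)
    return min(hits, 5)


def filter_pages(
    pages: List[Dict[str, str]],
    score_fn: Callable[[str], int] = toy_math_score,
    threshold: int = 3,  # assumed cutoff, for illustration only
) -> List[Dict[str, str]]:
    """Keep only the pages whose score reaches the threshold."""
    return [page for page in pages if score_fn(page["text"]) >= threshold]


pages = [
    {"url": "a.example/1",
     "text": "Theorem: solve the equation x + 2 = 5. Proof: subtract 2."},
    {"url": "b.example/2",
     "text": "Top ten travel destinations for the summer."},
]

kept = filter_pages(pages)
print([page["url"] for page in kept])
```

In an iterative setup, the pages kept in one round could be used to find more candidate pages to score in the next round; that recall step is left out of this sketch.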

We ran a series of ablations measuring the performance of Llama-3.2-3B-Base after continued pre-training on FineMath, and observed notable gains over both the baseline model and the same model trained on other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We’re also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2