scale-up? or full datasets list?
#11
by lucyknada · opened
the 1.7B already works great but sometimes falls short, I assume simply because of its size; is something in the 3B/4B range planned? Or could the full datasets list be released? Thanks!
Hi, that's on the roadmap. Regarding the datasets, we use a mix of FineWeb-Edu, DCLM and The Stack, plus new math and code datasets that we will release in the upcoming weeks with a tech report.
We released the SFT dataset here: https://huggingface.co/datasets/HuggingFaceTB/smoltalk
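For anyone landing here, a minimal sketch of loading it with the `datasets` library (the `"all"` config name and the `messages` column are assumptions based on common SFT dataset layouts; check the dataset card to confirm):

```python
from datasets import load_dataset

# Minimal sketch: load the SmolTalk SFT dataset from the Hub.
# The "all" config and the "messages" chat column are assumptions
# taken from typical SFT dataset layouts -- verify on the dataset card.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Each row should hold a chat-style list of {"role", "content"} turns.
print(ds[0]["messages"])
```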
amazing, thanks!
loubnabnl changed discussion status to closed