scale-up? or full datasets list?
#11
by lucyknada · opened
the 1.7B already works great but sometimes falls short, I assume simply because of its size; is something in the 3B/4B range planned? Or could the full datasets list be released? Thanks!
Hi, that's on the roadmap. Regarding the datasets, we use a mix of FineWeb-Edu, DCLM and The Stack, plus new math and code datasets that we will release in the upcoming weeks with a tech report.
We released the SFT dataset here: https://huggingface.co/datasets/HuggingFaceTB/smoltalk
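For anyone landing here, a minimal sketch of loading it with the `datasets` library (the `"all"` config name and the `messages` column are assumptions based on common SFT dataset layouts; check the dataset card to confirm):

```python
from datasets import load_dataset

# Minimal sketch: load the SmolTalk SFT dataset from the Hub.
# The "all" config and the "messages" chat column are assumptions
# taken from typical SFT dataset layouts -- verify on the dataset card.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Each row should hold a chat-style list of {"role", "content"} turns.
print(ds[0]["messages"])
```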
amazing, thanks!
loubnabnl changed discussion status to closed