Towards Best Practices for Open Datasets for LLM Training
Paper • 2501.08365
datatrove
for all things web-scale data preparation: https://github.com/huggingface/datatrove
nanotron
for lightweight 4D parallelism LLM training: https://github.com/huggingface/nanotron
lighteval
for in-training fast parallel LLM evaluations: https://github.com/huggingface/lighteval