- A Survey on Data Selection for LLM Instruction Tuning
  Paper • 2402.05123 • Published • 3
- Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development
  Paper • 2407.11784 • Published • 4
- Data Management For Large Language Models: A Survey
  Paper • 2312.01700 • Published
- Datasets for Large Language Models: A Comprehensive Survey
  Paper • 2402.18041 • Published • 2
Collections including paper arxiv:2402.18041
- M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
  Paper • 2306.04387 • Published • 8
- Datasets for Large Language Models: A Comprehensive Survey
  Paper • 2402.18041 • Published • 2
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
  Paper • 2306.06687 • Published • 1
- Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents
  Paper • 2201.04236 • Published

- Effective pruning of web-scale datasets based on complexity of concept clusters
  Paper • 2401.04578 • Published
- How to Train Data-Efficient LLMs
  Paper • 2402.09668 • Published • 40
- A Survey on Data Selection for LLM Instruction Tuning
  Paper • 2402.05123 • Published • 3
- LESS: Selecting Influential Data for Targeted Instruction Tuning
  Paper • 2402.04333 • Published • 3