Effective pruning of web-scale datasets based on complexity of concept clusters Paper • 2401.04578 • Published Jan 9, 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning Paper • 2402.04333 • Published Feb 6, 2024
LongAlign: A Recipe for Long Context Alignment of Large Language Models Paper • 2401.18058 • Published Jan 31, 2024
LongHeads: Multi-Head Attention is Secretly a Long Context Processor Paper • 2402.10685 • Published Feb 16, 2024
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Paper • 2402.10379 • Published Feb 16, 2024
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models Paper • 2402.13064 • Published Feb 20, 2024
Benchmarking Large Language Models on Controllable Generation under Diversified Instructions Paper • 2401.00690 • Published Jan 1, 2024
Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models Paper • 2402.11532 • Published Feb 18, 2024
Datasets for Large Language Models: A Comprehensive Survey Paper • 2402.18041 • Published Feb 28, 2024
Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond Paper • 2402.17327 • Published Feb 27, 2024
Parallel Structures in Pre-training Data Yield In-Context Learning Paper • 2402.12530 • Published Feb 19, 2024
Rethinking Machine Unlearning for Large Language Models Paper • 2402.08787 • Published Feb 13, 2024
Less is More: Data Value Estimation for Visual Instruction Tuning Paper • 2403.09559 • Published Mar 14, 2024
Token Alignment via Character Matching for Subword Completion Paper • 2403.08688 • Published Mar 13, 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance Paper • 2404.04125 • Published Apr 4, 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance Paper • 2403.16952 • Published Mar 25, 2024
Data-Juicer: A One-Stop Data Processing System for Large Language Models Paper • 2309.02033 • Published Sep 5, 2023
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition Paper • 2310.05492 • Published Oct 9, 2023
Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese Paper • 2403.13638 • Published Mar 20, 2024
Revisiting Token Dropping Strategy in Efficient BERT Pretraining Paper • 2305.15273 • Published May 24, 2023
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024