Data-efficient LLMs - a VoladorLuYu Collection

Data-efficient LLMs

Updated Jun 27, 2024

Dataset pruning for advancing the capabilities of LLMs

Effective pruning of web-scale datasets based on complexity of concept clusters
Paper • 2401.04578 • Published Jan 9, 2024

How to Train Data-Efficient LLMs
Paper • 2402.09668 • Published Feb 15, 2024 • 40

A Survey on Data Selection for LLM Instruction Tuning
Paper • 2402.05123 • Published Feb 4, 2024 • 3

LESS: Selecting Influential Data for Targeted Instruction Tuning
Paper • 2402.04333 • Published Feb 6, 2024 • 3

LongAlign: A Recipe for Long Context Alignment of Large Language Models
Paper • 2401.18058 • Published Jan 31, 2024 • 20

LongHeads: Multi-Head Attention is Secretly a Long Context Processor
Paper • 2402.10685 • Published Feb 16, 2024 • 1

DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper • 2402.10379 • Published Feb 16, 2024 • 30

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064 • Published Feb 20, 2024 • 47

Benchmarking Large Language Models on Controllable Generation under Diversified Instructions
Paper • 2401.00690 • Published Jan 1, 2024 • 1

Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models
Paper • 2402.11532 • Published Feb 18, 2024

Datasets for Large Language Models: A Comprehensive Survey
Paper • 2402.18041 • Published Feb 28, 2024 • 2

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond
Paper • 2402.17327 • Published Feb 27, 2024

Parallel Structures in Pre-training Data Yield In-Context Learning
Paper • 2402.12530 • Published Feb 19, 2024

A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published Feb 26, 2024 • 4

Rethinking Machine Unlearning for Large Language Models
Paper • 2402.08787 • Published Feb 13, 2024 • 3

Less is More: Data Value Estimation for Visual Instruction Tuning
Paper • 2403.09559 • Published Mar 14, 2024

Token Alignment via Character Matching for Subword Completion
Paper • 2403.08688 • Published Mar 13, 2024

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Paper • 2404.04125 • Published Apr 4, 2024 • 27

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Paper • 2403.16952 • Published Mar 25, 2024 • 1

Token Dropping for Efficient BERT Pretraining
Paper • 2203.13240 • Published Mar 24, 2022 • 2

Pre-training Small Base LMs with Fewer Tokens
Paper • 2404.08634 • Published Apr 12, 2024 • 34

WildChat: 1M ChatGPT Interaction Logs in the Wild
Paper • 2405.01470 • Published May 2, 2024 • 61

Data-Juicer: A One-Stop Data Processing System for Large Language Models
Paper • 2309.02033 • Published Sep 5, 2023 • 3

How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition
Paper • 2310.05492 • Published Oct 9, 2023 • 2

Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese
Paper • 2403.13638 • Published Mar 20, 2024

Revisiting Token Dropping Strategy in Efficient BERT Pretraining
Paper • 2305.15273 • Published May 24, 2023 • 1

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Paper • 2406.17557 • Published Jun 25, 2024 • 87