Collections
Discover the best community collections!
Collections including paper arxiv:2404.09656

- Transforming and Combining Rewards for Aligning Large Language Models
  Paper • 2402.00742 • Published • 12
- UltraFeedback: Boosting Language Models with High-quality Feedback
  Paper • 2310.01377 • Published • 5
- Learn Your Reference Model for Real Good Alignment
  Paper • 2404.09656 • Published • 83
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
  Paper • 2405.01535 • Published • 121

- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  Paper • 2305.18290 • Published • 53
- HyperCLOVA X Technical Report
  Paper • 2404.01954 • Published • 21
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization
  Paper • 2404.09956 • Published • 12
- Learn Your Reference Model for Real Good Alignment
  Paper • 2404.09656 • Published • 83

- Fine-Tuning Language Models from Human Preferences
  Paper • 1909.08593 • Published • 3
- Transforming and Combining Rewards for Aligning Large Language Models
  Paper • 2402.00742 • Published • 12
- Leverage the Average: an Analysis of KL Regularization in RL
  Paper • 2003.14089 • Published • 2
- Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
  Paper • 2404.01258 • Published • 12

- InternLM2 Technical Report
  Paper • 2403.17297 • Published • 31
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 41
- Learn Your Reference Model for Real Good Alignment
  Paper • 2404.09656 • Published • 83
- OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
  Paper • 2404.12195 • Published • 12

- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  Paper • 2305.18290 • Published • 53
- ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
  Paper • 2402.09320 • Published • 6
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 41
- Dueling RL: Reinforcement Learning with Trajectory Preferences
  Paper • 2111.04850 • Published • 2

- ORPO: Monolithic Preference Optimization without Reference Model
  Paper • 2403.07691 • Published • 64
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 41
- Teaching Large Language Models to Reason with Reinforcement Learning
  Paper • 2403.04642 • Published • 46
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 30

- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
  Paper • 2310.20587 • Published • 17
- SELF: Language-Driven Self-Evolution for Large Language Model
  Paper • 2310.00533 • Published • 2
- QLoRA: Efficient Finetuning of Quantized LLMs
  Paper • 2305.14314 • Published • 48
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
  Paper • 2309.14717 • Published • 44

- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
  Paper • 2403.03507 • Published • 184
- RAFT: Adapting Language Model to Domain Specific RAG
  Paper • 2403.10131 • Published • 69
- LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
  Paper • 2403.13372 • Published • 64
- InternLM2 Technical Report
  Paper • 2403.17297 • Published • 31