floom's Collections

Paper • arXiv:2402.03570 • 7 upvotes

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
Paper • arXiv:2401.16335 • 1 upvote

Towards Efficient and Exact Optimization of Language Model Alignment
Paper • arXiv:2402.00856

ODIN: Disentangled Reward Mitigates Hacking in RLHF
Paper • arXiv:2402.07319 • 13 upvotes

Preference-free Alignment Learning with Regularized Relevance Reward
Paper • arXiv:2402.03469

Teaching Large Language Models to Reason with Reinforcement Learning
Paper • arXiv:2403.04642 • 46 upvotes

RewardBench: Evaluating Reward Models for Language Modeling
Paper • arXiv:2403.13787 • 21 upvotes

PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper • arXiv:2403.10704 • 57 upvotes

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Paper • arXiv:2403.03950 • 13 upvotes

In deep reinforcement learning, a pruned network is a good network
Paper • arXiv:2402.12479 • 18 upvotes

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • arXiv:2404.03715 • 60 upvotes

Learn Your Reference Model for Real Good Alignment
Paper • arXiv:2404.09656 • 82 upvotes

Offline Regularised Reinforcement Learning for Large Language Models Alignment
Paper • arXiv:2405.19107 • 14 upvotes

Self-Improving Robust Preference Optimization
Paper • arXiv:2406.01660 • 18 upvotes

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
Paper • arXiv:2406.08657 • 9 upvotes

BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
Paper • arXiv:2406.12168 • 7 upvotes

THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation
Paper • arXiv:2406.10996 • 32 upvotes

WPO: Enhancing RLHF with Weighted Preference Optimization
Paper • arXiv:2406.11827 • 14 upvotes

Understanding and Diagnosing Deep Reinforcement Learning
Paper • arXiv:2406.16979 • 9 upvotes

Gradient Boosting Reinforcement Learning
Paper • arXiv:2407.08250 • 10 upvotes

Understanding Reference Policies in Direct Preference Optimization
Paper • arXiv:2407.13709 • 16 upvotes

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
Paper • arXiv:2410.18076 • 4 upvotes

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
Paper • arXiv:2411.18478 • 32 upvotes

A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
Paper • arXiv:2411.19477 • 5 upvotes