HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Paper • 2412.21199 • Published 4 days ago • 9
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters Paper • 2408.03314 • Published Aug 6, 2024 • 53
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training Paper • 2411.15124 • Published Nov 22, 2024 • 57
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation Paper • 2410.23090 • Published Oct 30, 2024 • 54