- GAIA: a benchmark for General AI Assistants (Paper • 2311.12983 • Published • 188)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (Paper • 2311.16502 • Published • 35)
- BLINK: Multimodal Large Language Models Can See but Not Perceive (Paper • 2404.12390 • Published • 25)
- RULER: What's the Real Context Size of Your Long-Context Language Models? (Paper • 2404.06654 • Published • 35)
Collections
Collections including paper arxiv:2409.16191
- CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation (Paper • 2401.01275 • Published • 1)
- Introducing v0.5 of the AI Safety Benchmark from MLCommons (Paper • 2404.12241 • Published • 11)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (Paper • 2405.01535 • Published • 121)
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (Paper • 2406.12624 • Published • 37)
- Self-Rewarding Language Models (Paper • 2401.10020 • Published • 147)
- Orion-14B: Open-source Multilingual Large Language Models (Paper • 2401.12246 • Published • 13)
- MambaByte: Token-free Selective State Space Model (Paper • 2401.13660 • Published • 54)
- MM-LLMs: Recent Advances in MultiModal Large Language Models (Paper • 2401.13601 • Published • 46)