Rethinking FID: Towards a Better Evaluation Metric for Image Generation Paper • 2401.09603 • Published Nov 30, 2023 • 16
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models Paper • 2402.10524 • Published Feb 16, 2024 • 22
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Paper • 2402.14261 • Published Feb 22, 2024 • 10
RewardBench: Evaluating Reward Models for Language Modeling Paper • 2403.13787 • Published Mar 20, 2024 • 21
Evaluating Frontier Models for Dangerous Capabilities Paper • 2403.13793 • Published Mar 20, 2024 • 7