- KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval (Paper • 2310.15511 • Published • 4)
- HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models (Paper • 2310.14566 • Published • 25)
- SmartPlay: A Benchmark for LLMs as Intelligent Agents (Paper • 2310.01557 • Published • 12)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation (Paper • 2310.03214 • Published • 18)
Collections including paper arxiv:2311.12022
- Chain-of-Thought Reasoning Without Prompting (Paper • 2402.10200 • Published • 104)
- How to Train Data-Efficient LLMs (Paper • 2402.09668 • Published • 40)
- BitDelta: Your Fine-Tune May Only Be Worth One Bit (Paper • 2402.10193 • Published • 19)
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts (Paper • 2402.09727 • Published • 36)
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (Paper • 2401.03065 • Published • 11)
- Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (Paper • 2305.01210 • Published • 4)
- AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models (Paper • 2309.06495 • Published • 1)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (Paper • 2311.16502 • Published • 35)
- Holistic Evaluation of Text-To-Image Models (Paper • 2311.04287 • Published • 11)
- MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (Paper • 2311.07463 • Published • 13)
- Trusted Source Alignment in Large Language Models (Paper • 2311.06697 • Published • 10)
- DiLoCo: Distributed Low-Communication Training of Language Models (Paper • 2311.08105 • Published • 14)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Paper • 2311.12022 • Published • 26)
- GAIA: a benchmark for General AI Assistants (Paper • 2311.12983 • Published • 187)
- gorilla-llm/APIBench (Updated • 96 • 65)
- Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models (Paper • 2312.04724 • Published • 20)
- Personalised Distillation: Empowering Open-Sourced LLMs with Adaptive Learning for Code Generation (Paper • 2310.18628 • Published • 7)
- TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise (Paper • 2310.19019 • Published • 9)
- Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs (Paper • 2311.02262 • Published • 10)
- Thread of Thought Unraveling Chaotic Contexts (Paper • 2311.08734 • Published • 6)