-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Paper • 2310.03714 • Published • 33 -
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
Paper • 2312.10003 • Published • 38 -
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
Paper • 2308.08155 • Published • 5 -
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 188
Collections
Discover the best community collections!
Collections including paper arxiv:2310.06770
-
A Survey on Language Models for Code
Paper • 2311.07989 • Published • 22 -
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 4 -
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 11 -
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
Paper • 2402.14261 • Published • 10
-
Creative Robot Tool Use with Large Language Models
Paper • 2310.13065 • Published • 9 -
CodeCoT and Beyond: Learning to Program and Test like a Developer
Paper • 2308.08784 • Published • 5 -
Lemur: Harmonizing Natural Language and Code for Language Agents
Paper • 2310.06830 • Published • 31 -
CodePlan: Repository-level Coding using LLMs and Planning
Paper • 2309.12499 • Published • 74
-
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
Paper • 2310.15511 • Published • 4 -
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Paper • 2310.14566 • Published • 25 -
SmartPlay : A Benchmark for LLMs as Intelligent Agents
Paper • 2310.01557 • Published • 12 -
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation
Paper • 2310.03214 • Published • 18
-
A Survey on Language Models for Code
Paper • 2311.07989 • Published • 22 -
Evaluating Large Language Models Trained on Code
Paper • 2107.03374 • Published • 8 -
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 4 -
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Paper • 2102.04664 • Published • 2