Collections including paper arxiv:2404.03543

- Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
  Paper • 2405.07990 • Published • 16
- Large Language Models as Planning Domain Generators
  Paper • 2405.06650 • Published • 9
- AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation
  Paper • 2404.12753 • Published • 41
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  Paper • 2404.07972 • Published • 46

- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  Paper • 2404.03543 • Published • 15
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
  Paper • 2406.11931 • Published • 58
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  Paper • 2407.18901 • Published • 33
- Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents
  Paper • 2408.07060 • Published • 40

- GAIA: A Benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 187
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 24
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 34

- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
  Paper • 2404.05961 • Published • 64
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
  Paper • 2404.05719 • Published • 82
- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  Paper • 2404.03543 • Published • 15
- Are Large Language Models Superhuman Chemists?
  Paper • 2404.01475 • Published • 16

- CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
  Paper • 2404.03543 • Published • 15
- McEval: Massively Multilingual Code Evaluation
  Paper • 2406.07436 • Published • 39
- BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
  Paper • 2406.15877 • Published • 45
- Qwen2 Technical Report
  Paper • 2407.10671 • Published • 160

- Unlocking the Conversion of Web Screenshots into HTML Code with the WebSight Dataset
  Paper • 2403.09029 • Published • 54
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
  Paper • 2403.12968 • Published • 24
- RAFT: Adapting Language Model to Domain Specific RAG
  Paper • 2403.10131 • Published • 67
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
  Paper • 2403.09629 • Published • 75

- A Survey on Language Models for Code
  Paper • 2311.07989 • Published • 21
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Paper • 2310.06770 • Published • 4
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
  Paper • 2401.03065 • Published • 11
- Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
  Paper • 2402.14261 • Published • 10

- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
  Paper • 2306.00107 • Published • 3
- MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
  Paper • 2309.08730 • Published • 1
- ChatMusician: Understanding and Generating Music Intrinsically with LLM
  Paper • 2402.16153 • Published • 56
- CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark
  Paper • 2401.11944 • Published • 27