btjhjeon's Collections: Multimodal Benchmarks
- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model (arXiv 2407.07053)
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models (arXiv 2407.12772)
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models (arXiv 2407.11691)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (arXiv 2408.02718)
- Teaching CLIP to Count to Ten (arXiv 2302.12066)
- GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models (arXiv 2408.11817)
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? (arXiv 2408.13257)
- UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios (arXiv 2408.17267)
- VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images (arXiv 2408.16176)
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark (arXiv 2409.02813)
- DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? (arXiv 2409.07703)
- OmniBench: Towards The Future of Universal Omni-Language Models (arXiv 2409.15272)
- YesBut: A High-Quality Annotated Multimodal Dataset for Evaluating Satire Comprehension Capability of Vision-Language Models (arXiv 2409.13592)
- Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos (arXiv 2410.02763)
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks (arXiv 2410.12381)
- WorldMedQA-V: A Multilingual, Multimodal Medical Examination Dataset for Multimodal Language Models Evaluation (arXiv 2410.12722)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models (arXiv 2410.10139)
- MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks (arXiv 2410.10563)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content (arXiv 2410.10783)
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models (arXiv 2410.10818)
- MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models (arXiv 2410.09733)
- TVBench: Redesigning Video-Language Evaluation (arXiv 2410.07752)
- MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures (arXiv 2410.13754)
- The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio (arXiv 2410.12787)
- JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation (arXiv 2410.17250)
- NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples (arXiv 2410.14669)
- CAMEL-Bench: A Comprehensive Arabic LMM Benchmark (arXiv 2410.18976)
- TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts (arXiv 2410.18071)
- CLEAR: Character Unlearning in Textual and Visual Modalities (arXiv 2410.18057)
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (arXiv 2410.19168)
- BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays (arXiv 2410.21969)
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models (arXiv 2410.23266)
- DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models (arXiv 2411.00836)
- M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models (arXiv 2411.04075)
- M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework (arXiv 2411.06176)
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation (arXiv 2411.07975)
- VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models (arXiv 2411.17451)
- Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment (arXiv 2411.17188)
- VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format (arXiv 2411.17991)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs (arXiv 2411.15296)
- VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (arXiv 2412.00947)
- AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? (arXiv 2412.02611)
- 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark (arXiv 2412.07825)
- OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations (arXiv 2412.07626)
- Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions (arXiv 2412.08737)
- BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities (arXiv 2412.07769)
- Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models (arXiv 2412.12606)
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain (arXiv 2412.13018)
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces (arXiv 2412.14171)
- MMFactory: A Universal Solution Search Engine for Vision-Language Tasks (arXiv 2412.18072)