NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Abstract
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However, are these models truly effective? In this work, we show that VLMs still struggle with natural images and questions that humans can easily answer, which we term natural adversarial samples. We also find it surprisingly easy to generate these VQA samples from natural image-text corpora using off-the-shelf models like CLIP and ChatGPT. We propose a semi-automated approach to collect a new benchmark, NaturalBench, for reliably evaluating VLMs with 10,000 human-verified VQA samples. Crucially, we adopt a vision-centric design by pairing each question with two images that yield different answers, preventing blind solutions from answering without using the images. This makes NaturalBench more challenging than previous benchmarks that can be solved with commonsense priors. We evaluate 53 state-of-the-art VLMs on NaturalBench, showing that models like LLaVA-OneVision, Cambrian-1, Llama3.2-Vision, Molmo, Qwen2-VL, and even GPT-4o lag 50%-70% behind human performance (over 90%). We analyze why NaturalBench is hard from two angles: (1) Compositionality: Solving NaturalBench requires diverse visio-linguistic skills, including understanding attribute bindings, object relationships, and advanced reasoning like logic and counting. To this end, unlike prior work that uses a single tag per sample, we tag each NaturalBench sample with 1 to 8 skill tags for fine-grained evaluation. (2) Biases: NaturalBench exposes severe biases in VLMs, as models often choose the same answer regardless of the image. Lastly, we apply our benchmark curation method to diverse data sources, including long captions (over 100 words) and non-English languages like Chinese and Hindi, highlighting its potential for dynamic evaluations of VLMs.
Community
🚀 Make Vision Matter in Visual-Question-Answering (VQA)!
Introducing NaturalBench, a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with pairs of simple questions about natural imagery. 🌍📸
Here’s what we found after testing 53 models (including GPT-4o, Llama3.2, Qwen2VL, and Molmo):
1️⃣ All models struggle: They perform only 10-20% above random chance, while human accuracy exceeds 90%!
2️⃣ Models appear strong on previous benchmarks like MME/ScienceQA by exploiting language bias. In fact, even a blind ChatGPT (without vision) can outperform vision models on these benchmarks.
3️⃣ Debiasing is crucial: Most models prefer "Yes" far more than "No" — correcting this bias can nearly double performance, even for GPT-4o.
Paper: https://arxiv.org/abs/2410.14669
Dataset: https://huggingface.co/datasets/BaiqiL/NaturalBench
Website: https://linzhiqiu.github.io/papers/naturalbench/
Work led by CMU & UW with Baiqi Li, Zhiqiu Lin, Wenxuan Peng, Jean de Dieu Nyandwi, Daniel Jiang, Zixian Ma, Simran Khanuja, Ranjay Krishna, Graham Neubig, Deva Ramanan
Popular VQA benchmarks like MME, MMMU, MMBench, and ScienceQA are prone to blind solutions. For example, models can exploit language bias to answer questions like “What is the capital of Massachusetts?” (“Boston”) without looking at the image.
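To solve this, each NaturalBench sample pairs two images with two questions such that each question has a different answer on each image. A blind model that ignores the image gives the same answer for both images, so it cannot get both right.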
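The effect of this paired design on a blind model can be illustrated with a small sketch. Everything here is hypothetical: the sample, the identifiers `q0`/`img0`, and the strict "all four answers correct" rule are assumptions for illustration, not the benchmark's official scoring code.

```python
from typing import Callable, Dict, Tuple

# Ground truth for one hypothetical sample, keyed by (question_id, image_id).
# Each question has opposite answers on the two images.
GROUND_TRUTH: Dict[Tuple[str, str], str] = {
    ("q0", "img0"): "Yes",
    ("q0", "img1"): "No",
    ("q1", "img0"): "No",
    ("q1", "img1"): "Yes",
}

Predictor = Callable[[str, str], str]


def per_item_accuracy(predict: Predictor) -> float:
    """Fraction of the four (question, image) combinations answered correctly."""
    hits = sum(predict(q, i) == a for (q, i), a in GROUND_TRUTH.items())
    return hits / len(GROUND_TRUTH)


def sample_solved(predict: Predictor) -> bool:
    """Strict scoring (an assumption): all four answers must be right."""
    return all(predict(q, i) == a for (q, i), a in GROUND_TRUTH.items())


def blind_always_yes(question: str, image: str) -> str:
    # Ignores the image entirely, as a language-only model would.
    return "Yes"


def oracle(question: str, image: str) -> str:
    # Stand-in for a model that actually uses the image.
    return GROUND_TRUTH[(question, image)]


print(per_item_accuracy(blind_always_yes), sample_solved(blind_always_yes))
print(per_item_accuracy(oracle), sample_solved(oracle))
```

Any predictor that depends only on the question gets exactly one of the two images right per question, so it is stuck at half of the individual items and never solves the whole sample under the strict rule.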
NaturalBench is collected using a simple pipeline from datasets like Flickr30K by (1) identifying image-text pairs that CLIP fails to match and (2) prompting ChatGPT to generate questions with different answers for each image.
Since NaturalBench avoids perturbing images or questions, it creates natural adversarial samples—questions about natural images that are easy for humans but challenge models.
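Step (1) of the pipeline can be sketched as a check on a 2x2 similarity matrix. This is a minimal illustration that assumes the similarities have already been computed by some CLIP-like model; the function name `clip_fails_to_match`, the exact failure criterion, and the scores are all hypothetical and may differ from the authors' implementation. Step (2), prompting ChatGPT, is not shown.

```python
from typing import Sequence


def clip_fails_to_match(sim: Sequence[Sequence[float]]) -> bool:
    """Check two image-caption pairs for a matching failure.

    sim[i][j] is the similarity between image i and caption j, where
    caption i is the true caption of image i. Returns True if any image
    prefers the wrong caption or any caption prefers the wrong image.
    """
    # Each image should score higher with its own caption.
    image_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Each caption should score higher with its own image.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    return not (image_ok and text_ok)


# Invented scores: the first pair set is matched correctly, the second is not
# (image 0 scores higher with caption 1 than with its own caption).
easy_pair = [[0.31, 0.25], [0.22, 0.30]]
confusing_pair = [[0.27, 0.29], [0.24, 0.30]]

print(clip_fails_to_match(easy_pair))
print(clip_fails_to_match(confusing_pair))
```

Pairs flagged this way are candidates for question generation, since the two images are similar enough to confuse the matching model while remaining natural, unperturbed photos.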
While previous VQA benchmarks can be solved by fine-tuning a blind GPT-3.5, NaturalBench cannot!
Most open-source models score only 10–20% above chance, and even GPT-4o (vision-finetuned) falls ~50% behind humans.
Vision-language models show strong answer biases, often favoring “Yes” over “No” regardless of the input image/question. Correcting these biases can boost top models' performance by 2-3x, making NaturalBench a valuable testbed for future debiasing efforts.
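One simple way to picture a bias correction is a shifted decision threshold. The sketch below is an assumption for illustration, not necessarily the authors' procedure: `yes_score` stands in for something like P("Yes") minus P("No"), the values are invented to mimic a "Yes"-leaning model, and the threshold is picked using ground-truth labels, so the result is an optimistic estimate.

```python
from typing import List, Tuple

# Hypothetical (yes_score, ground_truth) pairs for a "Yes"-leaning model.
SCORES: List[Tuple[float, str]] = [
    (0.9, "Yes"), (0.6, "No"),
    (0.8, "Yes"), (0.5, "No"),
    (0.7, "Yes"), (0.4, "No"),
]


def accuracy_at(threshold: float) -> float:
    """Accuracy when answering "Yes" only if yes_score exceeds the threshold."""
    hits = sum(("Yes" if s > threshold else "No") == gt for s, gt in SCORES)
    return hits / len(SCORES)


def best_threshold(candidates: List[float]) -> float:
    """Return the candidate threshold with the highest accuracy."""
    return max(candidates, key=accuracy_at)


candidates = [0.0, 0.25, 0.5, 0.65, 0.75]
tau = best_threshold(candidates)
print(accuracy_at(0.0), tau, accuracy_at(tau))
```

With the default threshold of 0.0 this toy model answers "Yes" to everything and sits at 50%, even though its scores rank the "Yes" cases above the "No" cases; moving the threshold recovers that ranking.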
NaturalBench offers 1-8 skill tags per question for a fine-grained evaluation of compositional reasoning across dimensions like object, attribute, relationship, reasoning, and more.
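Because a question can carry several tags, a per-skill breakdown counts each question once under every tag it has. The helper below is a minimal sketch of that aggregation; the function name `accuracy_by_tag` and the example results are hypothetical, and the tag strings are borrowed from the dimensions named above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_tag(results: Iterable[Tuple[List[str], bool]]) -> Dict[str, float]:
    """Compute accuracy per skill tag from (tags, is_correct) records."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for tags, correct in results:
        # set() guards against a tag being listed twice on one question.
        for tag in set(tags):
            totals[tag] += 1
            hits[tag] += int(correct)
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Invented per-question outcomes for illustration only.
example_results = [
    (["object", "attribute"], True),
    (["attribute", "reasoning"], False),
    (["object"], True),
]

print(accuracy_by_tag(example_results))
```

Per-tag totals overlap under this scheme, so the per-tag accuracies are not expected to average to the overall accuracy.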