Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced.
Community
Our code, model, dataset, and benchmark are fully open-sourced!
Project: https://migician-vg.github.io
Paper: https://arxiv.org/abs/2501.05767
Code: https://github.com/thunlp/Migician
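For anyone who wants to try multi-image grounding locally, here is a minimal inference sketch assuming the released checkpoint follows the standard Qwen2-VL interface in transformers (the paper builds Migician on Qwen2-VL-7B). The model path, image filenames, and prompt wording below are placeholders, not the official usage; see the GitHub repository above for the authoritative loading and prompting code.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder path: point this at the released Migician checkpoint (see the repo above).
model_path = "path/to/Migician"

# Assumes the checkpoint is loadable through the Qwen2-VL classes.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

# A free-form multi-image grounding query: find in the second image the object
# shown in the first image. Image files and prompt text are illustrative only.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image_1.jpg"},
            {"type": "image", "image": "image_2.jpg"},
            {"type": "text", "text": "Locate the object shown in the first image "
                                     "within the second image and give its bounding box."},
        ],
    }
]

# Standard Qwen2-VL preprocessing: chat template plus vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens before decoding the grounded answer.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The exact coordinate format of the returned bounding box depends on the model's training convention, so check the repository's examples before parsing the output programmatically.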
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation (2024)
- Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment (2024)
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models (2024)
- ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding (2024)
- TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action (2024)
- CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models (2024)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection (2024)
Models citing this paper 1
Datasets citing this paper 2
Spaces citing this paper 0