arxiv:2408.16357

Law of Vision Representation in MLLMs

Published on Aug 29, 2024
· Submitted by chenfengx on Aug 30, 2024

Abstract

We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
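To make the claimed linear relationship concrete, here is a minimal sketch of how one could check it: fit a least-squares line between AC scores and benchmark performance and report the coefficient of determination. The function name `fit_line` and the numbers below are illustrative placeholders, not values from the paper, and the sketch assumes one AC score and one performance number per vision representation setting.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares residual error against the variance of y around its mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Placeholder data for illustration only; these are NOT results from the paper.
ac_scores = [0.30, 0.45, 0.50, 0.62, 0.71, 0.80]
performance = [52.0, 55.5, 57.0, 60.5, 62.0, 65.5]

slope, intercept, r2 = fit_line(ac_scores, performance)
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.3f}")
```

An R^2 close to 1 on real measurements would support the linear relationship; how the paper combines the alignment and correspondence terms into a single AC score is not shown here.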

Community

We study how to connect visual representations to MLLM performance, and propose an AC policy that suggests which vision model to use! 😉

Hi @chenfengx congrats on this work!

It would be great to update pipeline_tag: text-generation to pipeline_tag: image-text-to-text in each of the model repositories, which is more appropriate for VLMs (models like LLaVA, Florence-2, PaliGemma, etc. also use this tag).

This way people can discover them from https://huggingface.co/models?pipeline_tag=image-text-to-text.

Cheers!
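For reference, the tag lives in the YAML front matter at the top of each model repository's README.md. Below is a minimal sketch of the change as a plain text rewrite; the helper name `set_pipeline_tag` is hypothetical, and the sketch assumes a simple front matter block delimited by `---` lines with at least one entry.

```python
import re


def set_pipeline_tag(readme_text, new_tag="image-text-to-text"):
    """Return readme_text with pipeline_tag in the leading front matter set to new_tag."""
    match = re.match(r"^---\n(.*?)\n---\n", readme_text, flags=re.DOTALL)
    if match is None:
        # No front matter found: prepend a minimal block.
        return f"---\npipeline_tag: {new_tag}\n---\n{readme_text}"
    front = match.group(1)
    line = f"pipeline_tag: {new_tag}"
    if re.search(r"^pipeline_tag:.*$", front, flags=re.MULTILINE):
        # Replace the existing tag line in place.
        front = re.sub(r"^pipeline_tag:.*$", line, front, flags=re.MULTILINE)
    else:
        # Append the tag as a new metadata entry.
        front = front + "\n" + line
    return f"---\n{front}\n---\n" + readme_text[match.end():]


card = "---\nlicense: apache-2.0\npipeline_tag: text-generation\n---\n# Model\n"
print(set_pipeline_tag(card))
```

Editing the README.md metadata directly in the Hub web interface achieves the same result; the sketch only shows what changes in the file.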

Thanks and congrats on this work, @chenfengx!
In your paper:

$$\text{A SCORE} = \frac{1}{n} \sum_{i=0}^{n} \max_{u, v} S_{c}\left(\hat{E}_i^{(u)}, E_i^{(v)}\right)$$
I have a specific question about this equation. Could you elaborate on how the embedding vector $E_i^{(v)}$ is computed from the visual features $F$? And what exactly are the visual features $F$ here?

Thank you very much for your time and for sharing your valuable research with the community. I am looking forward to your response.
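As a reading aid for the formula quoted above, here is a minimal sketch of the averaging and max steps. It assumes that $S_c$ is cosine similarity and that each sample $i$ comes with two lists of embedding vectors, indexed by $u$ and $v$; both are assumptions for illustration. It does not address how $E_i^{(v)}$ is obtained from $F$, which is the open question, and the names `cosine_similarity` and `a_score` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def a_score(reference_embeddings, candidate_embeddings):
    """Average, over samples, of the best pairwise similarity between the two sets.

    reference_embeddings[i] and candidate_embeddings[i] are lists of vectors
    for sample i (the u-indexed and v-indexed sets in the formula).
    """
    per_sample = []
    for ref_vectors, cand_vectors in zip(reference_embeddings, candidate_embeddings):
        best = max(
            cosine_similarity(r, c) for r in ref_vectors for c in cand_vectors
        )
        per_sample.append(best)
    return sum(per_sample) / len(per_sample)


# Toy vectors for illustration only.
reference = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]
candidate = [[[2.0, 0.0]], [[0.0, 3.0]]]
print(a_score(reference, candidate))
```

In the toy example the first sample has a perfectly aligned pair and the second has only an orthogonal pair, so the score is the mean of those two best matches.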

Thanks for your interest in our work!
