Key Insights into the Law of Vision Representations in MLLMs
arXiv: https://arxiv.org/abs/2408.16357
Hugging Face: https://huggingface.co/papers/2408.16357
GitHub: https://github.com/bronyayang/Law_of_Vision_Representation_in_MLLMs
We gave our paper a somewhat startling title. Essentially, we control the variables within the MLLM and, by changing only the vision representation, identify two factors, cross-modal Alignment and Correspondence, that are closely related to the model’s performance on downstream tasks. Improving the vision representation on these two factors can lead to more competitive MLLMs. We also found that it’s difficult for existing vision representations to excel at both factors, which implies a trade-off. Our method requires only a few experiments to identify the optimal vision representation, saving 99.7% of the cost. More importantly, compared to other works, we went a step further in discussing the specific reasons why different vision features have such a significant impact on MLLMs.
How this project started
The overall exploration of this paper is based on the MLLM structure of a pretrained vision encoder + connector/alignment module + LLM (self-attention based). The motivation stems from our previous paper HallE-Control, where we discovered hallucination issues caused by CLIP misalignment. Recently, Eyes Wide Shut also highlighted the challenges that CLIP introduces to MLLMs.
Eyes Wide Shut combined two vision features, CLIP and DINOv2, in an interleaved manner and claimed that DINOv2 contains more "detailed information." Following the same recipe, we conducted a small experiment in which we interleaved random noise with CLIP embeddings, and were surprised to find that it yielded minor benchmark improvements similar to those obtained with DINOv2 in the original paper. Our first author, Yang, privately joked about this with Tong. Very quickly, in their latest paper, they almost made up for the previous shortcomings with an enormous number of experiments. Saining's real boss, Yann LeCun, truly has deep pockets. This time, they found that a concatenated approach worked better than the interleaved one, and they also did extensive data work. Moreover, they argued that MLLMs should be considered a downstream task of vision representation.
Directly evaluating performance on tasks is certainly beneficial, but treating the entire MLLM pipeline as an evaluation method is incredibly costly. Training a LLaVA model requires 8 A100 GPUs and 15 hours of training. If you increase the resolution or combine more features, it takes even longer, not to mention the number of experiments needed for tuning parameters.
A more important reason is that, as a researcher, I naturally want to understand why some features perform better in MLLMs than others. For instance, when Eyes Wide Shut mentions that DINOv2 provides more detail, I find it puzzling—why does it have more detail? What exactly is the nature of this detail? And why not just use DINOv2 alone? For example, we often combine DINOv2 with CLIP or SigLIP, but if you use DINOv2 alone, you'll find that while it's effective, it doesn't match up to CLIP. Isn't DINOv2 supposed to have more detail than CLIP? And when increasing CLIP's resolution from 224 to 336 results in performance gains, we can simply attribute this to the benefit of higher resolution, but if we think further, in what ways does increasing resolution enhance MLLMs? Why does the combination of 224 resolution CLIP with 224 resolution DINOv2 outperform CLIP336? There must be a deeper reason behind this.
The idea originated from Dr. Xu's recent work on diffusion models for 3D detection, where they were very concerned with correspondence. The method for measuring correspondence involves marking points that have the same meaning on two images and seeing how many of those points can be matched by the most similar features. Based on the definition of correspondence, we can interpret "detail" as ensuring that as much fine-grained mapping as possible is preserved within the feature. This mapping space can be language-based or not.
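The matching procedure described above can be sketched roughly as follows. This is a minimal illustration, not the exact protocol from the paper: it assumes features live on a patch grid, keypoints are given as patch coordinates, matching uses cosine similarity, and a match counts as correct when it lands within `tol` patches of the annotated point. The function name `correspondence_score` and the `tol` parameter are hypothetical.

```python
import numpy as np

def correspondence_score(feat_src, feat_tgt, kp_src, kp_tgt, tol=1.0):
    """Fraction of source keypoints whose most similar target patch is near the annotated match.

    feat_src, feat_tgt: (H, W, D) patch feature grids of two images.
    kp_src, kp_tgt:     (K, 2) integer (row, col) patch coordinates of points
                        that have the same meaning in the two images.
    tol:                allowed distance in patch units (an assumption of this sketch).
    """
    H, W, D = feat_tgt.shape
    tgt = feat_tgt.reshape(-1, D)
    tgt = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + 1e-8)
    hits = 0
    for (r, c), (tr, tc) in zip(kp_src, kp_tgt):
        query = feat_src[r, c]
        query = query / (np.linalg.norm(query) + 1e-8)
        best = int(np.argmax(tgt @ query))   # index of the most similar target patch
        pr, pc = divmod(best, W)             # back to (row, col) on the grid
        if np.hypot(pr - tr, pc - tc) <= tol:
            hits += 1
    return hits / len(kp_src)
```

A feature with good correspondence scores close to 1 on pairs of images showing the same kind of object, because the nearest feature really is the semantically matching point.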
Law of Vision Representation in MLLMs
The pattern we discovered is quite simple: there is a strong correlation between model performance and cross-modal alignment + correspondence.
A: Cross-modal Alignment is, in my opinion, a necessary factor for MLLMs when you don’t have the luxury of pretraining and are limited to fine-tuning data. In our paper, we provided a relatively simple proof. The general idea is that if the distribution of vision features is closer to that of language, then multimodal fine-tuning causes less disruption to the original language model’s distribution, leading to more stable training. Simply put, if the vision features take care of the cross-modal alignment, the language model doesn’t have to, which makes training more efficient. We calculate cross-modal alignment by directly averaging the similarity between the vision embedding you want to measure and the CLIP@224 and CLIP@336 embeddings. (Here’s a personal note: currently, many people build VLMs through post-training on large datasets, which is actually a resource-constrained solution. Both the LLaMA group and the academic community face resource limitations. If large companies were to engage in MLLM pretraining, there would inevitably be conflicts and overlaps with LLM pretraining, leading to political struggles, and MLLM teams often lose out to LLM teams. However, I believe that as LLMs become more refined, we will gradually move toward MLLM pretraining, which means those originally working on LLMs will expand their scope.)
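To make the averaging concrete, here is a minimal sketch of an A-score-style computation. It rests on assumptions that go beyond the text above: all three embeddings are taken after the projector, have already been brought to the same `(tokens, dim)` shape, and are compared token by token with cosine similarity. The names `mean_cosine` and `a_score` are hypothetical, and the paper's exact pooling may differ.

```python
import numpy as np

def mean_cosine(a, b):
    """Mean cosine similarity between paired rows of a and b, both shaped (tokens, dim)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return float(np.mean(np.sum(a * b, axis=-1)))

def a_score(candidate, clip_224, clip_336):
    """Average the candidate's similarity to the two CLIP references (assumed same shape)."""
    return 0.5 * (mean_cosine(candidate, clip_224) + mean_cosine(candidate, clip_336))
```

Under this sketch, a candidate identical to both references scores about 1, and a candidate unrelated to them scores near 0.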
C: Current contrastive-based vision foundation models, due to limitations in data coverage, the contrastive algorithm itself, and data granularity issues, struggle to achieve optimal correspondence, leading to phenomena observed in other papers such as biased object detection, lack of detail, and so on. Correspondence refers to the ability of vision features to enable MLLMs to retrieve image details. We hypothesize that if a text query has attention on an image embedding token in the picture, then a vision feature with good correspondence can retrieve all information related to that image embedding token across the entire image (since accurate correspondence means the highest semantic similarity). We could also crudely assume that all patches related to the image patch are being indexed, or that all details related to the attended detail are being retrieved. Personally, I think this has some similarities to RAG, and based on this assumption, I also believe that this correspondence would be beneficial for video LLMs. However, due to limited GPU resources, we couldn't further conduct experiments related to video in this paper. We welcome anyone interested to follow up on this, or if any sponsors would like to support us with GPUs to test this, we'd be very grateful.
Summary: The model's performance on benchmarks is positively correlated with a second-degree polynomial transformation of A and C. Improving both A and C simultaneously, or improving one while keeping the other constant, leads to better model performance. However, many of the current tricks and models cannot improve both at the same time.
Interesting Findings and Discussions
Allow me to share some of my personal thoughts in this section. These ideas may not all be supported by experiments, but I am eager to discuss them with my peers. If anyone finds some of these points intriguing, our team would be more than happy to collaborate.
1. Why use feature combination and multiple vision encoders?
This was actually the initial point we wanted to explore in this paper. Early on, MLLMs commonly used contrastively pretrained vision encoders like CLIP. During that time, I also experimented with using SAM and DINOv2 features individually, but the results were not promising. We later tried using diffusion models as encoders and performed supervised fine-tuning (SFT) on millions of data points, but the results were still mediocre. Then, Eyes Wide Shut discussed the inherent issues with CLIP and claimed that using an interleaved method for feature concatenation could improve performance. Although I personally have reservations about this method, feature combination has since become mainstream (credit to AI KOL). Tong followed up with a paper that reverted to the simple concatenation method. But why does concatenation improve performance, while individual usage does not? Based on our findings and definitions in this paper, we believe that concatenated features enhance visual representation correspondence while minimizing the decrease in cross-modal alignment. Only when this trade-off is balanced within a suitable range can improvements be seen. If alignment drops significantly, the potential for improving correspondence is limited, which could lead to negative effects.
2. What are the current issues with MLLM benchmarks?
Today’s MLLM evaluations have certain issues. Many benchmarks mix tests of different abilities, making model diagnostics less clear. For example, benchmarks like MMMU place greater demands on language models than on vision capabilities, resulting in the observation that simply increasing the size of the LLM yields more benefits than enhancing vision features. Additionally, current benchmarks lack coverage of scenarios involving text, hand-drawn, and line-drawing images, with classification being relatively coarse, leading to less intuitive analysis of model performance. The community generally hasn’t yet separated the M (multimodal) part from the LLM part in MLLMs for individual study. However, I believe that as the low-hanging fruit gets exhausted, people will gradually begin to study these parts separately. On a side note, I would advise against delving too deeply into LLMs without sufficient resources in the MLLM community. As someone working on LLMs in a mid-sized company, I can attest to the countless pitfalls in LLMs—so many that even Professor Percy Liang once told me he doesn’t have the energy to touch MLLMs anymore. Who knows how many more variables would be introduced? Until you fully understand LLMs, you won’t know how much or in what ways adding a new modality impacts the LLM itself.
3. Is the correspondence of visual representations the same for different types of images?
As shown in the image below, DINOv2's correspondence ability has a significant advantage in natural images (such as photographs). This is why concatenating DINOv2 often yields noticeable improvements on many vision-based benchmarks. However, when it comes to tasks that involve a lot of OCR-based work or tasks involving lines and handwritten content, CLIP features still prove more useful. This makes it challenging for other features to assist CLIP in analyzing images with text. It’s possible that with extensive training, models could learn to prioritize different features in different scenarios. A bold guess would be that if MLLMs used Mixture of Experts (MoE) in the vision feature section, and applied Mistral’s method in the LLM section, it could be beneficial. Of course, MoE training itself has many pitfalls, so it would likely require numerous experiments and struggles.
4. What is the relationship between high correspondence and RAG?
As previously discussed, if the visual representation has high correspondence, then any detail in the image can be matched by our features to similar details in other images or other parts of the same image. Imagine an image where every small patch or detail could be matched with a detailed caption. I guess this is the ultimate goal that CLIP aims to achieve. However, currently, CLIP can only map both feet (left and right) to the generic concept of "foot," meaning that different feet aren't matched with detailed captions, resulting in low correspondence. If we concatenate CLIP and DINOv2 features, even though CLIP often cannot map every detail directly to the language part, DINOv2's high correspondence ensures that once the language part attends to a vision detail (the role of CLIP features), this vision detail can attend to other similar parts (the role of DINOv2), somewhat resembling RAG. It's like an image search that gets augmented, and this explains why channel-wise concatenation is necessary; otherwise, the tokens retrieved wouldn’t be the same. Although this idea is quite abstract, based on this conceptual understanding, I infer that this concatenation method should be very helpful for video tasks. We tested our model on a set of video tasks, and sure enough, we observed significant improvements, which could be seen as a form of validation—though not very strong, so it might require another paper to fully demonstrate. My understanding is that the high-correspondence part of the feature helps to retrieve the parts that CLIP didn’t map well, serving as a reference material retrieved from the database. This concept is fairly intuitive, and we provided a simple proof in our paper. I personally believe that current methods often involve a certain degree of compromise with existing features.
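The difference between the two ways of combining features can be sketched in a few lines. This is only an illustration with hypothetical function names: channel-wise concatenation keeps one token per patch and widens it, so a single attended token carries both the CLIP part and the DINOv2 part, while interleaving alternates tokens from the two encoders, so the two parts of the same patch sit in different tokens.

```python
import numpy as np

def concat_channelwise(clip_tokens, dino_tokens):
    """(T, D1) and (T, D2) -> (T, D1 + D2): each token holds both features of one patch."""
    return np.concatenate([clip_tokens, dino_tokens], axis=-1)

def interleave_tokens(clip_tokens, dino_tokens):
    """(T, D) and (T, D) -> (2T, D): tokens alternate CLIP, DINOv2, CLIP, DINOv2, ..."""
    if clip_tokens.shape != dino_tokens.shape:
        raise ValueError("interleaving in this sketch needs equal shapes")
    T, D = clip_tokens.shape
    out = np.empty((2 * T, D), dtype=clip_tokens.dtype)
    out[0::2] = clip_tokens   # even positions: CLIP tokens
    out[1::2] = dino_tokens   # odd positions: DINOv2 tokens
    return out
```

With channel-wise concatenation, whatever the language side attends to through the CLIP channels is the very same token whose DINOv2 channels do the matching, which is the property the retrieval analogy relies on.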
5. What does it mean for vision features to be mapped into text embeddings?
This has been a lingering question in my mind, because I was deeply shocked when I first saw LLaVA directly placing vision features into the embeddings. Embeddings are like a dictionary, where each entry has a defined meaning and a specific encoding. Even if the vision feature was trained contrastively, it wasn’t aligned at the level of LLaMA’s text embeddings, nor was it aligned with any other language model. When this vision feature is placed into the embedding space, what exactly is it? I once speculated that it might just be a small caption containing some key words. I even conducted an experiment where I calculated two different distances (L2 and cosine) between the feature (after passing through the alignment module) and all text embeddings, then checked whether the closest embeddings could form a readable sentence. The result was gibberish, which suggests that the LLM uses a language that is understandable only to the model, or perhaps that it maps vision features into a language humans don’t use and treats it as a foreign language. This led me to realize that cross-modal alignment doesn’t necessarily mean perfect alignment with existing languages; rather, this "foreign language" has a stable distribution. For example, human blood is red and the sun is in the sky, and such distributions are consistent regardless of the language. Whether you speak Chinese, English, or Russian, the sky is blue, and the sea is also blue. When mapping these concepts, certain mappings can save a lot of effort. I believe this is why many models today are so data-efficient. There is a similar perspective in the NLP community regarding machine translation. Perhaps the language alignment learned by CLIP is still insufficient, because contrastive loss doesn’t necessarily preserve grammar and relationships, which might be a fundamental issue with this type of loss.
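The nearest-embedding check described above is easy to reproduce in spirit. The sketch below is not the original experiment code: `nearest_tokens`, its arguments, and the toy vocabulary are hypothetical, and in practice `text_embeddings` would be the LLM's input embedding matrix and `vision_tokens` the projector outputs.

```python
import numpy as np

def nearest_tokens(vision_tokens, text_embeddings, vocab, metric="cosine"):
    """For each vision token, return the vocabulary entry with the closest text embedding.

    vision_tokens:   (T, D) features after the alignment module.
    text_embeddings: (V, D) text embedding table.
    vocab:           list of V token strings.
    """
    if metric == "cosine":
        v = vision_tokens / (np.linalg.norm(vision_tokens, axis=1, keepdims=True) + 1e-8)
        t = text_embeddings / (np.linalg.norm(text_embeddings, axis=1, keepdims=True) + 1e-8)
        idx = np.argmax(v @ t.T, axis=1)
    elif metric == "l2":
        diff = vision_tokens[:, None, :] - text_embeddings[None, :, :]
        idx = np.argmin((diff ** 2).sum(axis=-1), axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return [vocab[i] for i in idx]
```

Joining the returned strings gives the "sentence" one would inspect for readability.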
Experiments
Here’s a simplified summary of our experiments:
We defined two factors, the A and C scores, and calculated them for 13 sets of visual representations, comparing them with performance on various benchmarks. By fitting a 2nd-degree polynomial of these scores, we explored whether this combination could accurately fit the benchmark results. The experimental results were very promising. After completing this in the forward direction, we naturally tested it in reverse: we checked whether using only a few sample points could also produce a usable function to predict the best visual representation. By gradually sampling more points, we found that on average, sampling only 3.88 sets was sufficient to predict the best result out of the 13 experiments. During the experiments, we observed that our law was less effective on OCR-related benchmarks than on traditional object-based benchmarks. This is likely due to inherent biases in the current correspondence calculations, which overlook domains such as lines and text, hinting at certain biases in existing encoders.
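The forward fit can be sketched as an ordinary least-squares regression on second-degree polynomial features of A and C. This is a minimal illustration under my own assumptions (the feature ordering, the helper names `quad_features`, `fit_quadratic`, and `r_squared`, and plain least squares); it is not the released code.

```python
import numpy as np

def quad_features(a, c):
    """Second-degree polynomial features of (A, C): [1, A, C, A^2, A*C, C^2]."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    return np.stack([np.ones_like(a), a, c, a * a, a * c, c * c], axis=1)

def fit_quadratic(a, c, perf):
    """Least-squares coefficients mapping the polynomial features to benchmark performance."""
    w, *_ = np.linalg.lstsq(quad_features(a, c), np.asarray(perf, dtype=float), rcond=None)
    return w

def r_squared(a, c, perf, w):
    """Coefficient of determination of the fitted function on the given points."""
    perf = np.asarray(perf, dtype=float)
    pred = quad_features(a, c) @ w
    ss_res = np.sum((perf - pred) ** 2)
    ss_tot = np.sum((perf - perf.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Here `a`, `c`, and `perf` would each hold 13 values, one per visual representation, and a high `r_squared` is what "accurately fit the benchmarks" means in this sketch.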
Q&A
Q1: What do the symbols in the problem formulation mean?
- All available encoders in the world
- All possible visual representations, which are various combinations of encoders
- The number of visual representations, selected from all the possible ones, that people want to test on their model (the k below)
Previously, you had to train all k MLLMs to determine which visual representation was the best, but now with the AC Policy, you can train just a small subset of k' of them and fit a function to predict the best one. Moreover, once the function is fitted, you can continue to scale up the search space at virtually no additional cost.
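In code, the idea reads roughly like this. It is a hedged sketch rather than the actual AC Policy implementation: `predict_best` and `quad_features` are hypothetical names, the fit is plain least squares on second-degree features of the A and C scores, and with fewer trained points than coefficients `np.linalg.lstsq` simply returns the minimum-norm solution, which is an assumption of this sketch.

```python
import numpy as np

def quad_features(a, c):
    """Second-degree polynomial features of (A, C): [1, A, C, A^2, A*C, C^2]."""
    a = np.asarray(a, dtype=float)
    c = np.asarray(c, dtype=float)
    return np.stack([np.ones_like(a), a, c, a * a, a * c, c * c], axis=1)

def predict_best(a_scores, c_scores, trained_idx, trained_perf):
    """Fit on the k' trained representations, then rank all k by predicted performance.

    a_scores, c_scores: A and C scores of all k candidates (cheap, no MLLM training).
    trained_idx:        indices of the k' candidates whose MLLMs were actually trained.
    trained_perf:       their measured benchmark performance, in the same order.
    Returns the index of the candidate predicted to perform best.
    """
    X = quad_features(a_scores, c_scores)
    idx = np.asarray(trained_idx)
    w, *_ = np.linalg.lstsq(X[idx], np.asarray(trained_perf, dtype=float), rcond=None)
    return int(np.argmax(X @ w))
```

Adding a new candidate to the search space only requires computing its A and C scores and appending them to `a_scores` and `c_scores`; no extra MLLM training is needed to rank it.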
Q2: I'm not sure if I understand it correctly. When computing the A score, do you use the CLIP embedding as a "golden rule" and directly compute the similarity between the given feature and the CLIP embedding? How do you ensure that they are in the same embedding space? Is that a reliable metric?
This is a good question. Yes, the “A score” uses the CLIP embedding as a reference. It is a proposal for quantifying cross-modal alignment in a vision representation. As we wrote in the limitations section on refining the A score, using CLIP has problems, such as unintentionally accounting for resolution differences and for differences in embedding space: transformer-based encoders get a slightly larger “A score” than convolution-based encoders. However, the embedding is taken after the MLP projector, which is trained on the same data in every case. I think, and hope, that this projector does some of the work of bridging the embedding spaces. I believe this is the best we can do at this stage. I would be very excited to see a simple method that quantifies cross-modal alignment directly, without using a reference model. Perhaps that could bring the vision-based benchmark fitting from 95% to 100%?!
Q3: We would really appreciate it if you could contribute more questions and discussions!
In the end
I truly hope that more people in the Multimodal Large Language Model field will not only focus on achieving higher scores but also take the next step to explore the reasons behind each trick. It would be great to see more ablation studies rather than simply making unconsidered claims. After all, we pride ourselves on being researchers.