Model
We used MLCD as the vision encoder in LLaVA-NeXT.
Data
Our model was trained on publicly available data from the LLaVA-Pretrain and LLaVA-NeXT-Data datasets.
How to evaluate
pip install lmms-eval==0.2.0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m accelerate.commands.launch \
--main_process_port=12581 \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5 \
--tasks mmbench,mme,mmmu,ocrbench,scienceqa,scienceqa_img,seedbench,gqa,pope,textvqa_val,ai2d,chartqa,docvqa_val,infovqa_val,mmstar \
--batch_size 1 \
--log_samples \
--log_samples_suffix mlcd_llava_qwen2_7b \
--output_path ./log
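If you need to run a subset of the tasks or use fewer GPUs, it can help to assemble the same command programmatically. The following is a minimal sketch, not part of the released tooling: the helper names `build_eval_command` and `visible_devices` are hypothetical, and every flag is copied from the command above. It only prints the command; it does not launch it.

```python
import shlex

# Task names copied verbatim from the command above.
TASKS = [
    "mmbench", "mme", "mmmu", "ocrbench", "scienceqa", "scienceqa_img",
    "seedbench", "gqa", "pope", "textvqa_val", "ai2d", "chartqa",
    "docvqa_val", "infovqa_val", "mmstar",
]


def visible_devices(num_gpus):
    """Return a CUDA_VISIBLE_DEVICES value such as '0,1,2,3' (hypothetical helper)."""
    return ",".join(str(i) for i in range(num_gpus))


def build_eval_command(num_processes=8, port=12581, tasks=TASKS):
    """Build the evaluation command as an argument list (hypothetical helper)."""
    return [
        "python", "-m", "accelerate.commands.launch",
        f"--main_process_port={port}",
        f"--num_processes={num_processes}",
        "-m", "lmms_eval",
        "--model", "llava",
        "--model_args",
        "pretrained=DeepGlint-AI/llava-mlcd-qwen2.5-7b,conv_template=qwen_1_5",
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", "mlcd_llava_qwen2_7b",
        "--output_path", "./log",
    ]


# Example: the same evaluation on 4 GPUs, restricted to two tasks.
cmd = build_eval_command(num_processes=4, tasks=["ai2d", "chartqa"])
print(f"CUDA_VISIBLE_DEVICES={visible_devices(4)} {shlex.join(cmd)}")
```

The printed line can be pasted into a shell. Whether the scores reported below are reproduced exactly with a different number of processes is an assumption that has not been verified here.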
Performance and Limitations
In our experiments, we replaced the CLIP vision encoder in LLaVA-NeXT with MLCD to measure how MLCD performs inside a Multimodal Large Language Model (MLLM). Both configurations use Qwen2.5-7B as the language model. With MLCD, the model scores higher than the CLIP baseline on 15 of the 17 benchmark rows below and slightly lower on POPE and TextVQA_val, which supports the effectiveness of MLCD as a vision tower for MLLMs.
Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
---|---|---|
LLM | Qwen2.5-7B | Qwen2.5-7B |
AI2D | 76.98 | 73.15 |
ScienceQA_img | 78.09 | 76.35 |
GQA | 64.17 | 63.31 |
InfoVQA_val | 43.48 | 38.88 |
MMBench_cn_dev | 74.83 | 72.51 |
MMBench_en_dev | 76.37 | 74.57 |
MME(cognition) | 432 | 384 |
MME(perception) | 1598 | 1512 |
SeedBench | 68.20 | 66.80 |
SeedBench_img | 73.75 | 72.72 |
MMStar | 50.98 | 48.98 |
MMMU | 44.30 | 44.20 |
OCRBench | 531.00 | 525.00 |
ChartQA | 67.84 | 66.52 |
DocVQA_val | 76.46 | 75.21 |
POPE | 88.69 | 88.83 |
TextVQA_val | 61.69 | 62.47 |
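To make the comparison easier to read, the per-benchmark differences can be computed directly from the table. This is a small sketch that uses only the numbers reported above; the variable and function names are illustrative. Note that MME and OCRBench are raw scores on different scales, so the differences are not comparable across rows.

```python
# (MLCD score, CLIP score) per benchmark, copied from the table above.
SCORES = {
    "AI2D": (76.98, 73.15),
    "ScienceQA_img": (78.09, 76.35),
    "GQA": (64.17, 63.31),
    "InfoVQA_val": (43.48, 38.88),
    "MMBench_cn_dev": (74.83, 72.51),
    "MMBench_en_dev": (76.37, 74.57),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
    "SeedBench": (68.20, 66.80),
    "SeedBench_img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "OCRBench": (531.00, 525.00),
    "ChartQA": (67.84, 66.52),
    "DocVQA_val": (76.46, 75.21),
    "POPE": (88.69, 88.83),
    "TextVQA_val": (61.69, 62.47),
}


def score_deltas(scores):
    """Return {benchmark: MLCD minus CLIP}, rounded to two decimals."""
    return {name: round(mlcd - clip, 2) for name, (mlcd, clip) in scores.items()}


deltas = score_deltas(SCORES)
wins = [name for name, d in deltas.items() if d > 0]
losses = [name for name, d in deltas.items() if d < 0]

for name, d in deltas.items():
    print(f"{name:16s} {d:+.2f}")
print(f"MLCD higher on {len(wins)} rows, lower on {len(losses)}: {', '.join(losses)}")
```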
Limitations
Models trained on larger datasets are expected to perform better across more tasks. We are currently training such models and plan to release them soon.
Acknowledgments
We would like to express our gratitude to Yumeng Wang for his significant contributions to the experimental validation in MLLMs.