---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
  - OpenGVLab/InternVL2.5-26B
base_model_relation: merge
language:
  - multilingual
tags:
  - Sa2VA
  - custom_code
---

# Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[\[📂 GitHub\]](https://github.com/magic-research/Sa2VA)
[\[📜 Sa2VA paper\]](https://arxiv.org/abs/2501.04001)
[\[🚀 Quick Start\]](#quick-start) 



## Introduction

Sa2VA is a multimodal large language model (MLLM) capable of question answering, visual prompt understanding, and dense object segmentation at both the image and video levels. On question-answering benchmarks, it performs comparably to the SOTA MLLMs Qwen2-VL and InternVL2.5. Unlike those models, Sa2VA also supports visual prompt understanding and dense object segmentation, and it achieves SOTA performance on both image and video grounding and segmentation benchmarks.

## Sa2VA Family

The Sa2VA series is built on Qwen2-VL and InternVL2/2.5. The table below lists the Sa2VA models built on InternVL2.5; other Sa2VA models will be open-sourced soon.

| Model Name |                             Base MLLM                              |                                Language Part                                |                        HF Link                        |
|:----------:|:------------------------------------------------------------------:|:---------------------------------------------------------------------------:|:-----------------------------------------------------:|
|  Sa2VA-1B  | [InternVL2.5-1B](https://huggingface.co/OpenGVLab/InternVL2_5-1B)  | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)  | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-1B)  |
|  Sa2VA-4B  | [InternVL2.5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B)  |   [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)    | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-4B)  |
|  Sa2VA-8B  | [InternVL2.5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B)  | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat)  | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-8B)  |
| Sa2VA-26B  | [InternVL2.5-26B](https://huggingface.co/OpenGVLab/InternVL2_5-26B) | [internlm2_5-20b-chat](https://huggingface.co/internlm/internlm2_5-20b-chat) | [🤗 link](https://huggingface.co/ByteDance/Sa2VA-26B) |

## Sa2VA Performance
| Model Name |   MME    | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeVIS (val_u) | DAVIS |
|:----------:|:--------:|:-------:|:-------:|:--------:|:--------:|:-------------:|:-----:|
|  Sa2VA-1B  | 1504/434 |  71.9   |  79.6   |   73.6   |   77.7   |     53.4      | 69.5  |
|  Sa2VA-4B  | 1691/610 |  81.8   |  82.4   |   77.6   |   79.7   |     55.9      | 73.7  |
|  Sa2VA-8B  | 1690/610 |  84.4   |  82.6   |   78.0   |   80.3   |     58.9      | 75.9  |
| Sa2VA-26B  | 1698/653 |  85.8   |  82.9   |   79.3   |   81.2   |     61.8      | 78.6  |


## Quick Start

The following example code runs `Sa2VA` with `transformers`.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os

def get_rank_and_world_size():
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    return rank, world_size

def split_model(model_name):
    import math
    device_map = {}
    num_gpus = torch.cuda.device_count()
    rank, world_size = get_rank_and_world_size()
    num_gpus = num_gpus // world_size

    num_layers = {'Sa2VA-8B': 32, 'Sa2VA-26B': 48,
                  'Sa2VA-38B': 64, 'Sa2VA-78B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as 0.8 GPU.
    num_layers_per_gpu = math.ceil(num_layers / (num_gpus - 0.2))
    num_layers_per_gpu = [num_layers_per_gpu] * num_gpus
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.8)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = rank + world_size * i
            layer_cnt += 1
    device_map['vision_model'] = rank
    device_map['mlp1'] = rank
    device_map['language_model.model.tok_embeddings'] = rank
    device_map['language_model.model.embed_tokens'] = rank
    device_map['language_model.output'] = rank
    device_map['language_model.model.norm'] = rank
    device_map['language_model.lm_head'] = rank
    device_map[f'language_model.model.layers.{num_layers - 1}'] = rank
    device_map['grounding_encoder'] = rank
    device_map['text_hidden_fcs'] = rank
    return device_map

# load the model and tokenizer
path = "ByteDance/Sa2VA-26B"
device_map = split_model("Sa2VA-26B")
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# for image chat
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Please describe the image."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer

# for image chat with segmentation output
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Could you please give me a brief description of the image? Please respond with interleaved segmentation masks for the corresponding parts of the answer."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(1, h, w), ...)
    
# for chat with visual prompt (mask format) input
mask_prompts = np.load('/PATH/TO/pred_masks.npy') # np.array(n_prompts, h, w)
image_path = "/PATH/TO/IMAGE"
text_prompts = "<image>Can you provide me with a detailed description of the region in the picture marked by region1."
image = Image.open(image_path).convert('RGB')
input_dict = {
    'image': image,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': mask_prompts,
    'tokenizer': tokenizer,
    }
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer

# for video chat
video_folder = "/PATH/TO/VIDEO_FOLDER"
# sort frame files by name (assumes file names sort in temporal order)
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
if len(images_paths) > 5:  # uniformly sample 5 frames, keeping the first and last
    indices = np.linspace(0, len(images_paths) - 1, 5).round().astype(int)
    images_paths = [images_paths[i] for i in indices]
text_prompts = "<image>Please describe the video."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer


# for video chat with segmentation mask output
video_folder = "/PATH/TO/VIDEO_FOLDER"
# sort frame files by name (assumes file names sort in temporal order)
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
text_prompts = "<image>Please segment the person."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"] # the text format answer
masks = return_dict['prediction_masks']  # segmentation masks, list(np.array(n_frames, h, w), ...)
```
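
The segmentation examples above return `prediction_masks` as a list of arrays shaped `(1, h, w)` for images. To inspect them, one option is to blend each mask onto the input image. The sketch below is not part of Sa2VA: `overlay_masks`, `PALETTE`, and `alpha` are hypothetical names, and it assumes each mask has the same height and width as the image. A synthetic image and mask stand in for real model output.

```python
import numpy as np
from PIL import Image

# Hypothetical fixed palette; one color per mask, cycled if there are more masks.
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]

def overlay_masks(image, masks, alpha=0.5):
    """Blend each mask onto a copy of the image (assumes mask h, w match the image)."""
    canvas = np.array(image.convert("RGB")).astype(np.float32)
    for i, mask in enumerate(masks):
        region = np.asarray(mask)[0].astype(bool)  # (1, h, w) -> (h, w)
        color = np.array(PALETTE[i % len(PALETTE)], dtype=np.float32)
        canvas[region] = (1 - alpha) * canvas[region] + alpha * color
    return Image.fromarray(canvas.astype(np.uint8))

# Synthetic stand-ins for a real image and a real `prediction_masks` entry.
image = Image.new("RGB", (64, 48), (0, 0, 0))
mask = np.zeros((1, 48, 64), dtype=bool)
mask[0, 10:20, 5:25] = True

result = overlay_masks(image, [mask])
result.save("overlay.png")
```

For video output, where each entry is shaped `(n_frames, h, w)`, the same function only uses the first frame of each mask; looping over frames would be a straightforward extension.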

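To see how `split_model` in the Quick Start code spreads the language-model layers across GPUs, the per-GPU layer budget can be reproduced without any GPU. The sketch below copies only that arithmetic; `layers_per_gpu` is a hypothetical helper name, and the GPU counts are illustrative inputs.

```python
import math

def layers_per_gpu(num_layers, num_gpus):
    """Reproduce the per-GPU layer budget computed inside split_model."""
    # The first GPU also hosts the ViT, so it is counted as 0.8 of a GPU.
    per_gpu = math.ceil(num_layers / (num_gpus - 0.2))
    counts = [per_gpu] * num_gpus
    counts[0] = math.ceil(counts[0] * 0.8)
    return counts

# Sa2VA-26B has 48 language-model layers according to the table in split_model.
print(layers_per_gpu(48, 4))  # [11, 13, 13, 13]
print(layers_per_gpu(48, 2))  # [22, 27]
```

Because of the rounding, the budgets can add up to more than the number of layers (50 versus 48 with four GPUs), so they act as an upper bound per device. `split_model` then pins the last layer, along with the vision and grounding modules, to the first device.
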
## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2501.04001},
  year={2025}
}
```