---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- nyu-visionx/Cambrian-Alignment
base_model:
- Qwen/Qwen2-7B-Instruct
- google/siglip-so400m-patch14-384
---
# Model Card for Ross

Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data.
It is an auto-regressive language model based on the transformer architecture.
It incorporates an image reconstruction objective to enhance its multimodal comprehension capabilities.

## Model Sources

- **Repository:** http://haochen-wang409.github.io/ross
- **Paper:** https://arxiv.org/pdf/2410.09575

## Install

If you are not using Linux, do *NOT* proceed.

1. Clone this repository and navigate to the ross folder
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```

2. Install the package
```bash
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
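
The install steps above assume Linux and a Python 3.10 environment. A minimal sketch of a pre-flight check is shown here; `preflight` is a hypothetical helper written for illustration and is not part of the Ross repository.

```python
import platform
import sys


def preflight(system=None, version=None):
    """Return a list of problems with the environment (empty if none found).

    `system` and `version` default to the running interpreter's values;
    they are parameters only so the check can be exercised with other inputs.
    """
    system = system if system is not None else platform.system()
    version = tuple(version) if version is not None else tuple(sys.version_info[:2])

    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS {system!r}: the install steps assume Linux")
    if version != (3, 10):
        found = ".".join(map(str, version))
        problems.append(f"Python {found} found: the conda environment above pins 3.10")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment matches the install instructions.")
```

Running the script inside the activated `ross` environment prints one warning per mismatch. Treating a different Python version as a problem is a conservative assumption; other versions may work but are not covered by the steps above.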

## Usage
```python
import torch
from PIL import Image

from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
# Assumption: as in LLaVA, the image placeholder id is defined in `ross.constants`.
from ross.constants import IMAGE_TOKEN_INDEX

model_path = "HaochenWang/ross-qwen2-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# process_images takes a list of PIL images, so wrap the single image.
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).cuda()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```
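
The usage example passes `IMAGE_TOKEN_INDEX` to `tokenizer_image_token` without showing what that does to the prompt. The sketch below illustrates the idea under the assumption that Ross follows LLaVA's convention: the prompt contains an `<image>` placeholder, and the tokenizer output gets a sentinel id (`-200` in LLaVA) spliced in at that position so the model knows where to insert the visual features. `splice_image_token` and `toy_encode` are hypothetical names used only here, and the real function also handles details such as the BOS token.

```python
# Assumption: Ross inherits LLaVA's convention, where the text placeholder
# "<image>" maps to the sentinel id -200 (IMAGE_TOKEN_INDEX in LLaVA).
IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200


def splice_image_token(prompt, encode, image_token_index=IMAGE_TOKEN_INDEX):
    """Encode the text around each placeholder and put the sentinel id between the pieces."""
    chunks = prompt.split(IMAGE_PLACEHOLDER)
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel for every placeholder that separated two chunks.
            ids.append(image_token_index)
        ids.extend(encode(chunk))
    return ids


def toy_encode(text):
    """Stand-in for a real tokenizer: one id per character."""
    return [ord(ch) for ch in text]


ids = splice_image_token("<image>\nHi", toy_encode)
print(ids)  # [-200, 10, 72, 105]
```

If this convention holds for Ross, a prompt without the `<image>` placeholder would produce no sentinel id, and the image tensor passed to `generate` would have no position in the sequence to attach to.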

## Citation

If you find Ross useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}
```