---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- nyu-visionx/Cambrian-Alignment
base_model:
- Qwen/Qwen2-7B-Instruct
- google/siglip-so400m-patch14-384
---
# Model Card for ross-qwen2-7b
<!-- Provide a quick summary of what the model is/does. -->
Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data.
It is an auto-regressive language model based on the transformer architecture.
It incorporates an image reconstruction objective to improve multimodal comprehension.
## Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** http://haochen-wang409.github.io/ross
- **Paper:** https://arxiv.org/pdf/2410.09575
## Install
If you are not using Linux, do *NOT* proceed.
1. Clone this repository and navigate to the ross folder
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```
2. Install the package
```bash
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip # enable PEP 660 support
pip install -e .
```
3. Install additional packages for training
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
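After installation, a quick sanity check can confirm that the environment matches what the steps above set up. The snippet below is a minimal sketch, not part of the Ross codebase: `check_environment` is a hypothetical helper, and the module names `torch`, `ross`, and `flash_attn` are assumed to be the import names of the packages installed above.

```python
import importlib.util
import platform
import sys


def check_environment(required_modules=("torch", "ross"),
                      optional_modules=("flash_attn",)):
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper: the module names are assumptions about what the
    install steps provide, not something exported by Ross itself.
    """
    problems = []

    # The install instructions target Linux only.
    if platform.system() != "Linux":
        problems.append(f"unsupported OS: {platform.system()} (Linux expected)")

    # The conda environment above is created with Python 3.10.
    if sys.version_info[:2] != (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python 3.10 expected, found {found}")

    # Required packages must be importable.
    for name in required_modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")

    # Optional packages are only needed for training.
    for name in optional_modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing optional package (training only): {name}")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "Environment looks OK.")
```

Run the script inside the activated `ross` environment; any line it prints points at a step that may need to be repeated.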
## Usage
```python
import torch
from PIL import Image

# Assumption: IMAGE_TOKEN_INDEX is defined in ross.constants, mirroring LLaVA's layout.
from ross.constants import IMAGE_TOKEN_INDEX
from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token

model_path = "HaochenWang/ross-qwen2-7b"

# Load the tokenizer, model, and image processor.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# Preprocess the image; it is passed as a one-element list.
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).cuda()

# Tokenize the prompt, mapping the image placeholder to IMAGE_TOKEN_INDEX.
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

# Generate a response without tracking gradients.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```
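The `temperature`, `top_k`, and `top_p` arguments in the example control how candidate tokens are filtered before one is sampled. The following is a minimal, dependency-free sketch of how these three settings are commonly applied to a single decoding step; `filter_next_token_probs` is a hypothetical illustration, not the code path used by `model.generate`, and it does not model beam search (`num_beams`).

```python
import math


def filter_next_token_probs(logits, temperature=0.8, top_k=20, top_p=0.7):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering.

    Illustrative only: a single decoding step over a plain list of logits.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring token indices, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    print(filter_next_token_probs(toy_logits))
```

Lowering `top_p` or `top_k` shrinks the set of tokens that can be sampled, while lowering `temperature` concentrates probability on the highest-scoring ones.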
## Citation
If you find Ross useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2024ross,
title={Reconstructive visual instruction tuning},
author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
journal={arXiv preprint arXiv:2410.09575},
year={2024}
}
```