|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- lmms-lab/LLaVA-OneVision-Data |
|
- nyu-visionx/Cambrian-Alignment |
|
base_model: |
|
- Qwen/Qwen2-7B-Instruct |
|
- google/siglip-so400m-patch14-384 |
|
--- |
|
# Model Card for Ross
|
|
|
|
|
|
Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data.

It is an auto-regressive language model based on the transformer architecture.

It incorporates an image reconstruction objective for enhanced multimodal comprehension capabilities.
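
At a high level, training couples the usual next-token prediction loss with an auxiliary objective that reconstructs the input image from the model's visual tokens. The snippet below is a deliberately simplified sketch of that coupling, not the exact Ross objective (the paper reconstructs latent image representations rather than raw pixels); every name and shape in it is hypothetical.

```python
import torch.nn.functional as F

def combined_loss(lm_logits, labels, visual_hidden, image_latents, alpha=0.5):
    """Toy illustration: next-token prediction plus an auxiliary
    image-reconstruction term. Hypothetical names/shapes, not the
    exact Ross objective."""
    # Standard auto-regressive LM loss (shift logits against labels).
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten()
    )
    # Auxiliary term: regress visual hidden states onto image latents.
    recon_loss = F.mse_loss(visual_hidden, image_latents)
    return lm_loss + alpha * recon_loss
```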
|
|
|
## Model Sources |
|
|
|
|
|
|
- **Project Page:** http://haochen-wang409.github.io/ross

- **Repository:** https://github.com/Haochen-Wang409/ross

- **Paper:** https://arxiv.org/pdf/2410.09575
|
|
|
## Install |
|
|
|
If you are not using Linux, do *NOT* proceed. |
|
|
|
1. Clone this repository and navigate to the ross folder
|
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```
|
|
|
2. Install Package |
|
```bash
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
|
|
|
3. Install additional packages for training cases |
|
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
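
As a quick sanity check after installation (a hypothetical smoke test, not part of the official setup), you can confirm that the core dependencies import cleanly:

```python
# Hypothetical smoke test: verify the training dependencies are importable.
import torch
import flash_attn  # installed in step 3

print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
```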
|
|
|
## Usage |
|
```python
import torch
from PIL import Image

from ross.constants import IMAGE_TOKEN_INDEX  # assumes a LLaVA-style constants module
from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token

model_path = "HaochenWang/ross-qwen2-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# process_images expects a list of PIL images; match the model's device and dtype.
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).to(model.device, dtype=model.dtype)

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```
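
The prompt is left as a placeholder above. In LLaVA-style codebases it is usually built from a conversation template that inserts the image token automatically; the sketch below assumes ross keeps LLaVA's `conversation` module and uses a Qwen-2 template name, so verify both against the repository:

```python
# Sketch only: the module path and template name are assumptions carried
# over from the LLaVA codebase that ross is forked from.
from ross.constants import DEFAULT_IMAGE_TOKEN
from ross.conversation import conv_templates

conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
```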
|
|
|
## Citation |
|
|
|
If you find Ross useful for your research and applications, please cite using this BibTeX: |
|
```bibtex |
|
@article{wang2024ross, |
|
title={Reconstructive visual instruction tuning}, |
|
author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang}, |
|
journal={arXiv preprint arXiv:2410.09575}, |
|
year={2024} |
|
} |
|
``` |