Model Card for ross-vicuna-13b

Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture, and it incorporates an image reconstruction objective to enhance multimodal comprehension.

Model Sources

Repository: http://haochen-wang409.github.io/ross
Paper: https://arxiv.org/pdf/2410.09575

Install

If you are not using Linux, do NOT proceed.

Clone this repository and navigate to the ross folder

git clone https://github.com/Haochen-Wang409/ross.git
cd ross

Install Package

conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training

pip install -e ".[train]"
pip install flash-attn --no-build-isolation
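
Before moving on, it can help to confirm that the environment matches what the steps above set up. The following is a minimal sketch, not part of the Ross repository: `check_environment` is a hypothetical helper, and the package names it looks for (`torch`, `PIL`, `ross`) are taken from the imports in the Usage section. It only reports problems and never modifies the environment.

```python
import importlib.util
import platform
import sys


def check_environment(required=("torch", "PIL", "ross")):
    """Return a list of human-readable problems; an empty list means every check passed.

    Hypothetical helper for illustration; it is not shipped with Ross.
    """
    problems = []

    # The install notes say the setup is intended for Linux only.
    if platform.system() != "Linux":
        problems.append(f"unsupported OS: {platform.system()} (Linux is expected)")

    # The conda environment above is created with Python 3.10.
    if sys.version_info[:2] != (3, 10):
        version = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {version} differs from the 3.10 used in the install steps")

    # find_spec returns None for a top-level package that cannot be imported.
    for name in required:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty result suggests the editable install succeeded; it does not verify that a GPU is present or that flash-attn was built correctly.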

Usage

import torch
from PIL import Image

# Assumption: as in LLaVA, IMAGE_TOKEN_INDEX is defined in ross.constants.
from ross.constants import IMAGE_TOKEN_INDEX
from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token

model_path = "HaochenWang/ross-vicuna-13b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# Wrap the single image in a list: process_images takes a batch of images (as in LLaVA).
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).cuda()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
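
The usage code leaves `prompt` elided. As a minimal sketch of one way to fill it, assuming Ross follows the LLaVA convention in which the prompt contains an `<image>` placeholder that `tokenizer_image_token` replaces with `IMAGE_TOKEN_INDEX`: `build_prompt` and `IMAGE_PLACEHOLDER` below are hypothetical names for illustration, not part of the Ross API, and the sketch does not apply any conversation template the model may expect.

```python
# Assumption: LLaVA-style placeholder string marking where the image goes.
IMAGE_PLACEHOLDER = "<image>"


def build_prompt(question: str, image_token: str = IMAGE_PLACEHOLDER) -> str:
    """Prepend the image placeholder to a question unless it is already present.

    Hypothetical helper; a real prompt may also need the model's chat template.
    """
    question = question.strip()
    if image_token in question:
        # The caller already placed the image explicitly; leave the text alone.
        return question
    return f"{image_token}\n{question}"


if __name__ == "__main__":
    print(build_prompt("What is shown in this picture?"))
```

If the placeholder is missing, the image features have no position in the token sequence, so checking the prompt before tokenizing is a cheap safeguard.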

Citation

If you find Ross useful for your research and applications, please cite using this BibTeX:

@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}
