metadata
license: apache-2.0
datasets:
  - lmms-lab/LLaVA-OneVision-Data
  - nyu-visionx/Cambrian-Alignment
base_model:
  - Qwen/Qwen2-7B-Instruct
  - google/siglip-so400m-patch14-384

Model Card for ross-qwen2-7b

Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture, and it incorporates an image reconstruction objective to enhance multimodal comprehension.

Model Sources

Install

If you are not using Linux, do NOT proceed.

  1. Clone this repository and navigate to the ross folder
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
  2. Install the package
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
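
After installation, a quick sanity check can confirm that the platform is Linux and that the expected packages are importable. The sketch below is not part of the Ross repository; the helper name `check_environment` is hypothetical, and it assumes the editable install exposes a top-level `ross` module (as the imports in the Usage section suggest) and that flash-attn is importable as `flash_attn`.

```python
import importlib.util
import platform


def check_environment(required=("torch", "ross"), optional=("flash_attn",)):
    """Summarise whether the install steps appear to have succeeded.

    Hypothetical helper, not part of the Ross codebase. It only checks that
    modules can be located; it does not import them or test the GPU.
    """
    report = {"is_linux": platform.system() == "Linux"}
    for name in tuple(required) + tuple(optional):
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    # Optional packages (training extras) do not block inference
    report["ready_for_inference"] = report["is_linux"] and all(
        report[name] for name in required
    )
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `flash_attn` is reported as missing, only the training extras are affected under the assumptions above; inference depends on the `required` entries.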

Usage

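import torch
from PIL import Image

from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
# IMAGE_TOKEN_INDEX is used below; importing it from ross.constants assumes the
# package keeps the LLaVA-style module layout.
from ross.constants import IMAGE_TOKEN_INDEX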

model_path = "HaochenWang/ross-qwen2-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# process_images expects a list of PIL images
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).cuda()

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
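
The call above sets `temperature`, `top_k`, and `top_p` together. As a rough illustration of what these three parameters do to a next-token distribution, here is a simplified pure-Python sketch. The function name `filter_distribution` is hypothetical, the toy logits are made up, and the real logic inside `model.generate` differs in detail (for example, it operates on tensors and also interacts with `num_beams`).

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=20, top_p=0.7):
    """Toy illustration of temperature, top-k, then top-p filtering.

    Returns a dict mapping surviving token indices to renormalised
    probabilities. Simplified sketch, not the library implementation.
    """
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep only the k most probable tokens, then renormalise
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in order)

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up values for illustration
    print(filter_distribution(toy_logits))
```

Lower `top_p` or `top_k` values shrink the set of candidate tokens, which makes outputs more conservative; sampling then draws from the surviving, renormalised probabilities.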

Citation

If you find Ross useful for your research and applications, please cite using this BibTeX:

@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}