---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- nyu-visionx/Cambrian-Alignment
base_model:
- Qwen/Qwen2-7B-Instruct
- google/siglip-so400m-patch14-384
---

# Model Card for Ross

Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data.
It is an auto-regressive language model based on the transformer architecture, and it is trained with an additional image reconstruction objective to enhance its multimodal comprehension.

## Model Sources

- **Repository:** http://haochen-wang409.github.io/ross
- **Paper:** https://arxiv.org/pdf/2410.09575

## Install

If you are not using Linux, do *NOT* proceed.

1. Clone this repository and navigate to the ross folder
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```

2. Install Package
```Shell
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

3. Install additional packages for training cases
```Shell
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```

## Usage

```python
import torch
from PIL import Image

from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
# Assumption: ross.constants mirrors LLaVA's constants module and exports IMAGE_TOKEN_INDEX.
from ross.constants import IMAGE_TOKEN_INDEX

model_path = "HaochenWang/ross-qwen2-7b"

# Load the tokenizer, model, and image processor from the checkpoint.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# process_images expects a list of PIL images.
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).cuda()

# Tokenize the prompt, mapping the image placeholder to IMAGE_TOKEN_INDEX.
input_ids = tokenizer_image_token(
    prompt,
    tokenizer,
    IMAGE_TOKEN_INDEX,
    return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```

## Citation

If you find Ross useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2024ross,
  title={Reconstructive visual instruction tuning},
  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2410.09575},
  year={2024}
}
```
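
## Note on the sampling arguments

The Usage example passes `temperature`, `top_k`, and `top_p` to `model.generate`. As a rough illustration of what these three arguments do to a next-token distribution, here is a minimal, dependency-free sketch. It is not the code path that Ross or Hugging Face Transformers actually runs, and real implementations differ in details such as tie handling and operation order; the function name `filter_logits` and the toy logits are hypothetical and exist only for this example.

```python
import math


def filter_logits(logits, temperature=0.8, top_k=20, top_p=0.7):
    """Hypothetical helper: return {token_id: probability} after
    temperature scaling, top-k filtering, and top-p (nucleus) filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token ids, then renormalise.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)
    probs_k = {i: probs[i] / mass_k for i in ranked}

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs_k[i]
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens so the result sums to 1.
    mass_p = sum(probs_k[i] for i in kept)
    return {i: probs_k[i] / mass_p for i in kept}


# Toy example with four token ids (0..3).
toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filter_logits(toy_logits, temperature=1.0, top_k=3, top_p=0.7))
```

With these toy logits, top-k drops token id 3 and top-p then keeps only token ids 0 and 1. A sampler would draw the next token from the returned distribution.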