|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- lmms-lab/LLaVA-OneVision-Data |
|
- nyu-visionx/Cambrian-Alignment |
|
base_model: |
|
- Qwen/Qwen2-7B-Instruct |
|
- google/siglip-so400m-patch14-384 |
|
--- |
|
# Model Card for Ross
|
|
|
|
|
|
Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data.

It is an auto-regressive language model based on the transformer architecture.

It incorporates an image reconstruction objective for enhanced multimodal comprehension capabilities.
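
At a high level, training couples the usual next-token prediction loss with an auxiliary objective that reconstructs the input image from the model's visual tokens. The snippet below is a deliberately simplified sketch of that coupling, not the exact Ross objective (the paper reconstructs latent image representations rather than raw pixels); every name and shape in it is hypothetical.

```python
import torch.nn.functional as F

def combined_loss(lm_logits, labels, visual_hidden, image_latents, alpha=0.5):
    """Toy illustration: next-token prediction plus an auxiliary
    image-reconstruction term. Hypothetical names/shapes, not the
    exact Ross objective."""
    # Standard auto-regressive LM loss (shift logits against labels).
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].flatten(0, 1), labels[:, 1:].flatten()
    )
    # Auxiliary term: regress visual hidden states onto image latents.
    recon_loss = F.mse_loss(visual_hidden, image_latents)
    return lm_loss + alpha * recon_loss
```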
|
|
|
## Model Sources |
|
|
|
|
|
|
- **Project Page:** http://haochen-wang409.github.io/ross

- **Repository:** https://github.com/Haochen-Wang409/ross

- **Paper:** https://arxiv.org/pdf/2410.09575
|
|
|
## Install |
|
|
|
If you are not using Linux, do *NOT* proceed. |
|
|
|
1. Clone this repository and navigate to the ross folder
|
```bash
git clone https://github.com/Haochen-Wang409/ross.git
cd ross
```
|
|
|
2. Install Package |
|
```bash
conda create -n ross python=3.10 -y
conda activate ross
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
|
|
|
3. Install additional packages for training cases |
|
```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
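
As a quick sanity check after installation (a hypothetical smoke test, not part of the official setup), you can confirm that the core dependencies import cleanly:

```python
# Hypothetical smoke test: verify the training dependencies are importable.
import torch
import flash_attn  # installed in step 3

print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
```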
|
|
|
## Usage |
|
```python
import torch
from PIL import Image

from ross.constants import IMAGE_TOKEN_INDEX  # assumes a LLaVA-style constants module
from ross.model.builder import load_pretrained_model
from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token

model_path = "HaochenWang/ross-qwen2-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

model.cuda()
model.eval()

image = Image.open("...")
prompt = "..."

# process_images expects a list of PIL images; match the model's device and dtype.
images_tensor = process_images(
    [image],
    image_processor,
    model.config,
).to(model.device, dtype=model.dtype)

input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
).unsqueeze(0).cuda()

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=images_tensor,
        do_sample=True,
        temperature=0.8,
        top_p=0.7,
        top_k=20,
        num_beams=5,
        max_new_tokens=512,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```
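
The prompt is left as a placeholder above. In LLaVA-style codebases it is usually built from a conversation template that inserts the image token automatically; the sketch below assumes ross keeps LLaVA's `conversation` module and uses a Qwen-2 template name, so verify both against the repository:

```python
# Sketch only: the module path and template name are assumptions carried
# over from the LLaVA codebase that ross is forked from.
from ross.constants import DEFAULT_IMAGE_TOKEN
from ross.conversation import conv_templates

conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
```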
|
|
|
## Citation |
|
|
|
If you find Ross useful for your research and applications, please cite using this BibTeX: |
|
```bibtex |
|
@article{wang2024ross, |
|
title={Reconstructive visual instruction tuning}, |
|
author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang}, |
|
journal={arXiv preprint arXiv:2410.09575}, |
|
year={2024} |
|
} |
|
``` |