HaochenWang/ross-vicuna-13b


HaochenWang committed 10 days ago

Commit 5fb7f6b · verified · 1 Parent(s): a4dee5e

Upload README.md

Files changed (1)

README.md +109 -3


@@ -1,3 +1,109 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+datasets:
+- lmms-lab/LLaVA-OneVision-Data
+- nyu-visionx/Cambrian-Alignment
+base_model:
+- lmsys/vicuna-13b-v1.5
+- google/siglip-so400m-patch14-384
+---
+# Model Card for ross-vicuna-13b
+<!-- Provide a quick summary of what the model is/does. -->
+Ross is an open-source multimodal chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data; this checkpoint builds on Vicuna-13B v1.5.
+It is an auto-regressive language model based on the transformer architecture.
+It incorporates an image reconstruction objective to enhance multimodal comprehension.
+## Model Sources
+<!-- Provide the basic links for the model. -->
+- **Repository:** http://haochen-wang409.github.io/ross
+- **Paper:** https://arxiv.org/pdf/2410.09575
+## Install
+If you are not using Linux, do *NOT* proceed.
+1. Clone this repository and navigate to the ross folder
+```bash
+git clone https://github.com/Haochen-Wang409/ross.git
+cd ross
+```
+2. Install the package
+```bash
+conda create -n ross python=3.10 -y
+conda activate ross
+pip install --upgrade pip  # enable PEP 660 support
+pip install -e .
+```
+3. Install additional packages for training
+```bash
+pip install -e ".[train]"
+pip install flash-attn --no-build-isolation
+```
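The install steps above assume Linux and a Python 3.10 conda environment. A minimal pre-flight sketch of those two conditions follows; the helper name `check_environment` is hypothetical and is not part of the ross package.

```python
import platform
import sys


def check_environment(system=None, version=None):
    """Return a list of problems; an empty list means the setup matches the README.

    `system` and `version` can be passed explicitly for testing; by default
    they are read from the running interpreter.
    """
    system = system or platform.system()
    version = tuple(version or sys.version_info[:2])
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (the README requires Linux)")
    if version != (3, 10):
        problems.append(
            f"Python {version[0]}.{version[1]} found; the conda env above pins 3.10"
        )
    return problems


# Print a warning for each mismatch found on the current machine.
for problem in check_environment():
    print("WARNING:", problem)
```

Running this before `pip install -e .` surfaces a mismatched platform or interpreter early, before the heavier `flash-attn` build starts.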
+## Usage
+```python
+import torch
+from PIL import Image
+from ross.model.builder import load_pretrained_model
+from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
+# IMAGE_TOKEN_INDEX is used below; ross.constants is assumed to mirror llava.constants
+from ross.constants import IMAGE_TOKEN_INDEX
+model_path = "HaochenWang/ross-vicuna-13b"
+tokenizer, model, image_processor, context_len = load_pretrained_model(
+    model_path=model_path,
+    model_base=None,
+    model_name=get_model_name_from_path(model_path)
+)
+model.cuda()
+model.eval()
+image = Image.open("...")
+prompt = "..."
+images_tensor = process_images(
+    [image],  # pass the loaded PIL image as a one-element list
+    image_processor,
+    model.config,
+).cuda()
+input_ids = tokenizer_image_token(
+    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
+).unsqueeze(0).cuda()
+with torch.inference_mode():
+    output_ids = model.generate(
+        input_ids,
+        images=images_tensor,
+        do_sample=True,
+        temperature=0.8,
+        top_p=0.7,
+        top_k=20,
+        num_beams=5,
+        max_new_tokens=512,
+        use_cache=True,
+    )
+outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
+print(outputs)
+```
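The usage snippet passes `IMAGE_TOKEN_INDEX` to `tokenizer_image_token`. As a rough illustration of what that step does in LLaVA-style code, from which Ross appears to derive, the sketch below splits a prompt on an image placeholder and splices a sentinel id between the tokenized text chunks. The placeholder string `<image>`, the sentinel value `-200`, the example prompt, and the toy tokenizer are all assumptions for illustration and are not confirmed for Ross. The real function also handles special tokens such as BOS, which this sketch omits.

```python
# Assumed values, borrowed from LLaVA conventions; not confirmed for Ross.
IMAGE_PLACEHOLDER = "<image>"
IMAGE_TOKEN_INDEX = -200


def toy_tokenize(text, vocab):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def splice_image_token(prompt, vocab):
    """Tokenize the text around each placeholder and insert the sentinel id between chunks."""
    chunks = [toy_tokenize(chunk, vocab) for chunk in prompt.split(IMAGE_PLACEHOLDER)]
    ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel per placeholder; the model later swaps it for image features.
            ids.append(IMAGE_TOKEN_INDEX)
        ids.extend(chunk)
    return ids


vocab = {}
ids = splice_image_token("USER: <image>\nDescribe the picture. ASSISTANT:", vocab)
print(ids)
```

Under these assumptions, a prompt without the placeholder produces no sentinel id, so the image tensor would have no position to attach to.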
+## Citation
+If you find Ross useful for your research and applications, please cite using this BibTeX:
+```bibtex
+@article{wang2024ross,
+  title={Reconstructive visual instruction tuning},
+  author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
+  journal={arXiv preprint arXiv:2410.09575},
+  year={2024}
+}
+```