HaochenWang committed on
Commit
5fb7f6b
·
verified ·
1 Parent(s): a4dee5e

Upload README.md

  1. README.md +109 -3
README.md CHANGED
@@ -1,3 +1,109 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - lmms-lab/LLaVA-OneVision-Data
+ - nyu-visionx/Cambrian-Alignment
+ base_model:
+ - lmsys/vicuna-13b-v1.5
+ - google/siglip-so400m-patch14-384
+ ---
+ # Model Card for Model ID
11
+
12
+ <!-- Provide a quick summary of what the model is/does. -->
13
+
14
+ Ross is an open-source multimodal-chatbot trained by fine-tuning Qwen2/Vicuna on multimodal instruction-following data.
15
+ It is an auto-regressive language model, based on the transformer architecture.
16
+ It is incorperated with an image reconstruction objective for enhanced multimodal comprehension capabilities.
17
+
18
+ ## Model Sources
19
+
20
+ <!-- Provide the basic links for the model. -->
21
+
22
+ - **Repository:** http://haochen-wang409.github.io/ross
23
+ - **Paper:** https://arxiv.org/pdf/2410.09575
24
+
25
+ ## Install
26
+
27
+ If you are not using Linux, do *NOT* proceed.
28
+
29
+ 1. Clone this repository and navigate to LLaVA folder
30
+ ```bash
31
+ git clone https://github.com/Haochen-Wang409/ross.git
32
+ cd ross
33
+ ```
34
+
35
+ 2. Install Package
36
+ ```Shell
37
+ conda create -n ross python=3.10 -y
38
+ conda activate ross
39
+ pip install --upgrade pip # enable PEP 660 support
40
+ pip install -e .
41
+ ```
42
+
43
+ 3. Install additional packages for training cases
44
+ ```
45
+ pip install -e ".[train]"
46
+ pip install flash-attn --no-build-isolation
47
+ ```
48
+
49
+ ## Usage
50
+ ```python
51
+ import torch
52
+ from PIL import Image
53
+
54
+ from ross.model.builder import load_pretrained_model
55
+ from ross.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
56
+ from ross.eval.run_llava import eval_model
57
+
58
+ model_path = "HaochenWang/ross-vicuna-13b"
59
+
60
+ tokenizer, model, image_processor, context_len = load_pretrained_model(
61
+ model_path=model_path,
62
+ model_base=None,
63
+ model_name=get_model_name_from_path(model_path)
64
+ )
65
+
66
+ model.cuda()
67
+ model.eval()
68
+
69
+ image = Image.open("...")
70
+ prompt = "..."
71
+
72
+ images_tensor = process_images(
73
+ images,
74
+ image_processor,
75
+ model.config,
76
+ ).cuda()
77
+
78
+ input_ids = tokenizer_image_token(
79
+ prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt",
80
+ ).unsqueeze(0).cuda()
81
+
82
+ with torch.inference_mode():
83
+ output_ids = model.generate(
84
+ input_ids,
85
+ images=images_tensor,
86
+ do_sample=True,
87
+ temperature=0.8,
88
+ top_p=0.7,
89
+ top_k=20,
90
+ num_beams=5,
91
+ max_new_tokens=512,
92
+ use_cache=True,
93
+ )
94
+
95
+ outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
96
+ print(outputs)
97
+ ```
98
+
+ ## Citation
+
+ If you find Ross useful for your research and applications, please cite it using this BibTeX:
+ ```bibtex
+ @article{wang2024ross,
+   title={Reconstructive visual instruction tuning},
+   author={Wang, Haochen and Zheng, Anlin and Zhao, Yucheng and Wang, Tiancai and Ge, Zheng and Zhang, Xiangyu and Zhang, Zhaoxiang},
+   journal={arXiv preprint arXiv:2410.09575},
+   year={2024}
+ }
+ ```
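
A note on the Usage snippet in this README: `tokenizer_image_token` receives the prompt together with `IMAGE_TOKEN_INDEX`, but the card does not say how the two interact. The sketch below illustrates one plausible reading, assuming Ross follows the LLaVA convention in which the prompt contains an `<image>` placeholder that is swapped for a sentinel token id. The placeholder string, the sentinel value `-200`, and the helpers `toy_tokenize` and `splice_image_tokens` are all assumptions made for illustration; none of them is taken from the Ross codebase.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder string (LLaVA convention)
IMAGE_SENTINEL = -200          # assumed sentinel id (LLaVA's IMAGE_TOKEN_INDEX value)


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one non-negative id per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]


def splice_image_tokens(
    prompt: str,
    tokenize: Callable[[str], List[int]] = toy_tokenize,
    sentinel: int = IMAGE_SENTINEL,
) -> List[int]:
    """Tokenize the text around each placeholder and insert the sentinel id between chunks."""
    ids: List[int] = []
    for i, chunk in enumerate(prompt.split(IMAGE_PLACEHOLDER)):
        if i > 0:
            # A placeholder sat between the previous chunk and this one.
            ids.append(sentinel)
        ids.extend(tokenize(chunk))
    return ids


ids = splice_image_tokens("<image>\nDescribe this picture.")
print(ids[0], len(ids))
```

If this convention holds for Ross, a prompt without an `<image>` placeholder would yield no sentinel id, and the model would have no position at which to attend to the image features, so it is worth checking the prompt format against the repository before running the Usage snippet.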