shimmyshimmer committed on
Commit cc84fba
1 Parent(s): f9ab712

Update README.md

Files changed (1)
  1. README.md +129 -194
README.md CHANGED
---
license: other
license_name: qwen
license_link: https://huggingface.co/Qwen/QVQ-72B-Preview/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
base_model: Qwen/QVQ-72B-Preview
tags:
- chat
- qwen
library_name: transformers
---

# QVQ-72B-Preview

## Introduction

**QVQ-72B-Preview** is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities.

## Performance

| Benchmark         | **QVQ-72B-Preview** | o1-2024-12-17 | gpt-4o-2024-05-13 | Claude 3.5 Sonnet-20241022 | Qwen2-VL-72B |
|-------------------|---------------------|---------------|-------------------|----------------------------|--------------|
| MMMU (val)        | 70.3                | 77.3          | 69.1              | 70.4                       | 64.5         |
| MathVista (mini)  | 71.4                | 71.0          | 63.8              | 65.3                       | 70.5         |
| MathVision (full) | 35.9                | –             | 30.4              | 35.6                       | 25.9         |
| OlympiadBench     | 20.4                | –             | 25.9              | –                          | 11.2         |

**QVQ-72B-Preview** delivers strong results across these benchmarks. It scores 70.3 on MMMU (Massive Multi-discipline Multimodal Understanding), demonstrating broad multidisciplinary understanding and reasoning. Its substantial gains on MathVision highlight progress in mathematical reasoning, and its OlympiadBench result shows an improved ability to tackle challenging problems.

***But It's Not All Perfect: Acknowledging the Limitations***

While **QVQ-72B-Preview** exhibits promising performance that surpasses expectations, it is important to acknowledge several limitations:

1. **Language Mixing and Code-Switching:** The model may occasionally mix languages or switch between them unexpectedly, which can affect the clarity of its responses.
2. **Recursive Reasoning Loops:** The model can get caught in recursive reasoning loops, producing lengthy responses that never arrive at a final answer; a simple mitigation is sketched after this list.
3. **Safety and Ethical Considerations:** Robust safety measures are still needed to ensure reliable and safe behavior, and users should exercise caution when deploying this model.
4. **Performance and Benchmark Limitations:** Despite its improvements in visual reasoning, QVQ does not entirely replace Qwen2-VL-72B. During multi-step visual reasoning, the model may gradually lose focus on the image content, leading to hallucinations. Moreover, QVQ does not show significant improvement over Qwen2-VL-72B on basic recognition tasks such as identifying people, animals, or plants.

Note: Currently, the model supports only single-round dialogues with image inputs; it does not support video inputs.

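For the reasoning-loop issue above, a practical mitigation is to bound generation length and discourage repetition at decode time. The sketch below uses only standard `transformers` generation arguments (`max_new_tokens`, `repetition_penalty`, `do_sample`); it assumes the `model`, `processor`, and `inputs` objects built in the Quickstart section that follows, and the concrete values are illustrative, not tuned recommendations.

```python
# Minimal sketch: bound the reasoning trace and discourage repetition.
# Assumes `model`, `processor`, and `inputs` are built as in the Quickstart below;
# the values here are illustrative, not tuned recommendations.
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,      # hard cap on the length of the reasoning trace
    repetition_penalty=1.05,  # mildly penalize the model for looping on itself
    do_sample=False,          # greedy decoding keeps the output deterministic
)
response = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
```
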
## Quickstart

We provide a toolkit, `qwen-vl-utils`, to help you handle various types of visual input more conveniently, including base64-encoded data, URLs, and interleaved images and videos. You can install it with the following command:

```bash
pip install qwen-vl-utils
```

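The `messages` entries that `qwen_vl_utils` consumes can reference an image in several ways. The sketch below illustrates the common forms (an HTTP(S) URL, a local file path, and an inline base64 data URI); these follow the conventions documented for Qwen2-VL, so treat the exact spellings as an assumption to verify against the `qwen-vl-utils` documentation, and note that the local path and base64 payload are placeholders.

```python
# Illustrative only: common ways to reference an image in a message.
# These forms follow the Qwen2-VL / qwen-vl-utils conventions; verify them
# against the qwen-vl-utils documentation for your installed version.
image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png"  # remote URL
image_file = "file:///path/to/local/image.png"                              # hypothetical local path
image_base64 = "data:image;base64,<BASE64_BYTES>"                           # inline base64 (payload elided)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},  # swap in image_file or image_base64 as needed
            {"type": "text", "text": "What value should be filled in the blank space?"},
        ],
    }
]
```
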
The following code snippet shows how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/QVQ-72B-Preview", torch_dtype="auto", device_map="auto"
)

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")

# The default range for the number of visual tokens per image is 4-16384. You can set
# min_pixels and max_pixels according to your needs, e.g. a token budget of 256-1280,
# to balance speed and memory usage.
# min_pixels = 256 * 28 * 28
# max_pixels = 1280 * 28 * 28
# processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."}
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/QVQ/demo.png",
            },
            {"type": "text", "text": "What value should be filled in the blank space?"},
        ],
    },
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=8192)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
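
The commented `min_pixels`/`max_pixels` lines in the snippet above control how aggressively images are resized before being converted into visual tokens. The `256 * 28 * 28` form suggests that each visual token corresponds to roughly a 28x28-pixel area, so a budget of 256-1280 tokens maps to roughly 0.2-1.0 megapixels per image; treat that interpretation as an assumption to check against the Qwen2-VL documentation. A minimal sketch of configuring that trade-off, with illustrative (not tuned) token counts:

```python
from transformers import AutoProcessor

# Sketch: trade accuracy for speed/memory by bounding the visual-token budget.
# Assumed interpretation of the 256*28*28 pattern above: one visual token
# covers roughly a 28x28-pixel area, so token counts map to pixel budgets.
min_tokens, max_tokens = 256, 1280   # illustrative bounds, not tuned values
min_pixels = min_tokens * 28 * 28    # ~0.2 megapixels lower bound
max_pixels = max_tokens * 28 * 28    # ~1.0 megapixels upper bound

processor = AutoProcessor.from_pretrained(
    "Qwen/QVQ-72B-Preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

A larger pixel budget generally helps fine-grained recognition at the cost of latency and memory; since the default range already spans 4-16384 visual tokens per image, overriding it is mainly useful when you hit memory limits or need faster responses.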

## Citation

If you find our work helpful, feel free to cite us.

```bibtex
@misc{qvq-72b-preview,
  title = {QVQ: To See the World with Wisdom},
  url = {https://qwenlm.github.io/blog/qvq-72b-preview/},
  author = {Qwen Team},
  month = {December},
  year = {2024}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
```