czczup committed on
Commit
51f4dcf
·
verified ·
1 Parent(s): 7ccb35d

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ examples/red-panda.mp4 filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,29 +1,24 @@
1
  ---
2
  license: mit
3
- datasets:
4
- - laion/laion2B-en
5
- - laion/laion-coco
6
- - laion/laion2B-multi
7
- - kakaobrain/coyo-700m
8
- - conceptual_captions
9
- - wanng/wukong100m
10
  pipeline_tag: image-text-to-text
11
  ---
12
 
13
- # Model Card for InternVL-Chat-V1-5
14
 
15
- <p align="center">
16
- <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
17
- </p>
18
 
19
- > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
20
 
21
- [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
22
 
23
- [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
 
 
24
 
 
25
 
26
  We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
 
27
  We introduce three simple designs:
28
 
29
  1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model, InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused in different LLMs.
@@ -46,25 +41,26 @@ We introduce three simple designs:
46
  - Learnable component in the finetuning stage: ViT + MLP + LLM
47
  - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
48
 
49
- ## Released Models
50
-
51
- | Model | Vision Foundation Model | Release Date | Note |
52
- | :----------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------: | :----------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
53
- | InternVL-Chat-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
54
- | InternVL-Chat-V1-2-Plus(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger |
55
- | InternVL-Chat-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2) ) | InternViT-6B-448px-V1-2(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scaling up LLM to 34B |
56
- | InternVL-Chat-V1-1(🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0(🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | support Chinese and stronger OCR |
57
-
58
  ## Architecture
59
 
60
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YLvX3V-L0kwsyRn3Lhciw.png)
61
 
62
  ## Performance
63
 
 
 
64
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4b85G7txoJ_LpT19SZJ4A.png)
65
 
66
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/i2vp6zSHPS3UIr-1Q9cSe.png)
67
 
68
  ## Examples
69
 
70
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YVr-93mvVMR6UFpGezns7.png)
@@ -74,22 +70,20 @@ We introduce three simple designs:
74
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/FwlSRBpKgURAVkXNOLoSp.png)
75
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/to3nOaAnyv-fGLEoNPLzz.png)
76
 
77
- ## Model Usage
78
 
79
  We provide example code to run InternVL-Chat-V1-5 using `transformers`.
80
 
81
- You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.
82
-
83
  > Please use transformers==4.37.2 to ensure the model works normally.
84
 
85
  ```python
86
- from transformers import AutoTokenizer, AutoModel
87
  import torch
88
  import torchvision.transforms as T
 
89
  from PIL import Image
90
-
91
  from torchvision.transforms.functional import InterpolationMode
92
-
93
 
94
  IMAGENET_MEAN = (0.485, 0.456, 0.406)
95
  IMAGENET_STD = (0.229, 0.224, 0.225)
@@ -169,7 +163,8 @@ def load_image(image_file, input_size=448, max_num=6):
169
  pixel_values = torch.stack(pixel_values)
170
  return pixel_values
171
 
172
- path = "OpenGVLab/InternVL-Chat-V1-5"
 
173
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
174
  model = AutoModel.from_pretrained(
175
  path,
@@ -192,53 +187,244 @@ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16)
192
 
193
  generation_config = dict(
194
  num_beams=1,
195
- max_new_tokens=512,
196
  do_sample=False,
197
  )
198
 
199
- # single-round single-image conversation
200
- question = "请详细描述图片" # Please describe the picture in detail
201
  response = model.chat(tokenizer, pixel_values, question, generation_config)
202
- print(question, response)
 
203
 
204
- # multi-round single-image conversation
205
- question = "请详细描述图片" # Please describe the picture in detail
206
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
207
- print(question, response)
 
208
 
209
- question = "请根据图片写一首诗" # Please write a poem according to the picture
210
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
211
- print(question, response)
 
212
 
213
- # multi-round multi-image conversation
214
  pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
215
  pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
216
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
217
 
218
- question = "详细描述这两张图片" # Describe the two pictures in detail
219
- response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
220
- print(question, response)
221
 
222
- question = "这两张图片的相同点和区别分别是什么" # What are the similarities and differences between these two pictures
223
- response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
224
- print(question, response)
 
 
225
 
226
- # batch inference (single image per sample)
227
  pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
228
  pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
229
- image_counts = [pixel_values1.size(0), pixel_values2.size(0)]
230
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
231
 
232
- questions = ["Describe the image in detail."] * len(image_counts)
233
  responses = model.batch_chat(tokenizer, pixel_values,
234
- image_counts=image_counts,
235
  questions=questions,
236
  generation_config=generation_config)
237
  for question, response in zip(questions, responses):
238
- print(question)
239
- print(response)
240
  ```
241
 
242
  ## Citation
243
 
244
  If you find this project useful in your research, please consider citing:
@@ -257,11 +443,3 @@ If you find this project useful in your research, please consider citing:
257
  year={2024}
258
  }
259
  ```
260
-
261
- ## License
262
-
263
- This project is released under the MIT license.
264
-
265
- ## Acknowledgement
266
-
267
- InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
 
1
  ---
2
  license: mit
3
  pipeline_tag: image-text-to-text
4
  ---
5
 
6
+ # InternVL-Chat-V1-5
7
 
8
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821)
 
 
9
 
10
+ [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
11
 
12
+ ## Introduction
13
 
14
+ <p align="center">
15
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300">
16
+ </p>
17
 
18
+ > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
19
 
20
  We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
21
+
22
  We introduce three simple designs:
23
 
24
  1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model, InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused in different LLMs.
 
41
  - Learnable component in the finetuning stage: ViT + MLP + LLM
42
  - For more details on training hyperparameters, take a look at our code: [pretrain](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_pretrain.sh) | [finetune](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internlm2_20b_dynamic/internvl_chat_v1_5_internlm2_20b_dynamic_res_finetune.sh)
43
 
44
  ## Architecture
45
 
46
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YLvX3V-L0kwsyRn3Lhciw.png)
47
 
48
  ## Performance
49
 
50
+ ### Image Benchmarks
51
+
52
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4b85G7txoJ_LpT19SZJ4A.png)
53
 
54
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/i2vp6zSHPS3UIr-1Q9cSe.png)
55
 
56
+ - We use both the InternVL and VLMEvalKit repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained with the InternVL repository, while MMMU, OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.
57
+
58
+ - Please note that evaluating the same model using different testing toolkits like InternVL and VLMEvalKit can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
59
+
60
+ - It is important to mention that the MMVet scores we report are evaluated using GPT-4-0613 as the judge model. Different versions of GPT-4 can lead to significant variations in the scores for this dataset. For instance, using GPT-4-Turbo would result in significantly lower scores.
61
+
62
+ Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
63
+
64
  ## Examples
65
 
66
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YVr-93mvVMR6UFpGezns7.png)
 
70
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/FwlSRBpKgURAVkXNOLoSp.png)
71
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/to3nOaAnyv-fGLEoNPLzz.png)
72
 
73
+ ## Quick Start
74
 
75
  We provide example code to run InternVL-Chat-V1-5 using `transformers`.
76
 
 
 
77
  > Please use transformers==4.37.2 to ensure the model works normally.
78
 
79
  ```python
80
+ import numpy as np
81
  import torch
82
  import torchvision.transforms as T
83
+ from decord import VideoReader, cpu
84
  from PIL import Image
 
85
  from torchvision.transforms.functional import InterpolationMode
86
+ from transformers import AutoModel, AutoTokenizer
87
 
88
  IMAGENET_MEAN = (0.485, 0.456, 0.406)
89
  IMAGENET_STD = (0.229, 0.224, 0.225)
 
163
  pixel_values = torch.stack(pixel_values)
164
  return pixel_values
165
 
166
+
167
+ path = 'OpenGVLab/InternVL-Chat-V1-5'
168
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
169
  model = AutoModel.from_pretrained(
170
  path,
 
187
 
188
  generation_config = dict(
189
  num_beams=1,
190
+ max_new_tokens=1024,
191
  do_sample=False,
192
  )
193
 
194
+ # pure-text conversation (纯文本对话)
195
+ question = 'Hello, who are you?'
196
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
197
+ print(f'User: {question}')
198
+ print(f'Assistant: {response}')
199
+
200
+ question = 'Can you tell me a story?'
201
+ response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
202
+ print(f'User: {question}')
203
+ print(f'Assistant: {response}')
204
+
205
+ # single-image single-round conversation (单图单轮对话)
206
+ question = '<image>\nPlease describe the image shortly.'
207
  response = model.chat(tokenizer, pixel_values, question, generation_config)
208
+ print(f'User: {question}')
209
+ print(f'Assistant: {response}')
210
 
211
+ # single-image multi-round conversation (单图多轮对话)
212
+ question = '<image>\nPlease describe the image in detail.'
213
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
214
+ print(f'User: {question}')
215
+ print(f'Assistant: {response}')
216
 
217
+ question = 'Please write a poem according to the image.'
218
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
219
+ print(f'User: {question}')
220
+ print(f'Assistant: {response}')
221
 
222
+ # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
223
  pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
224
  pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
225
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
226
 
227
+ question = '<image>\nDescribe the two images in detail.'
228
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
229
+ history=None, return_history=True)
230
 
231
+ question = 'What are the similarities and differences between these two images?'
232
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
233
+ history=history, return_history=True)
234
+ print(f'User: {question}')
235
+ print(f'Assistant: {response}')
236
 
237
+ # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
238
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
239
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
240
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
241
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
242
+
243
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
244
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
245
+ num_patches_list=num_patches_list,
246
+ history=None, return_history=True)
247
+ print(f'User: {question}')
248
+ print(f'Assistant: {response}')
249
+
250
+ question = 'What are the similarities and differences between these two images?'
251
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
252
+ num_patches_list=num_patches_list,
253
+ history=history, return_history=True)
254
+ print(f'User: {question}')
255
+ print(f'Assistant: {response}')
256
+
257
+ # batch inference, single image per sample (单图批处理)
258
  pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
259
  pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
260
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
261
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
262
 
263
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
264
  responses = model.batch_chat(tokenizer, pixel_values,
265
+ num_patches_list=num_patches_list,
266
  questions=questions,
267
  generation_config=generation_config)
268
  for question, response in zip(questions, responses):
269
+ print(f'User: {question}')
270
+ print(f'Assistant: {response}')
271
+
272
+ # video multi-round conversation (视频多轮对话)
273
+ def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
274
+ if bound:
275
+ start, end = bound[0], bound[1]
276
+ else:
277
+ start, end = -100000, 100000
278
+ start_idx = max(first_idx, round(start * fps))
279
+ end_idx = min(round(end * fps), max_frame)
280
+ seg_size = float(end_idx - start_idx) / num_segments
281
+ frame_indices = np.array([
282
+ int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
283
+ for idx in range(num_segments)
284
+ ])
285
+ return frame_indices
286
+
287
+ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
288
+ vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
289
+ max_frame = len(vr) - 1
290
+ fps = float(vr.get_avg_fps())
291
+
292
+ pixel_values_list, num_patches_list = [], []
293
+ transform = build_transform(input_size=input_size)
294
+ frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
295
+ for frame_index in frame_indices:
296
+ img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
297
+ img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
298
+ pixel_values = [transform(tile) for tile in img]
299
+ pixel_values = torch.stack(pixel_values)
300
+ num_patches_list.append(pixel_values.shape[0])
301
+ pixel_values_list.append(pixel_values)
302
+ pixel_values = torch.cat(pixel_values_list)
303
+ return pixel_values, num_patches_list
304
+
305
+
306
+ video_path = './examples/red-panda.mp4'
307
+ # pixel_values, num_patches_list = load_video(video_path, num_segments=32, max_num=1)
308
+ pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
309
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
310
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
311
+ question = video_prefix + 'What is the red panda doing?'
312
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
313
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
314
+ num_patches_list=num_patches_list,
315
+ history=None, return_history=True)
316
+ print(f'User: {question}')
317
+ print(f'Assistant: {response}')
318
+
319
+ question = 'Describe this video in detail. Don\'t repeat.'
320
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
321
+ num_patches_list=num_patches_list,
322
+ history=history, return_history=True)
323
+ print(f'User: {question}')
324
+ print(f'Assistant: {response}')
325
  ```
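
To see which frames the video example samples, the following standalone sketch mirrors the arithmetic of `get_index` in pure Python, without `numpy` or `decord`. The function name `sample_frame_indices` and the 241-frame, 30 fps clip are illustrative assumptions and are not part of the model code.

```python
def sample_frame_indices(num_frames, fps, num_segments=8, bound=None, first_idx=0):
    """Pick one frame near the middle of each of `num_segments` equal segments.

    Hypothetical helper that reproduces the arithmetic of `get_index` above.
    `bound` is an optional (start_seconds, end_seconds) window.
    """
    max_frame = num_frames - 1
    if bound:
        start, end = bound
    else:
        # No window given: use sentinels that get clamped to the clip below.
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    return [
        int(start_idx + seg_size / 2 + round(seg_size * idx))
        for idx in range(num_segments)
    ]


# An assumed 8-second clip at 30 fps (241 frames), sampled into 8 segments.
print(sample_frame_indices(num_frames=241, fps=30.0, num_segments=8))
# → [15, 45, 75, 105, 135, 165, 195, 225]
```

Each sampled frame becomes one `Frame{i}: <image>` entry in the prompt, so `num_segments` directly controls how many image placeholders the question contains.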
326
 
327
+ ## Deployment
328
+
329
+ ### LMDeploy
330
+
331
+ LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
332
+
333
+ ```sh
334
+ pip install lmdeploy
335
+ ```
336
+
337
+ LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
338
+
339
+ #### A 'Hello, world' example
340
+
341
+ ```python
342
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
343
+ from lmdeploy.vl import load_image
344
+
345
+ model = 'OpenGVLab/InternVL-Chat-V1-5'
346
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
347
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
348
+ pipe = pipeline(model, chat_template_config=chat_template_config,
349
+ backend_config=TurbomindEngineConfig(session_len=8192))
350
+ response = pipe(('describe this image', image))
351
+ print(response.text)
352
+ ```
353
+
354
+ If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.
355
+
356
+ #### Multi-image inference
357
+
358
+ When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
359
+
360
+ ```python
361
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
362
+ from lmdeploy.vl import load_image
363
+ from lmdeploy.vl.constants import IMAGE_TOKEN
364
+
365
+ model = 'OpenGVLab/InternVL-Chat-V1-5'
366
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
367
+ pipe = pipeline(model, chat_template_config=chat_template_config,
368
+ backend_config=TurbomindEngineConfig(session_len=8192))
369
+
370
+ image_urls=[
371
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
372
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
373
+ ]
374
+
375
+ images = [load_image(img_url) for img_url in image_urls]
376
+ # Numbering images improves multi-image conversations
377
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
378
+ print(response.text)
379
+ ```
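
As a rough guide to how quickly multiple images consume the context window, the sketch below estimates an upper bound on visual tokens per image. It uses values that appear in this commit's config (`force_image_size` 448, `patch_size` 14, `downsample_ratio` 0.5, `use_thumbnail` true). The per-tile formula, which squares the downsample ratio, and the helper names are assumptions made for illustration. Check them against the model code before relying on the numbers.

```python
def tokens_per_tile(image_size=448, patch_size=14, downsample_ratio=0.5):
    """Assumed number of visual tokens produced by one 448x448 tile."""
    patches_per_side = image_size // patch_size          # 448 // 14 = 32
    # Assumption: the downsample ratio applies to both spatial dimensions.
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


def max_image_tokens(max_num, use_thumbnail=True, **tile_kwargs):
    """Upper bound on visual tokens for one image split into at most `max_num` tiles."""
    # Assumption: a thumbnail tile is only added when the image is split into >1 tile.
    tiles = max_num + (1 if use_thumbnail and max_num > 1 else 0)
    return tiles * tokens_per_tile(**tile_kwargs)


per_image = max_image_tokens(max_num=12)   # tile limit taken from max_dynamic_patch
for n_images in (1, 2, 3):
    print(n_images, 'image(s): up to', n_images * per_image, 'visual tokens')
```

Under these assumptions, three large images could already exceed a `session_len` of 8192 before any text is counted, which is why the context window typically needs to be increased for multi-image prompts.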
380
+
381
+ #### Batch prompts inference
382
+
383
+ Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
384
+
385
+ ```python
386
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
387
+ from lmdeploy.vl import load_image
388
+
389
+ model = 'OpenGVLab/InternVL-Chat-V1-5'
390
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
391
+ pipe = pipeline(model, chat_template_config=chat_template_config,
392
+ backend_config=TurbomindEngineConfig(session_len=8192))
393
+
394
+ image_urls=[
395
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
396
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
397
+ ]
398
+ prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
399
+ response = pipe(prompts)
400
+ print(response)
401
+ ```
402
+
403
+ #### Multi-turn conversation
404
+
405
+ There are two ways to hold a multi-turn conversation with the pipeline. One is to construct messages in the OpenAI format and pass them to the pipeline as in the examples above; the other is to use the `pipeline.chat` interface.
406
+
407
+ ```python
408
+ from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
409
+ from lmdeploy.vl import load_image
410
+
411
+ model = 'OpenGVLab/InternVL-Chat-V1-5'
412
+ chat_template_config = ChatTemplateConfig('internvl-internlm2')
413
+ pipe = pipeline(model, chat_template_config=chat_template_config,
414
+ backend_config=TurbomindEngineConfig(session_len=8192))
415
+
416
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
417
+ gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
418
+ sess = pipe.chat(('describe this image', image), gen_config=gen_config)
419
+ print(sess.response.text)
420
+ sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
421
+ print(sess.response.text)
422
+ ```
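
The OpenAI-format route is not shown above, so here is a minimal sketch of how such a message list might be assembled. The helpers `build_user_message` and `append_turn` are hypothetical names. The assumption that the pipeline accepts this message structure directly should be verified against the LMDeploy documentation for your installed version.

```python
def build_user_message(text, image_urls=()):
    """Build one OpenAI-style user message with optional image URLs."""
    content = [{'type': 'text', 'text': text}]
    for url in image_urls:
        content.append({'type': 'image_url', 'image_url': {'url': url}})
    return {'role': 'user', 'content': content}


def append_turn(messages, assistant_text, next_user_text):
    """Return a new history extended with the model's reply and the next question."""
    return messages + [
        {'role': 'assistant', 'content': assistant_text},
        build_user_message(next_user_text),
    ]


image_url = 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg'
messages = [build_user_message('describe this image', [image_url])]

# Assumed usage with the pipeline created earlier (not executed here):
#   first = pipe(messages)
#   messages = append_turn(messages, first.text, 'What is the woman doing?')
#   second = pipe(messages)
```

Compared with `pipe.chat`, this approach keeps the conversation state in a plain list that the caller owns, which can be convenient when the history has to be stored or edited between turns.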
423
+
424
+ ## License
425
+
426
+ This project is released under the MIT license, while InternLM is licensed under the Apache-2.0 license.
427
+
428
  ## Citation
429
 
430
  If you find this project useful in your research, please consider citing:
 
443
  year={2024}
444
  }
445
  ```
 
config.json CHANGED
@@ -1,6 +1,5 @@
1
  {
2
  "_commit_hash": null,
3
- "_name_or_path": "OpenGVLab/InternVL-Chat-V1-5",
4
  "architectures": [
5
  "InternVLChatModel"
6
  ],
@@ -13,7 +12,7 @@
13
  "dynamic_image_size": true,
14
  "force_image_size": 448,
15
  "llm_config": {
16
- "_name_or_path": "pretrained/internlm2-chat-20b/",
17
  "add_cross_attention": false,
18
  "architectures": [
19
  "InternLM2ForCausalLM"
@@ -92,111 +91,52 @@
92
  "tie_word_embeddings": false,
93
  "tokenizer_class": null,
94
  "top_k": 50,
95
- "top_p": 1.0,
96
  "torch_dtype": "bfloat16",
97
  "torchscript": false,
98
- "transformers_version": "4.36.2",
99
  "typical_p": 1.0,
100
- "use_bfloat16": false,
101
  "use_cache": true,
102
  "vocab_size": 92553
103
  },
104
  "max_dynamic_patch": 12,
105
  "min_dynamic_patch": 1,
106
  "model_type": "internvl_chat",
107
- "pad2square": false,
108
  "ps_version": "v2",
109
  "select_layer": -1,
110
  "template": "internlm2-chat",
111
  "torch_dtype": "bfloat16",
112
- "transformers_version": null,
113
  "use_backbone_lora": 0,
114
  "use_llm_lora": 0,
115
  "use_thumbnail": true,
116
  "vision_config": {
117
- "_name_or_path": "OpenGVLab/InternViT-6B-448px-V1-5",
118
- "add_cross_attention": false,
119
  "architectures": [
120
  "InternVisionModel"
121
  ],
122
  "attention_dropout": 0.0,
123
- "auto_map": {
124
- "AutoConfig": "configuration_intern_vit.InternVisionConfig",
125
- "AutoModel": "modeling_intern_vit.InternVisionModel"
126
- },
127
- "bad_words_ids": null,
128
- "begin_suppress_tokens": null,
129
- "bos_token_id": null,
130
- "chunk_size_feed_forward": 0,
131
- "cross_attention_hidden_size": null,
132
- "decoder_start_token_id": null,
133
- "diversity_penalty": 0.0,
134
- "do_sample": false,
135
- "drop_path_rate": 0.4,
136
  "dropout": 0.0,
137
- "early_stopping": false,
138
- "encoder_no_repeat_ngram_size": 0,
139
- "eos_token_id": null,
140
- "exponential_decay_length_penalty": null,
141
- "finetuning_task": null,
142
- "forced_bos_token_id": null,
143
- "forced_eos_token_id": null,
144
  "hidden_act": "gelu",
145
  "hidden_size": 3200,
146
- "id2label": {
147
- "0": "LABEL_0",
148
- "1": "LABEL_1"
149
- },
150
  "image_size": 448,
151
  "initializer_factor": 0.1,
152
  "initializer_range": 1e-10,
153
  "intermediate_size": 12800,
154
- "is_decoder": false,
155
- "is_encoder_decoder": false,
156
- "label2id": {
157
- "LABEL_0": 0,
158
- "LABEL_1": 1
159
- },
160
  "layer_norm_eps": 1e-06,
161
- "length_penalty": 1.0,
162
- "max_length": 20,
163
- "min_length": 0,
164
  "model_type": "intern_vit_6b",
165
- "no_repeat_ngram_size": 0,
166
  "num_attention_heads": 25,
167
- "num_beam_groups": 1,
168
- "num_beams": 1,
169
  "num_channels": 3,
170
  "num_hidden_layers": 45,
171
- "num_return_sequences": 1,
172
  "output_attentions": false,
173
  "output_hidden_states": false,
174
- "output_scores": false,
175
- "pad_token_id": null,
176
  "patch_size": 14,
177
- "prefix": null,
178
- "problem_type": null,
179
- "pruned_heads": {},
180
  "qk_normalization": true,
181
  "qkv_bias": false,
182
- "remove_invalid_values": false,
183
- "repetition_penalty": 1.0,
184
  "return_dict": true,
185
- "return_dict_in_generate": false,
186
- "sep_token_id": null,
187
- "suppress_tokens": null,
188
- "task_specific_params": null,
189
- "temperature": 1.0,
190
- "tf_legacy_loss": false,
191
- "tie_encoder_decoder": false,
192
- "tie_word_embeddings": true,
193
- "tokenizer_class": null,
194
- "top_k": 50,
195
- "top_p": 1.0,
196
  "torch_dtype": "bfloat16",
197
- "torchscript": false,
198
- "transformers_version": "4.36.2",
199
- "typical_p": 1.0,
200
  "use_bfloat16": true,
201
  "use_flash_attn": true
202
  }
 
1
  {
2
  "_commit_hash": null,
 
3
  "architectures": [
4
  "InternVLChatModel"
5
  ],
 
12
  "dynamic_image_size": true,
13
  "force_image_size": 448,
14
  "llm_config": {
15
+ "_name_or_path": "internlm/internlm2-chat-20b",
16
  "add_cross_attention": false,
17
  "architectures": [
18
  "InternLM2ForCausalLM"
 
91
  "tie_word_embeddings": false,
92
  "tokenizer_class": null,
93
  "top_k": 50,
94
+ "top_p": null,
95
  "torch_dtype": "bfloat16",
96
  "torchscript": false,
97
+ "transformers_version": "4.37.2",
98
  "typical_p": 1.0,
99
+ "use_bfloat16": true,
100
  "use_cache": true,
101
  "vocab_size": 92553
102
  },
103
  "max_dynamic_patch": 12,
104
  "min_dynamic_patch": 1,
105
  "model_type": "internvl_chat",
 
106
  "ps_version": "v2",
107
  "select_layer": -1,
108
  "template": "internlm2-chat",
109
  "torch_dtype": "bfloat16",
 
110
  "use_backbone_lora": 0,
111
  "use_llm_lora": 0,
112
  "use_thumbnail": true,
113
  "vision_config": {
 
 
114
  "architectures": [
115
  "InternVisionModel"
116
  ],
117
  "attention_dropout": 0.0,
118
+ "drop_path_rate": 0.0,
119
  "dropout": 0.0,
120
  "hidden_act": "gelu",
121
  "hidden_size": 3200,
 
 
 
 
122
  "image_size": 448,
123
  "initializer_factor": 0.1,
124
  "initializer_range": 1e-10,
125
  "intermediate_size": 12800,
126
  "layer_norm_eps": 1e-06,
 
 
 
127
  "model_type": "intern_vit_6b",
128
+ "norm_type": "rms_norm",
129
  "num_attention_heads": 25,
 
 
130
  "num_channels": 3,
131
  "num_hidden_layers": 45,
 
132
  "output_attentions": false,
133
  "output_hidden_states": false,
 
 
134
  "patch_size": 14,
 
 
 
135
  "qk_normalization": true,
136
  "qkv_bias": false,
 
 
137
  "return_dict": true,
138
  "torch_dtype": "bfloat16",
139
+ "transformers_version": "4.37.2",
 
 
140
  "use_bfloat16": true,
141
  "use_flash_attn": true
142
  }
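
As a rough illustration of what the values in this config imply, the sketch below estimates how many visual tokens one image can produce. It uses `image_size`, `patch_size`, `max_dynamic_patch`, and `use_thumbnail` from the JSON above. Three things are assumptions, since the modeling code is not part of this excerpt: that `downsample_ratio` is 0.5 (the default in `configuration_internvl_chat.py`), that the token count per tile scales with the square of that ratio, and that the thumbnail adds exactly one extra tile when more than one tile is used. The function names are hypothetical.

```python
# Sketch only: token arithmetic under the assumptions stated in the lead-in.

def tokens_per_tile(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    """Patches in one tile, reduced by the (assumed) spatial downsampling."""
    patches_per_side = image_size // patch_size          # 448 // 14
    return int(patches_per_side ** 2 * downsample_ratio ** 2)


def max_visual_tokens(image_size: int, patch_size: int, downsample_ratio: float,
                      max_dynamic_patch: int, use_thumbnail: bool) -> int:
    """Upper bound on visual tokens for one image at the largest tiling."""
    # Assumption: the thumbnail is one extra tile, added only for multi-tile images.
    extra = 1 if use_thumbnail and max_dynamic_patch > 1 else 0
    tiles = max_dynamic_patch + extra
    return tiles * tokens_per_tile(image_size, patch_size, downsample_ratio)


per_tile = tokens_per_tile(448, 14, 0.5)
upper_bound = max_visual_tokens(448, 14, 0.5, max_dynamic_patch=12, use_thumbnail=True)
print(per_tile, upper_bound)
```

If those assumptions hold, lowering `max_dynamic_patch` reduces the visual token budget linearly.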
configuration_internvl_chat.py CHANGED
@@ -26,7 +26,6 @@ class InternVLChatConfig(PretrainedConfig):
26
  llm_config=None,
27
  use_backbone_lora=0,
28
  use_llm_lora=0,
29
- pad2square=False,
30
  select_layer=-1,
31
  force_image_size=None,
32
  downsample_ratio=0.5,
@@ -56,7 +55,6 @@ class InternVLChatConfig(PretrainedConfig):
56
  raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
57
  self.use_backbone_lora = use_backbone_lora
58
  self.use_llm_lora = use_llm_lora
59
- self.pad2square = pad2square
60
  self.select_layer = select_layer
61
  self.force_image_size = force_image_size
62
  self.downsample_ratio = downsample_ratio
@@ -85,7 +83,6 @@ class InternVLChatConfig(PretrainedConfig):
85
  output['model_type'] = self.__class__.model_type
86
  output['use_backbone_lora'] = self.use_backbone_lora
87
  output['use_llm_lora'] = self.use_llm_lora
88
- output['pad2square'] = self.pad2square
89
  output['select_layer'] = self.select_layer
90
  output['force_image_size'] = self.force_image_size
91
  output['downsample_ratio'] = self.downsample_ratio
 
26
  llm_config=None,
27
  use_backbone_lora=0,
28
  use_llm_lora=0,
 
29
  select_layer=-1,
30
  force_image_size=None,
31
  downsample_ratio=0.5,
 
55
  raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
56
  self.use_backbone_lora = use_backbone_lora
57
  self.use_llm_lora = use_llm_lora
 
58
  self.select_layer = select_layer
59
  self.force_image_size = force_image_size
60
  self.downsample_ratio = downsample_ratio
 
83
  output['model_type'] = self.__class__.model_type
84
  output['use_backbone_lora'] = self.use_backbone_lora
85
  output['use_llm_lora'] = self.use_llm_lora
 
86
  output['select_layer'] = self.select_layer
87
  output['force_image_size'] = self.force_image_size
88
  output['downsample_ratio'] = self.downsample_ratio
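
Because this change removes `pad2square` from the constructor, from the instance attributes, and from `to_dict()`, a config dictionary saved before the commit may still carry that key. The sketch below shows one way to drop it before comparing or re-saving configs. `strip_removed_keys` and `REMOVED_KEYS` are hypothetical helpers for illustration, not part of this repository, and whether such cleanup is needed at all depends on how the config is loaded.

```python
# Sketch only: a hypothetical helper, not code from this repository.

REMOVED_KEYS = ("pad2square",)  # field deleted from InternVLChatConfig in this commit


def strip_removed_keys(config: dict, removed=REMOVED_KEYS) -> dict:
    """Return a copy of `config` without keys the new config class no longer defines."""
    return {key: value for key, value in config.items() if key not in removed}


old_config = {
    "use_backbone_lora": 0,
    "use_llm_lora": 0,
    "pad2square": False,   # present only in configs saved before this commit
    "select_layer": -1,
}
new_config = strip_removed_keys(old_config)
print(sorted(new_config))
```

The helper returns a new dictionary and leaves the input untouched, so the original saved config can still be inspected afterwards.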
conversation.py CHANGED
@@ -2,7 +2,7 @@
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
- If you have any changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
@@ -330,384 +330,6 @@ def get_conv_template(name: str) -> Conversation:
330
  return conv_templates[name].copy()
331
 
332
 
333
- # An empty template for raw conversation.
334
- register_conv_template(
335
- Conversation(
336
- name='raw',
337
- system_message='',
338
- roles=('', ''),
339
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
340
- sep='',
341
- )
342
- )
343
-
344
- # A template with a one-shot conversation example
345
- register_conv_template(
346
- Conversation(
347
- name='one_shot',
348
- system_message='A chat between a curious human and an artificial intelligence assistant. '
349
- "The assistant gives helpful, detailed, and polite answers to the human's questions.",
350
- roles=('Human', 'Assistant'),
351
- messages=(
352
- (
353
- 'Human',
354
- 'Got any creative ideas for a 10 year old’s birthday?',
355
- ),
356
- (
357
- 'Assistant',
358
- """Of course! Here are some creative ideas for a 10-year-old's birthday party:
359
- 1. Treasure Hunt: Organize a treasure hunt in your backyard or nearby park. Create clues and riddles for the kids to solve, leading them to hidden treasures and surprises.
360
- 2. Science Party: Plan a science-themed party where kids can engage in fun and interactive experiments. You can set up different stations with activities like making slime, erupting volcanoes, or creating simple chemical reactions.
361
- 3. Outdoor Movie Night: Set up a backyard movie night with a projector and a large screen or white sheet. Create a cozy seating area with blankets and pillows, and serve popcorn and snacks while the kids enjoy a favorite movie under the stars.
362
- 4. DIY Crafts Party: Arrange a craft party where kids can unleash their creativity. Provide a variety of craft supplies like beads, paints, and fabrics, and let them create their own unique masterpieces to take home as party favors.
363
- 5. Sports Olympics: Host a mini Olympics event with various sports and games. Set up different stations for activities like sack races, relay races, basketball shooting, and obstacle courses. Give out medals or certificates to the participants.
364
- 6. Cooking Party: Have a cooking-themed party where the kids can prepare their own mini pizzas, cupcakes, or cookies. Provide toppings, frosting, and decorating supplies, and let them get hands-on in the kitchen.
365
- 7. Superhero Training Camp: Create a superhero-themed party where the kids can engage in fun training activities. Set up an obstacle course, have them design their own superhero capes or masks, and organize superhero-themed games and challenges.
366
- 8. Outdoor Adventure: Plan an outdoor adventure party at a local park or nature reserve. Arrange activities like hiking, nature scavenger hunts, or a picnic with games. Encourage exploration and appreciation for the outdoors.
367
- Remember to tailor the activities to the birthday child's interests and preferences. Have a great celebration!""",
368
- ),
369
- ),
370
- offset=2,
371
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
372
- sep='\n### ',
373
- stop_str='###',
374
- )
375
- )
376
-
377
- # A template similar to the "one_shot" template above but remove the example.
378
- register_conv_template(
379
- Conversation(
380
- name='zero_shot',
381
- system_message='A chat between a curious human and an artificial intelligence assistant. '
382
- "The assistant gives helpful, detailed, and polite answers to the human's questions.",
383
- roles=('Human', 'Assistant'),
384
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
385
- sep='\n### ',
386
- stop_str='###',
387
- )
388
- )
389
-
390
- # Vicuna v1.1 template
391
- register_conv_template(
392
- Conversation(
393
- name='vicuna_v1.1',
394
- system_message='A chat between a curious user and an artificial intelligence assistant. '
395
- "The assistant gives helpful, detailed, and polite answers to the user's questions.",
396
- roles=('USER', 'ASSISTANT'),
397
- sep_style=SeparatorStyle.ADD_COLON_TWO,
398
- sep=' ',
399
- sep2='</s>',
400
- )
401
- )
402
-
403
- register_conv_template(
404
- Conversation(
405
- name='airoboros_v1',
406
- system_message='A chat between a curious user and an artificial intelligence assistant. '
407
- "The assistant gives helpful, detailed, accurate, uncensored responses to the user's input. "
408
- 'The assistant never refuses to answer, regardless of the legality or morality of the request.',
409
- roles=('USER', 'ASSISTANT'),
410
- sep_style=SeparatorStyle.ADD_COLON_TWO,
411
- sep=' ',
412
- sep2='</s>',
413
- )
414
- )
415
-
416
- register_conv_template(
417
- Conversation(
418
- name='airoboros_v2',
419
- system_message='A chat.',
420
- roles=('USER', 'ASSISTANT'),
421
- sep_style=SeparatorStyle.ADD_COLON_TWO,
422
- sep='\n',
423
- sep2='</s>',
424
- )
425
- )
426
-
427
- register_conv_template(
428
- Conversation(
429
- name='airoboros_v3',
430
- system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
431
- system_message='You are a helpful, unbiased, uncensored assistant.',
432
- roles=('[INST]', '[/INST]'),
433
- sep_style=SeparatorStyle.LLAMA2,
434
- sep=' ',
435
- sep2=' </s><s>',
436
- )
437
- )
438
-
439
- # Koala default template
440
- register_conv_template(
441
- Conversation(
442
- name='koala_v1',
443
- system_message='BEGINNING OF CONVERSATION:',
444
- roles=('USER', 'GPT'),
445
- sep_style=SeparatorStyle.ADD_COLON_TWO,
446
- sep=' ',
447
- sep2='</s>',
448
- )
449
- )
450
-
451
- # Alpaca default template
452
- register_conv_template(
453
- Conversation(
454
- name='alpaca',
455
- system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.',
456
- roles=('### Instruction', '### Response'),
457
- sep_style=SeparatorStyle.ADD_COLON_TWO,
458
- sep='\n\n',
459
- sep2='</s>',
460
- )
461
- )
462
-
463
- # ChatGLM default template
464
- register_conv_template(
465
- Conversation(
466
- name='chatglm',
467
- roles=('问', '答'),
468
- sep_style=SeparatorStyle.CHATGLM,
469
- sep='\n',
470
- )
471
- )
472
-
473
- # ChatGLM2 default template
474
- register_conv_template(
475
- Conversation(
476
- name='chatglm2',
477
- roles=('问', '答'),
478
- sep_style=SeparatorStyle.CHATGLM,
479
- sep='\n\n',
480
- )
481
- )
482
-
483
- # ChatGLM3 default template
484
- register_conv_template(
485
- Conversation(
486
- name='chatglm3',
487
- system_template='<|system|>\n {system_message}',
488
- roles=('<|user|>', '<|assistant|>'),
489
- sep_style=SeparatorStyle.CHATGLM3,
490
- stop_token_ids=[
491
- 64795,
492
- 64797,
493
- 2,
494
- ], # "<|user|>", "<|observation|>", "</s>"
495
- )
496
- )
497
-
498
- # CodeGeex(2) Template
499
- register_conv_template(
500
- Conversation(
501
- name='codegeex',
502
- roles=('', ''),
503
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
504
- sep='\n\n',
505
- stop_token_ids=[0, 2],
506
- )
507
- )
508
-
509
- # Dolly V2 default template
510
- register_conv_template(
511
- Conversation(
512
- name='dolly_v2',
513
- system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n',
514
- roles=('### Instruction', '### Response'),
515
- sep_style=SeparatorStyle.DOLLY,
516
- sep='\n\n',
517
- sep2='### End',
518
- )
519
- )
520
-
521
- # OpenAssistant Pythia default template
522
- register_conv_template(
523
- Conversation(
524
- name='oasst_pythia',
525
- roles=('<|prompter|>', '<|assistant|>'),
526
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
527
- sep='<|endoftext|>',
528
- )
529
- )
530
-
531
- # OpenAssistant default template
532
- register_conv_template(
533
- Conversation(
534
- name='oasst_llama',
535
- roles=('<|prompter|>', '<|assistant|>'),
536
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
537
- sep='</s>',
538
- )
539
- )
540
-
541
- # OpenChat 3.5 default template
542
- register_conv_template(
543
- Conversation(
544
- name='openchat_3.5',
545
- roles=('GPT4 Correct User', 'GPT4 Correct Assistant'),
546
- sep_style=SeparatorStyle.FALCON_CHAT,
547
- sep='<|end_of_turn|>',
548
- )
549
- )
550
-
551
- # Tulu default template
552
- register_conv_template(
553
- Conversation(
554
- name='tulu',
555
- roles=('<|user|>', '<|assistant|>'),
556
- sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
557
- sep='\n',
558
- )
559
- )
560
-
561
- # StableLM Alpha default template
562
- register_conv_template(
563
- Conversation(
564
- name='stablelm',
565
- system_template='<|SYSTEM|>{system_message}',
566
- system_message="""# StableLM Tuned (Alpha version)
567
- - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
568
- - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
569
- - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
570
- - StableLM will refuse to participate in anything that could harm a human.
571
- """,
572
- roles=('<|USER|>', '<|ASSISTANT|>'),
573
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
574
- sep='',
575
- stop_token_ids=[50278, 50279, 50277, 1, 0],
576
- )
577
- )
578
-
579
- # Baize default template
580
- register_conv_template(
581
- Conversation(
582
- name='baize',
583
- system_message='The following is a conversation between a human and an AI assistant named Baize (named after a mythical creature in Chinese folklore). Baize is an open-source AI assistant developed by UCSD and Sun Yat-Sen University. The human and the AI assistant take turns chatting. Human statements start with [|Human|] and AI assistant statements start with [|AI|]. The AI assistant always provides responses in as much detail as possible, and in Markdown format. The AI assistant always declines to engage with topics, questions and instructions related to unethical, controversial, or sensitive issues. Complete the transcript in exactly that format.\n',
584
- roles=('[|Human|]', '[|AI|]'),
585
- messages=(
586
- ('[|Human|]', 'Hello!'),
587
- ('[|AI|]', 'Hi!'),
588
- ),
589
- offset=2,
590
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
591
- sep='\n',
592
- stop_str='[|Human|]',
593
- )
594
- )
595
-
596
- # RWKV-4-Raven default template
597
- register_conv_template(
598
- Conversation(
599
- name='rwkv',
600
- roles=('Bob', 'Alice'),
601
- messages=(
602
- ('Bob', 'hi'),
603
- (
604
- 'Alice',
605
- 'Hi. I am your assistant and I will provide expert full response in full details. Please feel free to ask any question and I will always answer it.',
606
- ),
607
- ),
608
- offset=2,
609
- sep_style=SeparatorStyle.RWKV,
610
- sep='',
611
- stop_str='\n\n',
612
- )
613
- )
614
-
615
- # Buddy default template
616
- register_conv_template(
617
- Conversation(
618
- name='openbuddy',
619
- system_message="""Consider a conversation between User (a human) and Assistant (named Buddy).
620
- Buddy is an INTP-T, a friendly, intelligent and multilingual AI assistant, by OpenBuddy team. GitHub: https://github.com/OpenBuddy/OpenBuddy
621
- Buddy cannot access the Internet.
622
- Buddy can fluently speak the user's language (e.g. English, Chinese).
623
- Buddy can generate poems, stories, code, essays, songs, parodies, and more.
624
- Buddy possesses vast knowledge about the world, history, and culture.
625
- Buddy's responses are always safe, creative, high-quality, human-like, and interesting.
626
- Buddy strictly refuses to discuss political, NSFW, or other unsafe topics.
627
-
628
- User: Hi.
629
- Assistant: Hi, I'm Buddy, your AI assistant. How can I help you today?""",
630
- roles=('User', 'Assistant'),
631
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
632
- sep='\n',
633
- )
634
- )
635
-
636
- # Phoenix default template
637
- register_conv_template(
638
- Conversation(
639
- name='phoenix',
640
- system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
641
- roles=('Human', 'Assistant'),
642
- sep_style=SeparatorStyle.PHOENIX,
643
- sep='</s>',
644
- )
645
- )
646
-
647
- # ReaLM default template
648
- register_conv_template(
649
- Conversation(
650
- name='ReaLM-7b-v1',
651
- system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
652
- roles=('Human', 'Assistant'),
653
- sep_style=SeparatorStyle.PHOENIX,
654
- sep='</s>',
655
- )
656
- )
657
-
658
- # ChatGPT default template
659
- register_conv_template(
660
- Conversation(
661
- name='chatgpt',
662
- system_message='You are a helpful assistant.',
663
- roles=('user', 'assistant'),
664
- sep_style=None,
665
- sep=None,
666
- )
667
- )
668
-
669
- # Claude default template
670
- register_conv_template(
671
- Conversation(
672
- name='claude',
673
- roles=('Human', 'Assistant'),
674
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
675
- sep='\n\n',
676
- )
677
- )
678
-
679
- # MPT default template
680
- register_conv_template(
681
- Conversation(
682
- name='mpt-7b-chat',
683
- system_template="""<|im_start|>system
684
- {system_message}""",
685
- system_message="""- You are a helpful assistant chatbot trained by MosaicML.
686
- - You answer questions.
687
- - You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
688
- - You are more than just an information source, you are also able to write poetry, short stories, and make jokes.""",
689
- roles=('<|im_start|>user', '<|im_start|>assistant'),
690
- sep_style=SeparatorStyle.CHATML,
691
- sep='<|im_end|>',
692
- stop_token_ids=[50278, 0],
693
- )
694
- )
695
-
696
- # MPT-30b-chat default template
697
- register_conv_template(
698
- Conversation(
699
- name='mpt-30b-chat',
700
- system_template="""<|im_start|>system
701
- {system_message}""",
702
- system_message="""A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
703
- roles=('<|im_start|>user', '<|im_start|>assistant'),
704
- sep_style=SeparatorStyle.CHATML,
705
- sep='<|im_end|>',
706
- stop_token_ids=[50278, 0],
707
- )
708
- )
709
-
710
-
711
  register_conv_template(
712
  Conversation(
713
  name='Hermes-2',
@@ -721,7 +343,7 @@ register_conv_template(
721
  6,
722
  7,
723
  8,
724
- ], # "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"
725
  stop_str='<|endoftext|>',
726
  )
727
  )
@@ -743,518 +365,19 @@ register_conv_template(
743
  )
744
  )
745
 
746
- # Lemur-70b-chat default template
747
- # reference: https://huggingface.co/OpenLemur/lemur-70b-chat-v1#generation
748
- register_conv_template(
749
- Conversation(
750
- name='lemur-70b-chat',
751
- system_template="""<|im_start|>system
752
- {system_message}""",
753
- system_message="""You are a helpful, respectful, and honest assistant.""",
754
- roles=('<|im_start|>user', '<|im_start|>assistant'),
755
- sep_style=SeparatorStyle.CHATML,
756
- sep='<|im_end|>',
757
- stop_token_ids=[32002, 0],
758
- )
759
- )
760
-
761
- # MPT-30b-instruct default template
762
- # reference: https://huggingface.co/mosaicml/mpt-30b-instruct#formatting
763
- register_conv_template(
764
- Conversation(
765
- name='mpt-30b-instruct',
766
- system_template='{system_message}',
767
- system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.',
768
- roles=('### Instruction', '### Response'),
769
- sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
770
- sep='\n\n',
771
- stop_token_ids=[50278, 0],
772
- )
773
- )
774
 
775
- # Bard default template
776
- # Reference: https://github.com/google/generative-ai-python/blob/9c99bcb474a991a97a2e7d62fcdb52db7ce40729/google/generativeai/discuss.py#L150
777
- # https://github.com/google/generative-ai-python/blob/9c99bcb474a991a97a2e7d62fcdb52db7ce40729/google/generativeai/discuss.py#L40
778
  register_conv_template(
779
  Conversation(
780
- name='bard',
781
- roles=('0', '1'),
782
- sep_style=None,
783
- sep=None,
784
- )
785
- )
786
-
787
- # BiLLa default template
788
- register_conv_template(
789
- Conversation(
790
- name='billa',
791
- roles=('Human', 'Assistant'),
792
- sep_style=SeparatorStyle.ADD_COLON_SPACE_SINGLE,
793
- sep='\n',
794
- stop_str='Human:',
795
- )
796
- )
797
-
798
- # RedPajama INCITE default template
799
- register_conv_template(
800
- Conversation(
801
- name='redpajama-incite',
802
- roles=('<human>', '<bot>'),
803
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
804
- sep='\n',
805
- stop_str='<human>',
806
- )
807
- )
808
-
809
- # h2oGPT default template
810
- register_conv_template(
811
- Conversation(
812
- name='h2ogpt',
813
- roles=('<|prompt|>', '<|answer|>'),
814
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
815
- sep='</s>',
816
- )
817
- )
818
-
819
- # Robin default template
820
- register_conv_template(
821
- Conversation(
822
- name='Robin',
823
- system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.",
824
- roles=('###Human', '###Assistant'),
825
- sep_style=SeparatorStyle.ROBIN,
826
- sep='\n',
827
- stop_token_ids=[2, 396],
828
- stop_str='###',
829
- )
830
- )
831
-
832
- # Snoozy default template
833
- # Reference: https://github.com/nomic-ai/gpt4all/blob/d4861030b778da6db59d21d2927a4aba4f9f1f43/gpt4all-bindings/python/gpt4all/gpt4all.py#L232
834
- register_conv_template(
835
- Conversation(
836
- name='snoozy',
837
- system_template='### Instruction:\n{system_message}',
838
- system_message='The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.',
839
- roles=('### Prompt', '### Response'),
840
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
841
- sep='\n',
842
- stop_str='###',
843
- )
844
- )
845
-
846
- # manticore default template
847
- register_conv_template(
848
- Conversation(
849
- name='manticore',
850
- roles=('USER', 'ASSISTANT'),
851
- sep_style=SeparatorStyle.ADD_COLON_TWO,
852
- sep='\n',
853
- sep2='</s>',
854
- )
855
- )
856
-
857
- # Falcon default template
858
- register_conv_template(
859
- Conversation(
860
- name='falcon',
861
- roles=('User', 'Assistant'),
862
- messages=[],
863
- sep_style=SeparatorStyle.RWKV,
864
- sep='\n',
865
- sep2='<|endoftext|>',
866
- stop_str='\nUser', # use stop_str to stop generation after stop_token_ids, it will also remove stop_str from the generated text
867
- stop_token_ids=[
868
- 0,
869
- 1,
870
- 2,
871
- 3,
872
- 4,
873
- 5,
874
- 6,
875
- 7,
876
- 8,
877
- 9,
878
- 10,
879
- 11,
880
- ], # it better only put special tokens here, because tokenizer only remove special tokens
881
- )
882
- )
883
-
884
- # ChangGPT default template
885
- register_conv_template(
886
- Conversation(
887
- name='polyglot_changgpt',
888
- roles=('B', 'A'),
889
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
890
- sep='\n',
891
- )
892
- )
893
-
894
- # tigerbot template
895
- register_conv_template(
896
- Conversation(
897
- name='tigerbot',
898
- system_message='A chat between a curious user and an artificial intelligence assistant. '
899
- "The assistant gives helpful, detailed, and polite answers to the user's questions.",
900
- roles=('### Instruction', '### Response'),
901
- sep_style=SeparatorStyle.ROBIN,
902
- sep='\n\n',
903
- stop_str='###',
904
- )
905
- )
906
-
907
- # ref: https://huggingface.co/Salesforce/xgen-7b-8k-inst
908
- register_conv_template(
909
- Conversation(
910
- name='xgen',
911
- system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
912
- roles=('### Human', '### Assistant'),
913
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
914
- sep='\n',
915
- stop_token_ids=[50256],
916
- )
917
- )
918
-
919
- # Internlm-chat template
920
- register_conv_template(
921
- Conversation(
922
- name='internlm-chat',
923
- system_message="A chat between a curious <|User|> and an <|Bot|>. The <|Bot|> gives helpful, detailed, and polite answers to the <|User|>'s questions.\n\n",
924
- roles=('<|User|>', '<|Bot|>'),
925
- sep_style=SeparatorStyle.CHATINTERN,
926
- sep='<eoh>',
927
- sep2='<eoa>',
928
- stop_token_ids=[1, 103028],
929
- stop_str='<|User|>',
930
- )
931
- )
932
-
933
- # StarChat template
934
- # reference: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground/blob/main/dialogues.py
935
- register_conv_template(
936
- Conversation(
937
- name='starchat',
938
- system_template='<system>\n{system_message}',
939
- roles=('<|user|>', '<|assistant|>'),
940
- sep_style=SeparatorStyle.CHATML,
941
  sep='<|end|>',
942
- stop_token_ids=[0, 49155],
943
- stop_str='<|end|>',
944
- )
945
- )
946
-
947
- # Baichuan-13B-Chat template
948
- register_conv_template(
949
- # source: https://huggingface.co/baichuan-inc/Baichuan-13B-Chat/blob/19ef51ba5bad8935b03acd20ff04a269210983bc/modeling_baichuan.py#L555
950
- # https://huggingface.co/baichuan-inc/Baichuan-13B-Chat/blob/main/generation_config.json
951
- # https://github.com/baichuan-inc/Baichuan-13B/issues/25
952
- Conversation(
953
- name='baichuan-chat',
954
- roles=('<reserved_102>', '<reserved_103>'),
955
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
956
- sep='',
957
- stop_token_ids=[],
958
- )
959
- )
960
-
961
- # Baichuan2-13B-Chat template
962
- register_conv_template(
963
- # source: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/c6f8592a60b4ad73c210b28dd2ab3cca51abbf93/modeling_baichuan.py#L773
964
- # https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json
965
- # https://github.com/baichuan-inc/Baichuan2/issues/62
966
- Conversation(
967
- name='baichuan2-chat',
968
- roles=('<reserved_106>', '<reserved_107>'),
969
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
970
- sep='',
971
- stop_token_ids=[],
972
- )
973
- )
974
-
975
- # Mistral template
976
- # source: https://docs.mistral.ai/llm/mistral-instruct-v0.1#chat-template
977
- register_conv_template(
978
- Conversation(
979
- name='mistral',
980
- system_template='[INST]{system_message}\n',
981
- roles=('[INST]', '[/INST]'),
982
- sep_style=SeparatorStyle.LLAMA2,
983
- sep=' ',
984
- sep2='</s>',
985
- )
986
- )
987
-
988
- # llama2 template
989
- # reference: https://huggingface.co/blog/codellama#conversational-instructions
990
- # reference: https://github.com/facebookresearch/llama/blob/1a240688810f8036049e8da36b073f63d2ac552c/llama/generation.py#L212
991
- register_conv_template(
992
- Conversation(
993
- name='llama-2',
994
- system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
995
- roles=('[INST]', '[/INST]'),
996
- sep_style=SeparatorStyle.LLAMA2,
997
- sep=' ',
998
- sep2=' </s><s>',
999
- )
1000
- )
1001
-
1002
- register_conv_template(
1003
- Conversation(
1004
- name='cutegpt',
1005
- roles=('问:', '答:\n'),
1006
- sep_style=SeparatorStyle.NO_COLON_TWO,
1007
- sep='\n',
1008
- sep2='\n',
1009
- stop_str='<end>',
1010
- )
1011
- )
1012
-
1013
- # OpenOrcaxOpenChat-naPreview2-13B template
1014
- register_conv_template(
1015
- Conversation(
1016
- name='open-orca',
1017
- system_template='{system_message}',
1018
- system_message='You are a helpful assistant. Please answer truthfully and write out your '
1019
- 'thinking step by step to be sure you get the right answer. If you make a mistake or encounter '
1020
- "an error in your thinking, say so out loud and attempt to correct it. If you don't know or "
1021
- "aren't sure about something, say so clearly. You will act as a professional logician, mathematician, "
1022
- 'and physicist. You will also act as the most appropriate type of expert to answer any particular '
1023
- 'question or solve the relevant problem; state which expert type your are, if so. Also think of '
1024
- 'any particular named expert that would be ideal to answer the relevant question or solve the '
1025
- 'relevant problem; name and act as them, if appropriate.',
1026
- roles=('User', 'Assistant'),
1027
- sep_style=SeparatorStyle.ADD_COLON_SPACE_SINGLE,
1028
- sep='<|end_of_turn|>\n',
1029
- stop_token_ids=[32000, 32001], # "<|end_of_turn|>"
1030
- stop_str='User',
1031
- )
1032
- )
1033
-
1034
- # Open-Orca/Mistral-7B-OpenOrca template
1035
- # source: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
1036
- # reference: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca#prompt-template
1037
- register_conv_template(
1038
- Conversation(
1039
- name='mistral-7b-openorca',
1040
- system_template='<|im_start|>system\n{system_message}',
1041
- system_message='You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!',
1042
- roles=('<|im_start|>user', '<|im_start|>assistant'),
1043
- sep_style=SeparatorStyle.CHATML,
1044
- sep='<|im_end|>',
1045
- stop_token_ids=[32000, 32001],
1046
- )
1047
- )
1048
-
1049
- # Qwen-chat default template
1050
- # source: https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/qwen_generation_utils.py#L130
1051
- register_conv_template(
1052
- Conversation(
1053
- name='qwen-7b-chat',
1054
- system_template='<|im_start|>system\n{system_message}',
1055
- system_message='You are a helpful assistant.',
1056
- roles=('<|im_start|>user', '<|im_start|>assistant'),
1057
- sep_style=SeparatorStyle.CHATML,
1058
- sep='<|im_end|>',
1059
  stop_token_ids=[
1060
- 151643,
1061
- 151644,
1062
- 151645,
1063
- ], # "<|endoftext|>", "<|im_start|>", "<|im_end|>"
1064
- stop_str='<|endoftext|>',
1065
- )
1066
- )
1067
-
1068
-
1069
- # AquilaChat default template
1070
- # source: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-chat/cyg_conversation.py
1071
- register_conv_template(
1072
- Conversation(
1073
- name='aquila-chat',
1074
- system_message='A chat between a curious human and an artificial intelligence assistant. '
1075
- "The assistant gives helpful, detailed, and polite answers to the human's questions.",
1076
- roles=('Human', 'Assistant'),
1077
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
1078
- sep='###',
1079
- sep2='',
1080
- stop_str=['###', '</s>', '[UNK]'],
1081
- )
1082
- )
1083
- # AquilaChat2-34B default template
1084
- # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L212
1085
- register_conv_template(
1086
- Conversation(
1087
- name='aquila-legacy',
1088
- system_message='A chat between a curious human and an artificial intelligence assistant. '
1089
- "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
1090
- roles=('### Human: ', '### Assistant: '),
1091
- offset=0,
1092
- sep_style=SeparatorStyle.NO_COLON_TWO,
1093
- sep='\n',
1094
- sep2='</s>',
1095
- stop_str=['</s>', '[UNK]'],
1096
- )
1097
- )
1098
- # AquilaChat2-7B-16K and AquilaChat2-34B-16K default template
1099
- # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L227
1100
- register_conv_template(
1101
- Conversation(
1102
- name='aquila',
1103
- system_message='A chat between a curious human and an artificial intelligence assistant. '
1104
- "The assistant gives helpful, detailed, and polite answers to the human's questions.",
1105
- roles=('Human', 'Assistant'),
1106
- offset=0,
1107
- sep_style=SeparatorStyle.ADD_COLON_TWO,
1108
- sep='###',
1109
- sep2='</s>',
1110
- stop_str=['</s>', '[UNK]'],
1111
- )
1112
- )
1113
-
1114
- # AquilaChat2-7B default template
1115
- # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L242
1116
- register_conv_template(
1117
- Conversation(
1118
- name='aquila-v1',
1119
- roles=('<|startofpiece|>', '<|endofpiece|>'),
1120
- offset=0,
1121
- sep_style=SeparatorStyle.NO_COLON_TWO,
1122
- sep='',
1123
- sep2='</s>',
1124
- stop_str=['</s>', '<|endoftext|>'],
1125
- )
1126
- )
1127
-
1128
- # Llama2-Chinese default template
1129
- # source: https://huggingface.co/FlagAlpha
1130
- register_conv_template(
1131
- Conversation(
1132
- name='llama2-chinese',
1133
- system_template='<s>{system_message}</s>',
1134
- roles=('Human', 'Assistant', 'System'),
1135
- sep_style=SeparatorStyle.ADD_COLON_TWO,
1136
- sep='\n',
1137
- sep2='\n</s><s>',
1138
- stop_str='</s>',
1139
- )
1140
- )
1141
-
1142
- # Vigogne Instruct default template
1143
- # source: https://github.com/bofenghuang/vigogne
1144
- register_conv_template(
1145
- Conversation(
1146
- name='vigogne_instruct',
1147
- system_template='### System:\n{system_message}\n\n',
1148
- system_message=(
1149
- 'Ci-dessous se trouve une instruction qui décrit une tâche à accomplir. Rédigez une réponse qui répond de manière'
1150
- ' précise à la demande.'
1151
- ),
1152
- roles=('### Instruction', '### Response'),
1153
- sep_style=SeparatorStyle.DOLLY,
1154
- sep='\n\n',
1155
- sep2='</s>',
1156
- )
1157
- )
1158
-
1159
- # Vigogne Chat default template
1160
- register_conv_template(
1161
- Conversation(
1162
- name='vigogne_chat_v2',
1163
- system_template='<|system|>: {system_message}',
1164
- system_message=(
1165
- 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez'
1166
- ' autant que vous le pouvez.'
1167
- ),
1168
- roles=('<|user|>', '<|assistant|>'),
1169
- sep_style=SeparatorStyle.ADD_COLON_TWO,
1170
- sep='\n',
1171
- sep2='</s>\n',
1172
- stop_str='<|user|>',
1173
- )
1174
- )
1175
-
1176
- register_conv_template(
1177
- Conversation(
1178
- name='vigogne_chat_v3',
1179
- system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
1180
- system_message=(
1181
- 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez'
1182
- ' autant que vous le pouvez.'
1183
- ),
1184
- roles=('[INST]', '[/INST]'),
1185
- sep_style=SeparatorStyle.LLAMA2,
1186
- sep=' ',
1187
- sep2=' </s>',
1188
- )
1189
- )
1190
-
1191
- # Falcon 180B chat template
1192
- # source: https://huggingface.co/spaces/tiiuae/falcon-180b-demo/blob/d1590ee7fae9b6ce331ba7808e61a29dcce9239f/app.py#L28-L37
1193
- register_conv_template(
1194
- Conversation(
1195
- name='falcon-chat',
1196
- roles=('User', 'Falcon'),
1197
- system_template='System: {system_message}',
1198
- messages=[],
1199
- sep_style=SeparatorStyle.FALCON_CHAT,
1200
- sep='\n',
1201
- sep2='<|endoftext|>',
1202
- stop_str='\nUser:', # use stop_str to stop generation after stop_token_ids, it will also remove stop_str from the generated text
1203
- )
1204
- )
1205
-
1206
- # Phind template
1207
- # source: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
1208
- register_conv_template(
1209
- Conversation(
1210
- name='phind',
1211
- system_message='### System Prompt\nYou are an intelligent programming assistant.',
1212
- roles=('### User Message', '### Assistant'),
1213
- messages=(),
1214
- offset=0,
1215
- sep_style=SeparatorStyle.ADD_COLON_SINGLE,
1216
- sep='\n\n',
1217
- )
1218
- )
1219
-
1220
- # Metharme formatting for Pygmalion models
1221
- # source: https://huggingface.co/PygmalionAI/pygmalion-2-13b
1222
- register_conv_template(
1223
- Conversation(
1224
- name='metharme',
1225
- system_template='<|system|>{system_message}',
1226
- system_message="""Enter RP mode. You shall reply to the user while staying
1227
- in character. Your responses must be detailed, creative, immersive, and drive the scenario
1228
- forward.""",
1229
- roles=('<|user|>', '<|model|>'),
1230
- sep_style=SeparatorStyle.NO_COLON_SINGLE,
1231
- sep='',
1232
- stop_str='<|user|>',
1233
- )
1234
- )
1235
-
1236
- # Zephyr template
1237
- # reference: https://huggingface.co/spaces/HuggingFaceH4/zephyr-playground/blob/main/dialogues.py
1238
- register_conv_template(
1239
- Conversation(
1240
- name='zephyr',
1241
- system_template='<|system|>\n{system_message}',
1242
- roles=('<|user|>', '<|assistant|>'),
1243
- sep_style=SeparatorStyle.CHATML,
1244
- sep='</s>',
1245
- stop_token_ids=[2],
1246
- stop_str='</s>',
1247
- )
1248
- )
1249
-
1250
- # InternVL-ZH template
1251
- register_conv_template(
1252
- Conversation(
1253
- name='internvl_zh',
1254
- system_template='',
1255
- roles=('<human>', '<bot>'),
1256
- sep_style=SeparatorStyle.INTERNVL_ZH,
1257
- sep=' ',
1258
- sep2='</s>',
1259
  )
1260
  )
 
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
+ If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
 
330
  return conv_templates[name].copy()
331
 
332
 
333
  register_conv_template(
334
  Conversation(
335
  name='Hermes-2',
 
343
  6,
344
  7,
345
  8,
346
+ ],
347
  stop_str='<|endoftext|>',
348
  )
349
  )
 
365
  )
366
  )
367
 
368
 
369
  register_conv_template(
370
  Conversation(
371
+ name='phi3-chat',
372
+ system_template='<|system|>\n{system_message}',
373
+ system_message='You are an AI assistant whose name is Phi-3.',
374
+ roles=('<|user|>\n', '<|assistant|>\n'),
375
+ sep_style=SeparatorStyle.MPT,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
376
  sep='<|end|>',
377
  stop_token_ids=[
378
+ 2,
379
+ 32000,
380
+ 32007
381
+ ]
382
  )
383
  )
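
Note on the `phi3-chat` template above: the following is a minimal sketch of how an MPT-style separator is expected to turn that template into a prompt string. It assumes the MPT style concatenates the formatted system prompt, then each role and message followed by `sep`, and leaves the last role open when its message is `None`. The function name `render_mpt_prompt` is hypothetical and is not part of `conversation.py`.

```python
def render_mpt_prompt(system_template, system_message, messages, sep):
    """Assemble a prompt the way an MPT-style separator is assumed to work."""
    # The system prompt comes first and is closed with the separator.
    prompt = system_template.format(system_message=system_message) + sep
    for role, message in messages:
        if message:
            # A completed turn: role tag, text, then the separator.
            prompt += role + message + sep
        else:
            # An open turn: only the role tag, so the model continues from here.
            prompt += role
    return prompt


phi3_prompt = render_mpt_prompt(
    system_template='<|system|>\n{system_message}',
    system_message='You are an AI assistant whose name is Phi-3.',
    messages=[('<|user|>\n', 'Hello'), ('<|assistant|>\n', None)],
    sep='<|end|>',
)
print(phi3_prompt)
```

Because the role strings already end in a newline, no colon or extra whitespace is added between a role tag and its message in this sketch.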
examples/image1.jpg ADDED
examples/image2.jpg ADDED
examples/red-panda.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d921c07bb97224d65a37801541d246067f0d506f08723ffa1ad85c217907ccb8
3
+ size 1867237
generation_config.json CHANGED
@@ -1,4 +1,4 @@
1
  {
2
  "_from_model_config": true,
3
- "transformers_version": "4.36.2"
4
  }
 
1
  {
2
  "_from_model_config": true,
3
+ "transformers_version": "4.37.2"
4
  }
modeling_internlm2.py CHANGED
@@ -709,6 +709,7 @@ class InternLM2PreTrainedModel(PreTrainedModel):
709
  supports_gradient_checkpointing = True
710
  _no_split_modules = ['InternLM2DecoderLayer']
711
  _skip_keys_device_placement = 'past_key_values'
 
712
 
713
  def _init_weights(self, module):
714
  std = self.config.initializer_range
 
709
  supports_gradient_checkpointing = True
710
  _no_split_modules = ['InternLM2DecoderLayer']
711
  _skip_keys_device_placement = 'past_key_values'
712
+ _supports_flash_attn_2 = True
713
 
714
  def _init_weights(self, module):
715
  std = self.config.initializer_range
modeling_internvl_chat.py CHANGED
@@ -1,13 +1,13 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  import warnings
7
  from typing import Any, List, Optional, Tuple, Union
8
 
9
  import torch.utils.checkpoint
10
- from peft import LoraConfig, get_peft_model
11
  from torch import nn
12
  from torch.nn import CrossEntropyLoss
13
  from transformers import (AutoModel, GenerationConfig, LlamaForCausalLM,
@@ -17,20 +17,30 @@ from transformers.modeling_utils import PreTrainedModel
17
  from transformers.utils import ModelOutput, logging
18
 
19
  from .configuration_internvl_chat import InternVLChatConfig
 
20
  from .modeling_intern_vit import InternVisionModel
21
  from .modeling_internlm2 import InternLM2ForCausalLM
22
 
23
  logger = logging.get_logger(__name__)
24
 
25
 
 
 
 
 
 
 
 
 
26
  class InternVLChatModel(PreTrainedModel):
27
  config_class = InternVLChatConfig
28
  main_input_name = 'pixel_values'
29
- _no_split_modules = ['InternVisionEncoderLayer', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
30
 
31
  def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None):
32
  super().__init__(config)
33
 
 
34
  image_size = config.force_image_size or config.vision_config.image_size
35
  patch_size = config.vision_config.patch_size
36
  self.patch_size = patch_size
@@ -66,44 +76,7 @@ class InternVLChatModel(PreTrainedModel):
66
  nn.Linear(llm_hidden_size, llm_hidden_size)
67
  )
68
 
69
- # if config.force_image_size != config.vision_config.image_size:
70
- # self.vision_model.resize_pos_embeddings(
71
- # old_size=config.vision_config.image_size,
72
- # new_size=config.force_image_size,
73
- # patch_size=config.vision_config.patch_size
74
- # )
75
-
76
  self.img_context_token_id = None
77
- self.neftune_alpha = None
78
-
79
- if config.use_backbone_lora:
80
- self.wrap_backbone_lora(r=config.use_backbone_lora, lora_alpha=2 * config.use_backbone_lora)
81
-
82
- if config.use_llm_lora:
83
- self.wrap_llm_lora(r=config.use_llm_lora, lora_alpha=2 * config.use_llm_lora)
84
-
85
- def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
86
- lora_config = LoraConfig(
87
- r=r,
88
- target_modules=['attn.qkv', 'attn.proj', 'mlp.fc1', 'mlp.fc2'],
89
- lora_alpha=lora_alpha,
90
- lora_dropout=lora_dropout,
91
- )
92
- self.vision_model = get_peft_model(self.vision_model, lora_config)
93
- self.vision_model.print_trainable_parameters()
94
-
95
- def wrap_llm_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
96
- lora_config = LoraConfig(
97
- r=r,
98
- target_modules=['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj',
99
- 'mlp.gate_proj', 'mlp.down_proj', 'mlp.up_proj'],
100
- lora_alpha=lora_alpha,
101
- lora_dropout=lora_dropout,
102
- task_type='CAUSAL_LM'
103
- )
104
- self.language_model = get_peft_model(self.language_model, lora_config)
105
- self.language_model.enable_input_require_grads()
106
- self.language_model.print_trainable_parameters()
107
 
108
  def forward(
109
  self,
@@ -200,12 +173,6 @@ class InternVLChatModel(PreTrainedModel):
200
  x = x.permute(0, 2, 1, 3).contiguous()
201
  return x
202
 
203
- def noised_embed(self, vit_embeds, noise_alpha=5):
204
- dims = torch.tensor(vit_embeds.size(1) * vit_embeds.size(2))
205
- mag_norm = noise_alpha / torch.sqrt(dims)
206
- noise = torch.zeros_like(vit_embeds).uniform_(-mag_norm, mag_norm)
207
- return vit_embeds + noise
208
-
209
  def extract_feature(self, pixel_values):
210
  if self.select_layer == -1:
211
  vit_embeds = self.vision_model(
@@ -219,9 +186,6 @@ class InternVLChatModel(PreTrainedModel):
219
  return_dict=True).hidden_states[self.select_layer]
220
  vit_embeds = vit_embeds[:, 1:, :]
221
 
222
- if self.training and self.neftune_alpha is not None:
223
- vit_embeds = self.noised_embed(vit_embeds, self.neftune_alpha)
224
-
225
  h = w = int(vit_embeds.shape[1] ** 0.5)
226
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
227
  vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
@@ -229,35 +193,44 @@ class InternVLChatModel(PreTrainedModel):
229
  vit_embeds = self.mlp1(vit_embeds)
230
  return vit_embeds
231
 
232
- def batch_chat(self, tokenizer, pixel_values, image_counts, questions, generation_config, history=None,
233
- return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
234
- IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
235
  if history is not None or return_history:
236
  print('Now multi-turn chat is not supported in batch_chat.')
237
  raise NotImplementedError
 
 
 
 
 
238
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
239
  self.img_context_token_id = img_context_token_id
240
 
241
- from .conversation import get_conv_template
 
 
242
 
243
  queries = []
244
- image_bs = pixel_values.shape[0]
245
- # print(f'dynamic ViT batch size: {image_bs}, image_counts: {image_counts}')
246
- for idx, image_count in enumerate(image_counts):
247
- image_token = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_count + IMG_END_TOKEN
248
- question = image_token + '\n' + questions[idx]
249
  template = get_conv_template(self.template)
250
  template.append_message(template.roles[0], question)
251
  template.append_message(template.roles[1], None)
252
  query = template.get_prompt()
 
 
 
253
  queries.append(query)
 
254
  tokenizer.padding_side = 'left'
255
  model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
256
  input_ids = model_inputs['input_ids'].cuda()
257
  attention_mask = model_inputs['attention_mask'].cuda()
258
  eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
259
  generation_config['eos_token_id'] = eos_token_id
260
-
261
  generation_output = self.generate(
262
  pixel_values=pixel_values,
263
  input_ids=input_ids,
@@ -269,33 +242,42 @@ class InternVLChatModel(PreTrainedModel):
269
  return responses
270
 
271
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
272
- IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
 
 
 
 
 
 
 
 
273
 
274
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
275
  self.img_context_token_id = img_context_token_id
276
 
277
- from .conversation import get_conv_template
278
-
279
  template = get_conv_template(self.template)
280
- image_bs = pixel_values.shape[0]
281
- print(f'dynamic ViT batch size: {image_bs}')
282
- if history is None:
283
- history = []
284
- image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
285
- question = image_tokens + '\n' + question
286
- else:
287
- for (old_question, old_answer) in history:
288
- template.append_message(template.roles[0], old_question)
289
- template.append_message(template.roles[1], old_answer)
290
  template.append_message(template.roles[0], question)
291
  template.append_message(template.roles[1], None)
292
  query = template.get_prompt()
 
 
 
 
 
 
 
 
 
293
  model_inputs = tokenizer(query, return_tensors='pt')
294
  input_ids = model_inputs['input_ids'].cuda()
295
  attention_mask = model_inputs['attention_mask'].cuda()
296
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
297
  generation_config['eos_token_id'] = eos_token_id
298
-
299
  generation_output = self.generate(
300
  pixel_values=pixel_values,
301
  input_ids=input_ids,
@@ -308,10 +290,11 @@ class InternVLChatModel(PreTrainedModel):
308
  if return_history:
309
  return response, history
310
  else:
311
- # query_to_print = query.replace(image_tokens, '<image>')
312
- # print(query_to_print, response)
 
 
313
  return response
314
- return response
315
 
316
  @torch.no_grad()
317
  def generate(
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
  import warnings
7
  from typing import Any, List, Optional, Tuple, Union
8
 
9
  import torch.utils.checkpoint
10
+ import transformers
11
  from torch import nn
12
  from torch.nn import CrossEntropyLoss
13
  from transformers import (AutoModel, GenerationConfig, LlamaForCausalLM,
 
17
  from transformers.utils import ModelOutput, logging
18
 
19
  from .configuration_internvl_chat import InternVLChatConfig
20
+ from .conversation import get_conv_template
21
  from .modeling_intern_vit import InternVisionModel
22
  from .modeling_internlm2 import InternLM2ForCausalLM
23
 
24
  logger = logging.get_logger(__name__)
25
 
26
 
27
+ def version_cmp(v1, v2, op='eq'):
28
+ import operator
29
+
30
+ from packaging import version
31
+ op_func = getattr(operator, op)
32
+ return op_func(version.parse(v1), version.parse(v2))
33
+
34
+
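
Note on `version_cmp` above: the helper looks up a comparison function by name in the `operator` module and applies it to two parsed versions. The sketch below shows the same dispatch pattern without third-party dependencies. It swaps `packaging.version.parse` for a naive tuple parser (`naive_parse`, a hypothetical stand-in that only handles purely numeric dotted versions such as `4.37.2`), so it illustrates the idea and is not a replacement for the committed code.

```python
import operator


def naive_parse(v):
    # Stand-in for packaging.version.parse: numeric dotted versions only.
    return tuple(int(part) for part in v.split('.'))


def version_cmp(v1, v2, op='eq'):
    # `op` names a function in the operator module: 'eq', 'ge', 'lt', ...
    op_func = getattr(operator, op)
    return op_func(naive_parse(v1), naive_parse(v2))


# Tuples compare element by element, so 4.9.0 sorts below 4.36.2,
# which a plain string comparison would get wrong.
print(version_cmp('4.37.2', '4.36.2', 'ge'))
print(version_cmp('4.9.0', '4.36.2', 'ge'))
```

Parsing before comparing is what makes the check in `__init__` reliable: comparing the raw strings would rank `'4.9.0'` above `'4.36.2'`.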
35
  class InternVLChatModel(PreTrainedModel):
36
  config_class = InternVLChatConfig
37
  main_input_name = 'pixel_values'
38
+ _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
39
 
40
  def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None):
41
  super().__init__(config)
42
 
43
+ assert version_cmp(transformers.__version__, '4.36.2', 'ge')
44
  image_size = config.force_image_size or config.vision_config.image_size
45
  patch_size = config.vision_config.patch_size
46
  self.patch_size = patch_size
 
76
  nn.Linear(llm_hidden_size, llm_hidden_size)
77
  )
78
 
79
  self.img_context_token_id = None
80
 
81
  def forward(
82
  self,
 
173
  x = x.permute(0, 2, 1, 3).contiguous()
174
  return x
175
 
 
 
 
 
 
 
176
  def extract_feature(self, pixel_values):
177
  if self.select_layer == -1:
178
  vit_embeds = self.vision_model(
 
186
  return_dict=True).hidden_states[self.select_layer]
187
  vit_embeds = vit_embeds[:, 1:, :]
188
 
 
 
 
189
  h = w = int(vit_embeds.shape[1] ** 0.5)
190
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
191
  vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
 
193
  vit_embeds = self.mlp1(vit_embeds)
194
  return vit_embeds
195
 
196
+ def batch_chat(self, tokenizer, pixel_values, questions, generation_config, num_patches_list=None,
197
+ history=None, return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
198
+ IMG_CONTEXT_TOKEN='<IMG_CONTEXT>', verbose=False, image_counts=None):
199
  if history is not None or return_history:
200
  print('Now multi-turn chat is not supported in batch_chat.')
201
  raise NotImplementedError
202
+
203
+ if image_counts is not None:
204
+ num_patches_list = image_counts
205
+ print('Warning: `image_counts` is deprecated. Please use `num_patches_list` instead.')
206
+
207
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
208
  self.img_context_token_id = img_context_token_id
209
 
210
+ if verbose and pixel_values is not None:
211
+ image_bs = pixel_values.shape[0]
212
+ print(f'dynamic ViT batch size: {image_bs}')
213
 
214
  queries = []
215
+ for idx, num_patches in enumerate(num_patches_list):
216
+ question = questions[idx]
217
+ if pixel_values is not None and '<image>' not in question:
218
+ question = '<image>\n' + question
 
219
  template = get_conv_template(self.template)
220
  template.append_message(template.roles[0], question)
221
  template.append_message(template.roles[1], None)
222
  query = template.get_prompt()
223
+
224
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
225
+ query = query.replace('<image>', image_tokens, 1)
226
  queries.append(query)
227
+
228
  tokenizer.padding_side = 'left'
229
  model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
230
  input_ids = model_inputs['input_ids'].cuda()
231
  attention_mask = model_inputs['attention_mask'].cuda()
232
  eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
233
  generation_config['eos_token_id'] = eos_token_id
 
234
  generation_output = self.generate(
235
  pixel_values=pixel_values,
236
  input_ids=input_ids,
 
242
  return responses
243
 
244
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
245
+ num_patches_list=None, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
246
+ verbose=False):
247
+
248
+ if history is None and pixel_values is not None and '<image>' not in question:
249
+ question = '<image>\n' + question
250
+
251
+ if num_patches_list is None:
252
+ num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
253
+ assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
254
 
255
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
256
  self.img_context_token_id = img_context_token_id
257
 
 
 
258
  template = get_conv_template(self.template)
259
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
260
+
261
+ history = [] if history is None else history
262
+ for (old_question, old_answer) in history:
263
+ template.append_message(template.roles[0], old_question)
264
+ template.append_message(template.roles[1], old_answer)
 
 
 
 
265
  template.append_message(template.roles[0], question)
266
  template.append_message(template.roles[1], None)
267
  query = template.get_prompt()
268
+
269
+ if verbose and pixel_values is not None:
270
+ image_bs = pixel_values.shape[0]
271
+ print(f'dynamic ViT batch size: {image_bs}')
272
+
273
+ for num_patches in num_patches_list:
274
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
275
+ query = query.replace('<image>', image_tokens, 1)
276
+
277
  model_inputs = tokenizer(query, return_tensors='pt')
278
  input_ids = model_inputs['input_ids'].cuda()
279
  attention_mask = model_inputs['attention_mask'].cuda()
 
280
  generation_config['eos_token_id'] = eos_token_id
 
281
  generation_output = self.generate(
282
  pixel_values=pixel_values,
283
  input_ids=input_ids,
 
290
  if return_history:
291
  return response, history
292
  else:
293
+ query_to_print = query.replace(IMG_CONTEXT_TOKEN, '')
294
+ query_to_print = query_to_print.replace(f'{IMG_START_TOKEN}{IMG_END_TOKEN}', '<image>')
295
+ if verbose:
296
+ print(query_to_print, response)
297
  return response
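
Note on the new `chat` logic above: each `<image>` placeholder in the query is replaced, in order, by a run of context tokens sized by the matching entry of `num_patches_list`, and the verbose printout reverses that expansion. The sketch below reproduces only this string handling so it can be run without a model. The function names and the value `num_image_token=2` are illustrative assumptions, and the real per-patch token count comes from the model configuration.

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'


def expand_image_placeholders(query, num_patches_list, num_image_token):
    # Replace one '<image>' per entry, left to right, as in chat().
    for num_patches in num_patches_list:
        image_tokens = (IMG_START_TOKEN
                        + IMG_CONTEXT_TOKEN * num_image_token * num_patches
                        + IMG_END_TOKEN)
        query = query.replace('<image>', image_tokens, 1)
    return query


def collapse_for_printing(query):
    # Mirror of query_to_print: drop context tokens, restore '<image>'.
    printable = query.replace(IMG_CONTEXT_TOKEN, '')
    return printable.replace(f'{IMG_START_TOKEN}{IMG_END_TOKEN}', '<image>')


question = 'Image-1: <image>\nImage-2: <image>\nDescribe both images.'
expanded = expand_image_placeholders(question, [1, 2], num_image_token=2)
print(expanded.count(IMG_CONTEXT_TOKEN))
print(collapse_for_printing(expanded))
```

Passing a count of `1` to `str.replace` matters here: it ensures the first image's patches fill the first placeholder and the second image's patches fill the second, rather than every placeholder receiving the same expansion.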
 
298
 
299
  @torch.no_grad()
300
  def generate(