---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-o
- omni
- vision
- ocr
- multi-image
- video
- custom_code
---
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
[GitHub](https://github.com/OpenBMB/MiniCPM-V) | Online Demo [US](https://minicpm-omni-webdemo-us.modelbest.cn)/[CN](https://minicpm-omni-webdemo.modelbest.cn)
## MiniCPM-o 2.6
**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, voice cloning, role play, etc.
- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 **Strong OCR Capability and Others.**
Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.
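The token-density claim above can be checked with a quick back-of-the-envelope calculation; the helper function below is purely illustrative (the 1344x1344 resolution and 640-token count are the figures stated in this card):

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Token density: number of pixels encoded into each visual token."""
    return width * height / num_visual_tokens

# A 1.8M-pixel image (1344x1344) encoded into 640 visual tokens
print(round(token_density(1344, 1344, 640)))  # 2822, matching the evaluation table
```

This is how the Token Density column in the evaluation table below is computed: pixels at maximum resolution divided by the number of visual tokens.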
**Model Architecture.**
- **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
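As an illustrative sketch only (not the actual implementation), the TDM idea can be pictured as chopping the parallel modality streams into per-second slices and interleaving them into a single sequence for the LLM backbone; the `tdm_interleave` helper below is hypothetical:

```python
def tdm_interleave(video_frames: list, audio_chunks: list) -> list:
    """Hypothetical sketch of time-division multiplexing: interleave parallel
    per-time-slice modality streams into one sequence [v0, a0, v1, a1, ...]
    so the LLM backbone can process omni-modal input sequentially."""
    return [item for pair in zip(video_frames, audio_chunks) for item in pair]

print(tdm_interleave(['v0', 'v1'], ['a0', 'a1']))  # ['v0', 'a0', 'v1', 'a1']
```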
### Evaluation
**Image Understanding**
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini-1.5-Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL-2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
**Multi-image and Video Understanding**
| Model | Size | BLINK-val | Mantis-Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-One-Vision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL-2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
* We evaluate officially released checkpoints by ourselves.
**Audio Understanding**
Metrics: CER↓ for ASR (zh) on AISHELL-1, Fleurs zh, and WenetSpeech test-net; WER↓ for ASR (en) on LibriSpeech test-clean, GigaSpeech, and TED-LIUM; BLEU↑ for STT translation on CoVoST en2zh and CoVoST zh2en; ACC↑ for emotion recognition on MELD.

| Model | Size | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini-1.5-Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** | | | | | | | | | | |
| Qwen2-Audio | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-Instruction | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
* We evaluate officially released checkpoints by ourselves.
**Speech Generation**
SpeechQA task. Metrics: ACC↑ on Speech Llama Q., Speech Web Q., and Speech Trivia QA; G-Eval (10 point)↑ on Speech AlpacaEval; Semantic/Acoustic/Overall ELO score↑, UTMOS↑, and ASR-WER↓ on AudioArena.

| Model | Size | Speech Llama Q. | Speech Web Q. | Speech Trivia QA | Speech AlpacaEval | Semantic ELO↑ | Acoustic ELO↑ | Overall ELO↑ | UTMOS↑ | ASR-WER↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** | | | | | | | | | | |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
All results are from AudioEvals; the evaluation methods and further details can be found there.
**Voice Cloning**
| Model | Seed-TTS test-zh (SIMO↑) | Seed-TTS test-en (SIMO↑) |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
Note: in the Mimick task, the model takes an audio input and outputs both an ASR transcription and a voice imitation (TTS).
**Multimodal Live Streaming**: results on StreamingBench
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** | | | | | |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
### Examples
We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro without editing.
## Online Demo
Click here to try the online demo of **MiniCPM-o 2.6** on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn) server.
## Usage
Inference using Hugging Face transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
```
Pillow==10.1.0
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
decord
moviepy
```
### Model initialization
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the omni model by default; init_vision/init_audio/init_tts all default to True
# To load a vision-only model, set init_audio=False and init_tts=False
# To load an audio-only model, set init_vision=False
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa', # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)

model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Except in vision-only mode, the TTS processor and vocos vocoder also need to be initialized
model.init_tts()
model.tts.float()
```
### Omni mode
We provide two inference modes: chat and streaming.
#### Chat inference
```python
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # 1 frame + 1s audio chunk per unit
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["", image, audio])
        else:
            contents.append(["", image, audio])

    return contents

video_path = "/path/to/video"
# to use a voice-clone prompt, set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
# or use the default prompt
# sys_msg = model.get_sys_prompt(mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role": "user", "content": contents}
msgs = [sys_msg, msg]

# set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```
#### Streaming inference
```python
# a new conversation needs a session reset first; this clears the KV cache
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text
        audios.append(audio_wav)
        text += txt
    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```
### Audio-Only mode
#### Mimick
```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output.wav', # save the tts result to output_audio_path
)
```
#### General Speech Conversation with Configurable Voices
```python
ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True) # load the reference audio

# Audio RolePlay: in this mode, the model role-plays the character based on the audio prompt
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

# Audio Assistant: in this mode, the model speaks with the voice in ref_audio as an AI assistant
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
# user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # try to ask something!
```
```python
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
# round two (note: list.append returns None, so append in place rather than reassigning)
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```
#### Addressing various audio tasks
```python
'''
Audio Understanding Task Prompts:
Speech:
    ASR with ZH (same as AST en2zh): 请仔细听这段音频片段，并将其内容逐字记录。
    ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
    Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
General Audio:
    Audio Caption: Summarize the main content of the audio.
    Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
'''
task_prompt = "\n" # fill in one of the task prompts above
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```
```python
'''
Speech Generation Task Prompt:
    Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/
    Examples:
        # 在新闻中，一个年轻男性兴致勃勃地说："祝福亲爱的祖国母亲美丽富强！"他用低音调和低音量，慢慢地说出了这句话。
        # Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.
    Voice Cloning or Voice Creation: in this mode, the model acts like a TTS model.
'''
# Human Instruction-to-Speech:
task_prompt = '' # try writing a Human Instruction-to-Speech prompt yourself
msgs = [{'role': 'user', 'content': [task_prompt]}] # you can also reuse the same audio question

# Voice Cloning mode: in this mode, the model acts like a TTS model
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
# text_prompt = f"Please read the text below."
# user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # read the text with the voice in sys_prompt (voice cloning)
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # read 'xxx.wav' with the voice in sys_prompt (voice creation)
# msgs = [sys_prompt, user_question]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```
### Vision-Only mode
`MiniCPM-o-2_6` uses the same inference methods as `MiniCPM-V-2_6`.
#### Chat with single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## model.chat will then return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```
#### Chat with multiple images
```python
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
#### In-context few-shot learning
```python
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
#### Chat with video
```python
from decord import VideoReader, cpu # pip install decord

MAX_NUM_FRAMES = 64 # if cuda OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1) # sample at 1 FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-V) for more detail about usage.
## Inference with llama.cpp
MiniCPM-o 2.6 can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more detail.
## Int4 quantized version
Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4).
## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use.
#### Statement
* As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-o and MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
## Key Techniques and Other Multimodal Projects
👏 Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team:
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
## Citation
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
```bib
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
```