---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-o
- omni
- vision
- ocr
- multi-image
- video
- custom_code
---
A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
[GitHub](https://github.com/OpenBMB/MiniCPM-V) | Online Demo [US](https://minicpm-omni-webdemo-us.modelbest.cn)/[CN](https://minicpm-omni-webdemo.modelbest.cn)
## MiniCPM-o 2.6
**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- 🔥 **Leading Visual Capability.**
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, voice cloning, role play, etc.
- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 **Strong OCR Capability and Others.**
Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
- 🚀 **Superior Efficiency.**
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.
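The token-density claim above can be checked with a quick back-of-the-envelope calculation; the helper function below is purely illustrative (the 1344x1344 resolution and 640-token count are the figures stated in this card):

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Token density: number of pixels encoded into each visual token."""
    return width * height / num_visual_tokens

# A 1.8M-pixel image (1344x1344) encoded into 640 visual tokens
print(round(token_density(1344, 1344, 640)))  # 2822, matching the evaluation table
```

This is how the Token Density column in the evaluation table below is computed: pixels at maximum resolution divided by the number of visual tokens.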
**Model Architecture.**
- **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaming inputs/outputs.** (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including the traditional text system prompt and **a new audio system prompt that determines the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
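As an illustrative sketch only (not the actual implementation), the TDM idea can be pictured as chopping the parallel modality streams into per-second slices and interleaving them into a single sequence for the LLM backbone; the `tdm_interleave` helper below is hypothetical:

```python
def tdm_interleave(video_frames: list, audio_chunks: list) -> list:
    """Hypothetical sketch of time-division multiplexing: interleave parallel
    per-time-slice modality streams into one sequence [v0, a0, v1, a1, ...]
    so the LLM backbone can process omni-modal input sequentially."""
    return [item for pair in zip(video_frames, audio_chunks) for item in pair]

print(tdm_interleave(['v0', 'v1'], ['a0', 'a1']))  # ['v0', 'a0', 'v1', 'a1']
```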
### Evaluation
**Image Understanding**
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini-1.5-Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL-2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
**Multi-image and Video Understanding**
| Model | Size | BLINK-val | Mantis-Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-One-Vision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL-2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
* We evaluate officially released checkpoints by ourselves.
**Audio Understanding**
Metrics: CER↓ for ASR (zh) on AISHELL-1, Fleurs zh, and WenetSpeech test-net; WER↓ for ASR (en) on LibriSpeech test-clean, GigaSpeech, and TED-LIUM; BLEU↑ for STT translation on CoVoST en2zh and CoVoST zh2en; ACC↑ for emotion recognition on MELD.

| Model | Size | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini-1.5-Pro | - | 4.5* | 5.9* | 14.3* | 2.9* | 10.6* | 3.0* | 47.3* | 22.6* | 48.4* |
| **Open-Source** | | | | | | | | | | |
| Qwen2-Audio | 8B | - | 7.5 | - | 1.6 | - | - | 45.2 | 24.4 | 55.3 |
| Qwen2-Audio-Instruction | 8B | 2.6* | 6.9* | 10.3* | 3.1* | 9.7* | 5.9* | 39.5* | 22.9* | 17.4* |
| GLM-4-Voice-Base | 9B | 2.5 | - | - | 2.8 | - | - | - | - | - |
| MiniCPM-o 2.6 | 8B | 1.6 | 4.4 | 6.9 | 1.7 | 8.7 | 3.0 | 48.2 | 27.2 | 52.4 |
* We evaluate officially released checkpoints by ourselves.
**Speech Generation**
SpeechQA task. Metrics: ACC↑ on Speech Llama Q., Speech Web Q., and Speech Trivia QA; G-Eval (10 point)↑ on Speech AlpacaEval; Semantic/Acoustic/Overall ELO score↑, UTMOS↑, and ASR-WER↓ on AudioArena.

| Model | Size | Speech Llama Q. | Speech Web Q. | Speech Trivia QA | Speech AlpacaEval | Semantic ELO↑ | Acoustic ELO↑ | Overall ELO↑ | UTMOS↑ | ASR-WER↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 71.7 | 51.6 | 69.7 | 7.4 | 1157 | 1203 | 1200 | 4.2 | 2.3 |
| **Open-Source** | | | | | | | | | | |
| GLM-4-Voice | 9B | 50.0 | 32.0 | 36.4 | 5.1 | 999 | 1147 | 1035 | 4.1 | 11.7 |
| Llama-Omni | 8B | 45.3 | 22.9 | 10.7 | 3.9 | 960 | 878 | 897 | 3.2 | 24.3 |
| Moshi | 7B | 43.7 | 23.8 | 16.7 | 2.4 | 871 | 808 | 875 | 2.8 | 8.2 |
| Mini-Omni | 1B | 22.0 | 12.8 | 6.9 | 2.5 | 926 | 803 | 865 | 3.4 | 10.0 |
| MiniCPM-o 2.6 | 8B | 61.0 | 40.0 | 40.2 | 5.1 | 1088 | 1163 | 1131 | 4.2 | 9.8 |
All results are from AudioEvals; the evaluation methods and further details can be found there.
**Voice Cloning**
| Model | Seed-TTS test-zh (SIMO↑) | Seed-TTS test-en (SIMO↑) |
|---|---|---|
| F5-TTS | 76 | 67 |
| CosyVoice | 75 | 64 |
| FireRedTTS | 63 | 46 |
| MiniCPM-o 2.6 | 57 | 47 |
Note: in the Mimick task, the model takes an audio input and outputs both an ASR transcription and a voice imitation (TTS).
**Multimodal Live Streaming**: results on StreamingBench
| Model | Size | Real-Time Video Understanding | Omni-Source Understanding | Contextual Understanding | Overall |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| Gemini 1.5 Pro | - | 77.4 | 67.8 | 51.1 | 70.3 |
| GPT-4o | - | 74.5 | 51.0 | 48.0 | 64.1 |
| Claude-3.5-Sonnet | - | 74.0 | 41.4 | 37.8 | 59.7 |
| **Open-source** | | | | | |
| VILA-1.5 | 8B | 61.5 | 37.5 | 26.7 | 49.5 |
| LongVA | 7B | 63.1 | 35.9 | 30.2 | 50.7 |
| LLaVA-Next-Video-34B | 34B | 69.8 | 41.7 | 34.3 | 56.7 |
| Qwen2-VL-7B | 8B | 71.2 | 40.7 | 33.1 | 57.0 |
| InternVL2-8B | 8B | 70.1 | 42.7 | 34.1 | 57.0 |
| VITA-1.5 | 8B | 70.9 | 40.8 | 35.8 | 57.4 |
| LLaVA-OneVision-7B | 8B | 74.3 | 40.8 | 31.0 | 58.4 |
| InternLM-XC2.5-OL-7B | 8B | 75.4 | 46.2 | 33.6 | 60.8 |
| MiniCPM-V 2.6 | 8B | 72.4 | 40.2 | 33.4 | 57.7 |
| MiniCPM-o 2.6 | 8B | 79.9 | 53.4 | 38.5 | 66.0 |
### Examples
We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro without editing.
## Online Demo
Click here to try the online demo of **MiniCPM-o 2.6** on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn) server.
## Usage
Inference using Hugging Face transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
```
Pillow==10.1.0
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
decord
moviepy
```
### Model initialization
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the omni model by default; init_vision/init_audio/init_tts all default to True
# To load a vision-only model, set init_audio=False and init_tts=False
# To load an audio-only model, set init_vision=False
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa', # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)

model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

# Except in vision-only mode, the TTS processor and vocos vocoder also need to be initialized
model.init_tts()
model.tts.float()
```
### Omni mode
We provide two inference modes: chat and streaming.
#### Chat inference
```python
import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)

    # 1 frame + 1s audio chunk per unit
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["", image, audio])
        else:
            contents.append(["", image, audio])

    return contents

video_path = "/path/to/video"
# to use a voice-clone prompt, set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
# or use the default prompt
# sys_msg = model.get_sys_prompt(mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role": "user", "content": contents}
msgs = [sys_msg, msg]

# set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)
```
#### Streaming inference
```python
# a new conversation needs a session reset first; this clears the KV cache
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. prefill system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg],
    tokenizer=tokenizer
)

# 2. prefill video/audio chunks
for content in contents:
    msgs = [{"role": "user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs,
        tokenizer=tokenizer
    )

# 3. generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text
        audios.append(audio_wav)
        text += txt
    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)
```
### Audio-Only mode
#### Mimick
```python
mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output.wav', # save the tts result to output_audio_path
)
```
#### General Speech Conversation with Configurable Voices
```python
ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True) # load the reference audio

# Audio RolePlay: in this mode, the model role-plays the character based on the audio prompt
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

# Audio Assistant: in this mode, the model speaks with the voice in ref_audio as an AI assistant
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
# user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # try to ask something!
```
```python
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
# round two (note: list.append returns None, so append in place rather than reassigning)
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_round_2.wav',
)
print(res)
```
#### Addressing various audio tasks
```python
'''
Audio Understanding Task Prompts:
Speech:
    ASR with ZH (same as AST en2zh): 请仔细听这段音频片段，并将其内容逐字记录。
    ASR with EN (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
    Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
General Audio:
    Audio Caption: Summarize the main content of the audio.
    Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
'''
task_prompt = "\n" # fill in one of the task prompts above
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
print(res)
```
```python
'''
Speech Generation Task Prompt:
    Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/
    Examples:
        # 在新闻中，一个年轻男性兴致勃勃地说："祝福亲爱的祖国母亲美丽富强！"他用低音调和低音量，慢慢地说出了这句话。
        # Delighting in a surprised tone, an adult male with low pitch and low volume comments: "One even gave my little dog a biscuit." This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.
    Voice Cloning or Voice Creation: in this mode, the model acts like a TTS model.
'''
# Human Instruction-to-Speech:
task_prompt = '' # try writing a Human Instruction-to-Speech prompt yourself
msgs = [{'role': 'user', 'content': [task_prompt]}] # you can also reuse the same audio question

# Voice Cloning mode: in this mode, the model acts like a TTS model
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
# text_prompt = f"Please read the text below."
# user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # read the text with the voice in sys_prompt (voice cloning)
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # read 'xxx.wav' with the voice in sys_prompt (voice creation)
# msgs = [sys_prompt, user_question]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result.wav',
)
```
### Vision-Only mode
`MiniCPM-o-2_6` uses the same inference methods as `MiniCPM-V-2_6`.
#### Chat with single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## model.chat will then return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```
#### Chat with multiple images
```python
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
#### In-context few-shot learning
```python
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
#### Chat with video
```python
from decord import VideoReader, cpu # pip install decord

MAX_NUM_FRAMES = 64 # if cuda OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1) # sample at 1 FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-V) for more detail about usage.
## Inference with llama.cpp
MiniCPM-o 2.6 can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more detail.
## Int4 quantized version
Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4).
## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use.
#### Statement
* As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-o and MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
## Key Techniques and Other Multimodal Projects
👏 Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team:
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
## Citation
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
```bib
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
```