openbmb/MiniCPM-o-2_6

finalf0 committed 3 days ago

Commit

4ac69b4

1 Parent(s): e89b46b

update readme

Files changed (1)

README.md +14 -7

@@ -29,14 +29,14 @@ tags:
 ## MiniCPM-o 2.6
-**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
 - 🔥 **Leading Visual Capability.**
   MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
-- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
-- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding , and multimodal contextual understanding.
 - 💪 **Strong OCR Capability and Others.**
 Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
@@ -47,7 +47,7 @@ Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can pr
   In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
 -  💫  **Easy Usage.**
-MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.
 **Model Architecture.**
@@ -60,6 +60,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github
 <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" , width=80%>
 </div>
 ### Evaluation  <!-- omit in toc -->
 <div align="center">
@@ -562,7 +563,7 @@ Note: For proprietary models, we calculate token density based on the image enco
             <td colspan="11" align="left"><strong>Open-Source</strong></td>
         </tr>
         <tr>
-            <td nowrap="nowrap" align="left">Qwen2-Audio</td>
             <td>8B</td>
             <td>-</td>
             <td>7.5</td>
@@ -814,7 +815,7 @@ All results are from AudioEvals, and the evaluation methods along with further d
             <td><strong>70.3</strong></td>
         </tr>
         <tr>
-            <td nowrap="nowrap" align="left">GPT-4o</td>
             <td>-</td>
             <td>74.5</td>
             <td>51.0</td>
@@ -920,7 +921,13 @@ All results are from AudioEvals, and the evaluation methods along with further d
 ### Examples <!-- omit in toc -->
-We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw screen recording on a iPad Pro without edition.
 <div style="display: flex; flex-direction: column; align-items: center;">

 ## MiniCPM-o 2.6
+**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
 - 🔥 **Leading Visual Capability.**
   MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
+- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
+- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
 - 💪 **Strong OCR Capability and Others.**
 Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
   In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
 -  💫  **Easy Usage.**
+MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
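
The token density figure quoted in the feature list can be checked with a little arithmetic. The sketch below is not part of the README or the model's code; the function name `token_density` is a hypothetical helper, and the "baseline" token count is only what the "75% fewer" claim would imply, not a measured value for any specific model.

```python
def token_density(width: int, height: int, visual_tokens: int) -> float:
    """Return pixels encoded per visual token (hypothetical helper)."""
    return (width * height) / visual_tokens


# Figures taken from the README text: a 1344x1344 image and 640 visual tokens.
pixels = 1344 * 1344
density = token_density(1344, 1344, 640)

# Assumption: if 640 tokens is "75% fewer" than a typical model,
# that typical model would use 640 / (1 - 0.75) tokens for the same image.
baseline_tokens = round(640 / (1 - 0.75))

print(f"pixels: {pixels}")
print(f"pixels per visual token: {density:.1f}")
print(f"implied baseline token count: {baseline_tokens}")
```

Under these assumptions, the 1344x1344 example comes to roughly 1.8 million pixels, which matches the "1.8M pixel image" wording in the README.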
 **Model Architecture.**
 <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" , width=80%>
 </div>
 ### Evaluation  <!-- omit in toc -->
 <div align="center">
             <td colspan="11" align="left"><strong>Open-Source</strong></td>
         </tr>
         <tr>
+            <td nowrap="nowrap" align="left">Qwen2-Audio-Base</td>
             <td>8B</td>
             <td>-</td>
             <td>7.5</td>
             <td><strong>70.3</strong></td>
         </tr>
         <tr>
+            <td nowrap="nowrap" align="left">GPT-4o-202408</td>
             <td>-</td>
             <td>74.5</td>
             <td>51.0</td>
 ### Examples <!-- omit in toc -->
+We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recording on an iPad Pro and a Web demo.
+<div align="center">
+  <a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png", width=70%></a>
+</div>
+<br>
 <div style="display: flex; flex-direction: column; align-items: center;">