finalf0 committed on
Commit 4ac69b4 · 1 Parent(s): e89b46b

update readme

Files changed (1)
  1. README.md +14 -7
README.md CHANGED
@@ -29,14 +29,14 @@ tags:
29
 
30
  ## MiniCPM-o 2.6
31
 
32
- **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
33
 
34
  - 🔥 **Leading Visual Capability.**
35
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
36
 
37
- - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
38
 
39
- - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding , and multimodal contextual understanding.
40
 
41
  - 💪 **Strong OCR Capability and Others.**
42
  Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
@@ -47,7 +47,7 @@ Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can pr
47
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
48
 
49
  - 💫 **Easy Usage.**
50
- MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.
51
 
52
 
53
  **Model Architecture.**
@@ -60,6 +60,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github
60
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" width=80%>
61
  </div>
62
 
 
63
  ### Evaluation <!-- omit in toc -->
64
 
65
  <div align="center">
@@ -562,7 +563,7 @@ Note: For proprietary models, we calculate token density based on the image enco
562
  <td colspan="11" align="left"><strong>Open-Source</strong></td>
563
  </tr>
564
  <tr>
565
- <td nowrap="nowrap" align="left">Qwen2-Audio</td>
566
  <td>8B</td>
567
  <td>-</td>
568
  <td>7.5</td>
@@ -814,7 +815,7 @@ All results are from AudioEvals, and the evaluation methods along with further d
814
  <td><strong>70.3</strong></td>
815
  </tr>
816
  <tr>
817
- <td nowrap="nowrap" align="left">GPT-4o</td>
818
  <td>-</td>
819
  <td>74.5</td>
820
  <td>51.0</td>
@@ -920,7 +921,13 @@ All results are from AudioEvals, and the evaluation methods along with further d
920
 
921
  ### Examples <!-- omit in toc -->
922
 
923
- We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw screen recording on a iPad Pro without edition.
 
 
 
 
 
 
924
 
925
 
926
  <div style="display: flex; flex-direction: column; align-items: center;">
 
29
 
30
  ## MiniCPM-o 2.6
31
 
32
+ **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
33
 
34
  - 🔥 **Leading Visual Capability.**
35
  MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
36
 
37
+ - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
38
 
39
+ - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
40
 
41
  - 💪 **Strong OCR Capability and Others.**
42
  Advancing popular visual capabilities from MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
 
47
  In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
48
 
49
  - 💫 **Easy Usage.**
50
+ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
51
 
52
 
53
  **Model Architecture.**
 
60
  <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" width=80%>
61
  </div>
62
 
63
+
64
  ### Evaluation <!-- omit in toc -->
65
 
66
  <div align="center">
 
563
  <td colspan="11" align="left"><strong>Open-Source</strong></td>
564
  </tr>
565
  <tr>
566
+ <td nowrap="nowrap" align="left">Qwen2-Audio-Base</td>
567
  <td>8B</td>
568
  <td>-</td>
569
  <td>7.5</td>
 
815
  <td><strong>70.3</strong></td>
816
  </tr>
817
  <tr>
818
+ <td nowrap="nowrap" align="left">GPT-4o-202408</td>
819
  <td>-</td>
820
  <td>74.5</td>
821
  <td>51.0</td>
 
921
 
922
  ### Examples <!-- omit in toc -->
923
 
924
+ We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw-speed recording on an iPad Pro and a Web demo.
925
+
926
+ <div align="center">
927
+ <a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png" width=70%></a>
928
+ </div>
929
+
930
+ <br>
931
 
932
 
933
  <div style="display: flex; flex-direction: column; align-items: center;">
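
The README text in this diff defines token density as the number of pixels encoded into each visual token, and quotes 640 tokens for a 1.8M-pixel image (e.g., 1344x1344), "75% fewer than most models". The arithmetic behind those figures can be sketched as below. The function name `token_density` and the derived baseline token count are illustrative assumptions, not part of the MiniCPM-o codebase, and the baseline is only what the 75% figure implies, not a measurement of any specific model.

```python
def token_density(width: int, height: int, visual_tokens: int) -> float:
    """Return pixels encoded per visual token (hypothetical helper)."""
    if visual_tokens <= 0:
        raise ValueError("visual_tokens must be positive")
    return (width * height) / visual_tokens


# Figures quoted in the README: a 1344x1344 image encoded into 640 tokens.
pixels = 1344 * 1344
density = token_density(1344, 1344, 640)

# "75% fewer tokens" implies a comparison model would use 640 / (1 - 0.75)
# tokens for the same image. This is derived from the claim, not measured.
implied_baseline_tokens = round(640 / (1 - 0.75))
implied_baseline_density = token_density(1344, 1344, implied_baseline_tokens)

print(f"pixels: {pixels}")
print(f"token density: {density:.1f} pixels/token")
print(f"implied baseline: {implied_baseline_tokens} tokens, "
      f"{implied_baseline_density:.1f} pixels/token")
```

Under these assumptions, a 75% reduction in token count corresponds to a 4x higher token density for the same image.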