Kaguya-19 committed
Commit 95ba296
1 Parent(s): 3ab2b19
Files changed (1)
  1. README.md +24 -24
README.md CHANGED
@@ -11966,24 +11966,24 @@ tags:
  - sentence-transformers
  library_name: transformers
  ---
- ## UltraRAG-Embedding
+ ## MiniCPM-Embedding-Light

- **UltraRAG-Embedding** 是面壁智能与清华大学自然语言处理实验室(THUNLP)、东北大学信息检索小组(NEUIR)共同开发的中英双语言文本嵌入模型,有如下特点:
+ **MiniCPM-Embedding-Light** 是面壁智能与清华大学自然语言处理实验室(THUNLP)、东北大学信息检索小组(NEUIR)共同开发的中英双语言文本嵌入模型,有如下特点:
  - 出色的中文、英文检索能力。
  - 出色的中英跨语言检索能力。
  - 支持长文本(最长8192token)。
  - 提供稠密向量与token级别的稀疏向量。
  - 可变的稠密向量维度(套娃表征)。

- UltraRAG-Embedding结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 260M 条训练数据。
+ MiniCPM-Embedding-Light结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 260M 条训练数据。

  欢迎关注 UltraRAG 系列:

- - 检索模型:[UltraRAG-Embedding](https://huggingface.co/openbmb/UltraRAG-Embedding)
- - 重排模型:[UltraRAG-Reranker](https://huggingface.co/openbmb/UltraRAG-Reranker)
+ - 检索模型:[MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
+ - 重排模型:[MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
  - 领域自适应RAG框架:[UltraRAG](https://github.com/openbmb/UltraRAG)

- **UltraRAG-Embedding** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. , THUNLP and NEUIR , featuring:
+ **MiniCPM-Embedding-Light** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc., THUNLP and NEUIR, featuring:

  - Exceptional Chinese and English retrieval capabilities.
  - Outstanding cross-lingual retrieval capabilities between Chinese and English.
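The "可变的稠密向量维度(套娃表征)" / Matryoshka bullet in the hunk above means the dense vector can be cut down to a smaller dimension. A minimal sketch of that convention (truncate the leading dimensions, then L2-re-normalize); the helper name and the 512-dimension cut are illustrative assumptions, not values documented in this README:

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 512) -> np.ndarray:
    """Keep the first `dim` components of each dense vector and re-normalize to unit
    length -- the usual way Matryoshka-style embeddings are shortened (hypothetical helper)."""
    shortened = embeddings[:, :dim]
    norms = np.linalg.norm(shortened, axis=1, keepdims=True)
    return shortened / np.clip(norms, 1e-12, None)

# toy check: two random "full-size" vectors become two unit-norm 512-d vectors
full = np.random.randn(2, 1024).astype(np.float32)
small = truncate_matryoshka(full, dim=512)
print(small.shape, np.linalg.norm(small, axis=1))  # (2, 512) [1. 1.]
```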
@@ -11991,12 +11991,12 @@ UltraRAG-Embedding结构上采取双向注意力和 Weighted Mean Pooling [1]。
  - Dense vectors and token-level sparse vectors.
  - Variable dense vector dimensions (Matryoshka representation).

- UltraRAG-Embedding incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.
+ MiniCPM-Embedding-Light incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.

  We also invite you to explore the UltraRAG series:

- - Retrieval Model: [UltraRAG-Embedding](https://huggingface.co/openbmb/UltraRAG-Embedding)
- - Re-ranking Model: [UltraRAG-Reranker](https://huggingface.co/openbmb/UltraRAG-Reranker)
+ - Retrieval Model: [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
+ - Re-ranking Model: [MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
  - Domain Adaptive RAG Framework: [UltraRAG](https://github.com/openbmb/UltraRAG)

  [1] Muennighoff, N. (2022). SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
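On the Weighted Mean Pooling [1] mentioned in this hunk: SGPT-style pooling weights later token positions more heavily instead of averaging uniformly. A minimal sketch under that assumption (position-proportional weights over non-padding tokens), written against plain transformers outputs rather than this repository's own pooling code:

```python
import torch

def weighted_mean_pooling(last_hidden_state: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """SGPT-style weighted mean pooling: token i gets weight proportional to (i + 1),
    restricted to non-padding positions."""
    # position weights 1..L, broadcast over the batch, zeroed on padding
    weights = torch.arange(1, last_hidden_state.size(1) + 1,
                           device=last_hidden_state.device).float()
    weights = weights.unsqueeze(0) * attention_mask.float()           # (B, L)
    weights = weights / weights.sum(dim=1, keepdim=True)              # normalize per sequence
    return (last_hidden_state * weights.unsqueeze(-1)).sum(dim=1)     # (B, H)

# toy check: batch of 2 sequences, hidden size 8, second sequence padded after 3 tokens
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
print(weighted_mean_pooling(hidden, mask).shape)  # torch.Size([2, 8])
```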
@@ -12017,7 +12017,7 @@ We also invite you to explore the UltraRAG series:

  本模型支持 query 侧指令,格式如下:

- UltraRAG-Embedding supports query-side instructions in the following format:
+ MiniCPM-Embedding-Light supports query-side instructions in the following format:

  ```
  Instruction: {{ instruction }} Query: {{ query }}
@@ -12037,7 +12037,7 @@ Instruction: Given a claim about climate change, retrieve documents that support

  也可以不提供指令,即采取如下格式:

- UltraRAG-Embedding also works in instruction-free mode in the following format:
+ MiniCPM-Embedding-Light also works in instruction-free mode in the following format:

  ```
  Query: {{ query }}
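Tying the two query formats above together, a trivially small helper that renders the query-side template with or without an instruction; the function name and the sample instruction text are illustrative, not taken from the README:

```python
def build_query(query: str, instruction: str = "") -> str:
    """Render the query-side template: with an instruction it becomes
    'Instruction: {instruction} Query: {query}', otherwise just 'Query: {query}'."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"

# hypothetical usage; the instruction text is an arbitrary example
print(build_query("中国的首都是哪里?"))
print(build_query("what is the capital of China",
                  instruction="Given a web search query, retrieve passages that answer the query."))
```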
@@ -12056,7 +12056,7 @@ transformers==4.37.2
  from transformers import AutoModel
  import torch

- model_name = "OpenBMB/UltraRAG-Embedding"
+ model_name = "OpenBMB/MiniCPM-Embedding-Light"
  model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")

  # you can use flash_attention_2 for faster inference
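The comment above points at the standard transformers switch for FlashAttention. A sketch of what that load usually looks like, assuming the flash-attn package and a supported GPU are available; this mirrors the generic `attn_implementation` flag, not necessarily the exact line in the README:

```python
from transformers import AutoModel
import torch

model_name = "OpenBMB/MiniCPM-Embedding-Light"
# flash_attention_2 needs the flash-attn package and a recent CUDA GPU
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")
```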
@@ -12087,7 +12087,7 @@ import torch
  from sentence_transformers import SentenceTransformer


- model_name = "openbmb/UltraRAG-Embedding"
+ model_name = "openbmb/MiniCPM-Embedding-Light"
  model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})

  # you can use flash_attention_2 for faster inference
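Continuing the sentence-transformers fragment above, a short end-to-end sketch: the "Query: " prefix is prepended by hand and cosine scores come from sentence_transformers.util.cos_sim. The query/passage strings echo the example used elsewhere in this README; the rest is an assumption about how the snippet is typically completed, not a copy of it:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model_name = "openbmb/MiniCPM-Embedding-Light"
model = SentenceTransformer(model_name, trust_remote_code=True,
                            model_kwargs={"torch_dtype": torch.float16})

queries = ["Query: 中国的首都是哪里?"]   # query-side prefix added by hand
passages = ["beijing", "shanghai"]

q_emb = model.encode(queries, normalize_embeddings=True, convert_to_tensor=True)
p_emb = model.encode(passages, normalize_embeddings=True, convert_to_tensor=True)
print(util.cos_sim(q_emb, p_emb))  # 1 x 2 similarity matrix
```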
@@ -12113,7 +12113,7 @@ from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
  import numpy as np

  array = AsyncEngineArray.from_args([
- EngineArgs(model_name_or_path = "OpenBMB/UltraRAG-Embedding", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
+ EngineArgs(model_name_or_path = "OpenBMB/MiniCPM-Embedding-Light", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
  ])
  queries = ["中国的首都是哪里?"] # "What is the capital of China?"
  passages = ["beijing", "shanghai"] # "北京", "上海"
@@ -12139,7 +12139,7 @@ print(scores.tolist()) # [[0.40356746315956116, 0.36183443665504456]]
  from FlagEmbedding import FlagModel


- model = FlagModel("OpenBMB/UltraRAG-Embedding",
+ model = FlagModel("OpenBMB/MiniCPM-Embedding-Light",
  query_instruction_for_retrieval="Query: ",
  pooling_method="mean",
  trust_remote_code=True,
@@ -12185,9 +12185,9 @@ print(scores.tolist()) # [[0.40356746315956116, 0.36183440685272217]]
  | jina-embeddings-v3 | 68.60 | 53.88 |
  | gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
  | MiniCPM-Embedding | 76.76 | 58.56 |
- | UltraRAG-Embedding(Dense) | 72.71 | 55.27 |
- | UltraRAG-Embedding(Dense+Sparse) | 73.13 | 56.31 |
- | UltraRAG-Embedding(Dense+Sparse)+UltraRAG-Reranker | 76.34 | 61.49 |
+ | MiniCPM-Embedding-Light(Dense) | 72.71 | 55.27 |
+ | MiniCPM-Embedding-Light(Dense+Sparse) | 73.13 | 56.31 |
+ | MiniCPM-Embedding-Light(Dense+Sparse)+MiniCPM-Reranker-Light | 76.34 | 61.49 |


  ### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results
@@ -12198,15 +12198,15 @@ print(scores.tolist()) # [[0.40356746315956116, 0.36183440685272217]]
  | bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
  | gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
  | MiniCPM-Embedding | 72.95 | 52.65 | 49.95 |
- | UltraRAG-Embedding(Dense) | 68.29 | 41.17 | 45.83 |
- | UltraRAG-Embedding(Dense)+UltraRAG-Reranker | 71.86 | 54.32 | 56.50 |
+ | MiniCPM-Embedding-Light(Dense) | 68.29 | 41.17 | 45.83 |
+ | MiniCPM-Embedding-Light(Dense)+MiniCPM-Reranker-Light | 71.86 | 54.32 | 56.50 |

  ## 许可证 License

  - 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
- - UltraRAG-Embedding 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
- - UltraRAG-Embedding 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
+ - MiniCPM-Embedding-Light 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
+ - MiniCPM-Embedding-Light 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。

  * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
- * The usage of UltraRAG-Embedding model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
- * The models and weights of UltraRAG-Embedding are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, UltraRAG-Embedding weights are also available for free commercial use.
+ * The usage of MiniCPM-Embedding-Light model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
+ * The models and weights of MiniCPM-Embedding-Light are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-Embedding-Light weights are also available for free commercial use.
 