tags:
- sentence-transformers
library_name: transformers
---

## MiniCPM-Embedding-Light

**MiniCPM-Embedding-Light** 是面壁智能与清华大学自然语言处理实验室(THUNLP)、东北大学信息检索小组(NEUIR)共同开发的中英双语言文本嵌入模型,有如下特点:

- 出色的中文、英文检索能力。
- 出色的中英跨语言检索能力。
- 支持长文本(最长8192token)。
- 提供稠密向量与token级别的稀疏向量。
- 可变的稠密向量维度(套娃表征)。

MiniCPM-Embedding-Light 结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 260M 条训练数据。

欢迎关注 UltraRAG 系列:

- 检索模型:[MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
- 重排模型:[MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
- 领域自适应RAG框架:[UltraRAG](https://github.com/openbmb/UltraRAG)

**MiniCPM-Embedding-Light** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc., THUNLP, and NEUIR, featuring:

- Exceptional Chinese and English retrieval capabilities.
- Outstanding cross-lingual retrieval capabilities between Chinese and English.
- Long-text support (up to 8192 tokens).
- Dense vectors and token-level sparse vectors.
- Variable dense vector dimensions (Matryoshka representation; see the sketch below).
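
The Matryoshka property means a leading prefix of the dense vector is itself a usable lower-dimensional embedding. A minimal sketch of the usual recipe, truncate then re-normalize; the helper name and the 512-dimension cut are illustrative, not documented defaults of this model:

```
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int = 512) -> torch.Tensor:
    # emb: (batch, full_dim) dense embeddings from the model.
    # Keep the first `dim` components and L2-normalize so cosine scores stay comparable.
    return F.normalize(emb[:, :dim], p=2, dim=-1)
```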

MiniCPM-Embedding-Light incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.
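
For reference, a minimal sketch of Weighted Mean Pooling in the style of SGPT [1], where token states are averaged with weights proportional to position; this is an illustrative reimplementation, not the model's own code:

```
import torch

def weighted_mean_pooling(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden: (batch, seq_len, dim) token states; attention_mask: (batch, seq_len).
    # Later tokens get linearly larger weights (1, 2, ..., L), as in SGPT.
    weights = torch.arange(1, hidden.size(1) + 1, device=hidden.device, dtype=hidden.dtype)
    weights = weights * attention_mask              # zero out padding positions
    weights = weights / weights.sum(dim=1, keepdim=True)
    return (hidden * weights.unsqueeze(-1)).sum(dim=1)  # (batch, dim) sentence embedding
```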

We also invite you to explore the UltraRAG series:

- Retrieval Model: [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
- Re-ranking Model: [MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
- Domain Adaptive RAG Framework: [UltraRAG](https://github.com/openbmb/UltraRAG)

[1] Muennighoff, N. (2022). SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.

本模型支持 query 侧指令,格式如下:

MiniCPM-Embedding-Light supports query-side instructions in the following format:

```
Instruction: {{ instruction }} Query: {{ query }}
```

例如 For example: `Instruction: Given a claim about climate change, retrieve documents that support or refute the claim.`

也可以不提供指令,即采取如下格式:

MiniCPM-Embedding-Light also works in instruction-free mode in the following format:

```
Query: {{ query }}
```
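
To make the two formats above concrete, a small illustrative helper; the function name is ours, not part of the model's API:

```
def format_query(query: str, instruction: str = "") -> str:
    """Render a query in the model's expected format, with or without an instruction."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"

# format_query("What is the capital of China?") -> "Query: What is the capital of China?"
```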

Requirements: `transformers==4.37.2`

```
from transformers import AutoModel
import torch

model_name = "OpenBMB/MiniCPM-Embedding-Light"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")

# you can use flash_attention_2 for faster inference
```
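
Per the comment above, FlashAttention-2 can be requested through the standard `attn_implementation` argument of `from_pretrained` (available in transformers 4.37); this variant assumes the `flash-attn` package is installed:

```
from transformers import AutoModel
import torch

# Alternative load using FlashAttention-2 (requires the flash-attn package)
model = AutoModel.from_pretrained(
    "OpenBMB/MiniCPM-Embedding-Light",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")
```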

```
from sentence_transformers import SentenceTransformer
import torch

model_name = "openbmb/MiniCPM-Embedding-Light"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})

# you can use flash_attention_2 for faster inference
```
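
A hedged usage sketch continuing the snippet above with Sentence Transformers' standard `encode` API; the manual `Query: ` prefix follows the query format described earlier and is our assumption about how to apply it here:

```
queries = ["Query: 中国的首都是哪里?"]  # "What is the capital of China?"
passages = ["beijing", "shanghai"]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print((q_emb @ p_emb.T).tolist())  # cosine scores, since vectors are L2-normalized
```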

```
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import numpy as np

array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="OpenBMB/MiniCPM-Embedding-Light", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
])
queries = ["中国的首都是哪里?"]  # "What is the capital of China?"
passages = ["beijing", "shanghai"]  # "北京", "上海"
```
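
The diff elides the rest of this example, which ends by printing the scores; a hedged reconstruction following infinity_emb's documented async pattern, with exact signatures and the `array[0]` indexing treated as assumptions:

```
import asyncio

async def embed_all(engine: AsyncEmbeddingEngine):
    # Start the engine, embed both sides, and return numpy arrays.
    async with engine:
        q_emb, _usage = await engine.embed(sentences=queries)
        p_emb, _usage = await engine.embed(sentences=passages)
    return np.array(q_emb), np.array(p_emb)

q_emb, p_emb = asyncio.run(embed_all(array[0]))
print((q_emb @ p_emb.T).tolist())  # the original example prints [[0.40356746315956116, 0.36183443665504456]]
```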

```
from FlagEmbedding import FlagModel

model = FlagModel("OpenBMB/MiniCPM-Embedding-Light",
                  query_instruction_for_retrieval="Query: ",
                  pooling_method="mean",
                  trust_remote_code=True)
```
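
A hedged usage sketch following FlagEmbedding's standard query/passage pattern; the expected output is the score the original card reports for this example:

```
queries = ["中国的首都是哪里?"]  # "What is the capital of China?"
passages = ["beijing", "shanghai"]  # "北京", "上海"

q_emb = model.encode_queries(queries)  # "Query: " is prepended via query_instruction_for_retrieval
p_emb = model.encode(passages)
print((q_emb @ p_emb.T).tolist())  # [[0.40356746315956116, 0.36183440685272217]] per the original card
```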

### 中文与英文检索结果 CN/EN Retrieval Results

| 模型 Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
| --- | --- | --- |
| jina-embeddings-v3 | 68.60 | 53.88 |
| gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
| MiniCPM-Embedding | 76.76 | 58.56 |
| MiniCPM-Embedding-Light(Dense) | 72.71 | 55.27 |
| MiniCPM-Embedding-Light(Dense+Sparse) | 73.13 | 56.31 |
| MiniCPM-Embedding-Light(Dense+Sparse)+MiniCPM-Reranker-Light | 76.34 | 61.49 |

### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results

| 模型 Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
| --- | --- | --- | --- |
| bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
| gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
| MiniCPM-Embedding | 72.95 | 52.65 | 49.95 |
| MiniCPM-Embedding-Light(Dense) | 68.29 | 41.17 | 45.83 |
| MiniCPM-Embedding-Light(Dense)+MiniCPM-Reranker-Light | 71.86 | 54.32 | 56.50 |

## 许可证 License

- 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
- MiniCPM-Embedding-Light 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
- MiniCPM-Embedding-Light 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-Embedding-Light model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM-Embedding-Light are completely free for academic research. After filling out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-Embedding-Light weights are also available for free commercial use.