Kaguya-19 committed
Commit 95ba296
1 Parent(s): 3ab2b19
Files changed (1)
  1. README.md +24 -24
README.md CHANGED
@@ -11966,24 +11966,24 @@ tags:
  - sentence-transformers
  library_name: transformers
  ---
- ## UltraRAG-Embedding
+ ## MiniCPM-Embedding-Light

- **UltraRAG-Embedding** 是面壁智能与清华大学自然语言处理实验室(THUNLP)、东北大学信息检索小组(NEUIR)共同开发的中英双语言文本嵌入模型,有如下特点:
+ **MiniCPM-Embedding-Light** 是面壁智能与清华大学自然语言处理实验室(THUNLP)、东北大学信息检索小组(NEUIR)共同开发的中英双语言文本嵌入模型,有如下特点:
  - 出色的中文、英文检索能力。
  - 出色的中英跨语言检索能力。
  - 支持长文本(最长8192token)。
  - 提供稠密向量与token级别的稀疏向量。
  - 可变的稠密向量维度(套娃表征)。

- UltraRAG-Embedding结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 260M 条训练数据。
+ MiniCPM-Embedding-Light结构上采取双向注意力和 Weighted Mean Pooling [1]。采取多阶段训练方式,共使用包括开源数据、机造数据、闭源数据在内的约 260M 条训练数据。

  欢迎关注 UltraRAG 系列:

- - 检索模型:[UltraRAG-Embedding](https://huggingface.co/openbmb/UltraRAG-Embedding)
- - 重排模型:[UltraRAG-Reranker](https://huggingface.co/openbmb/UltraRAG-Reranker)
+ - 检索模型:[MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
+ - 重排模型:[MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
  - 领域自适应RAG框架:[UltraRAG](https://github.com/openbmb/UltraRAG)

- **UltraRAG-Embedding** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc. , THUNLP and NEUIR , featuring:
+ **MiniCPM-Embedding-Light** is a bilingual & cross-lingual text embedding model developed by ModelBest Inc., THUNLP and NEUIR, featuring:

  - Exceptional Chinese and English retrieval capabilities.
  - Outstanding cross-lingual retrieval capabilities between Chinese and English.
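The "可变的稠密向量维度(套娃表征)" / Matryoshka bullet in the hunk above means the dense vector can be cut down to a smaller dimension. A minimal sketch of that convention (truncate the leading dimensions, then L2-re-normalize); the helper name and the 512-dimension cut are illustrative assumptions, not values documented in this README:

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 512) -> np.ndarray:
    """Keep the first `dim` components of each dense vector and re-normalize to unit
    length -- the usual way Matryoshka-style embeddings are shortened (hypothetical helper)."""
    shortened = embeddings[:, :dim]
    norms = np.linalg.norm(shortened, axis=1, keepdims=True)
    return shortened / np.clip(norms, 1e-12, None)

# toy check: two random "full-size" vectors become two unit-norm 512-d vectors
full = np.random.randn(2, 1024).astype(np.float32)
small = truncate_matryoshka(full, dim=512)
print(small.shape, np.linalg.norm(small, axis=1))  # (2, 512) [1. 1.]
```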
@@ -11991,12 +11991,12 @@ UltraRAG-Embedding结构上采取双向注意力和 Weighted Mean Pooling [1]。
  - Dense vectors and token-level sparse vectors.
  - Variable dense vector dimensions (Matryoshka representation).

- UltraRAG-Embedding incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.
+ MiniCPM-Embedding-Light incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.

  We also invite you to explore the UltraRAG series:

- - Retrieval Model: [UltraRAG-Embedding](https://huggingface.co/openbmb/UltraRAG-Embedding)
- - Re-ranking Model: [UltraRAG-Reranker](https://huggingface.co/openbmb/UltraRAG-Reranker)
+ - Retrieval Model: [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
+ - Re-ranking Model: [MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
  - Domain Adaptive RAG Framework: [UltraRAG](https://github.com/openbmb/UltraRAG)

  [1] Muennighoff, N. (2022). SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
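On the Weighted Mean Pooling [1] mentioned in this hunk: SGPT-style pooling weights later token positions more heavily instead of averaging uniformly. A minimal sketch under that assumption (position-proportional weights over non-padding tokens), written against plain transformers outputs rather than this repository's own pooling code:

```python
import torch

def weighted_mean_pooling(last_hidden_state: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """SGPT-style weighted mean pooling: token i gets weight proportional to (i + 1),
    restricted to non-padding positions."""
    # position weights 1..L, broadcast over the batch, zeroed on padding
    weights = torch.arange(1, last_hidden_state.size(1) + 1,
                           device=last_hidden_state.device).float()
    weights = weights.unsqueeze(0) * attention_mask.float()           # (B, L)
    weights = weights / weights.sum(dim=1, keepdim=True)              # normalize per sequence
    return (last_hidden_state * weights.unsqueeze(-1)).sum(dim=1)     # (B, H)

# toy check: batch of 2 sequences, hidden size 8, second sequence padded after 3 tokens
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
print(weighted_mean_pooling(hidden, mask).shape)  # torch.Size([2, 8])
```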
@@ -12017,7 +12017,7 @@ We also invite you to explore the UltraRAG series:

  本模型支持 query 侧指令,格式如下:

- UltraRAG-Embedding supports query-side instructions in the following format:
+ MiniCPM-Embedding-Light supports query-side instructions in the following format:

  ```
  Instruction: {{ instruction }} Query: {{ query }}
@@ -12037,7 +12037,7 @@ Instruction: Given a claim about climate change, retrieve documents that support

  也可以不提供指令,即采取如下格式:

- UltraRAG-Embedding also works in instruction-free mode in the following format:
+ MiniCPM-Embedding-Light also works in instruction-free mode in the following format:

  ```
  Query: {{ query }}
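Tying the two query formats above together, a trivially small helper that renders the query-side template with or without an instruction; the function name and the sample instruction text are illustrative, not taken from the README:

```python
def build_query(query: str, instruction: str = "") -> str:
    """Render the query-side template: with an instruction it becomes
    'Instruction: {instruction} Query: {query}', otherwise just 'Query: {query}'."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"

# hypothetical usage; the instruction text is an arbitrary example
print(build_query("中国的首都是哪里?"))
print(build_query("what is the capital of China",
                  instruction="Given a web search query, retrieve passages that answer the query."))
```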
@@ -12056,7 +12056,7 @@ transformers==4.37.2
  from transformers import AutoModel
  import torch

- model_name = "OpenBMB/UltraRAG-Embedding"
+ model_name = "OpenBMB/MiniCPM-Embedding-Light"
  model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")

  # you can use flash_attention_2 for faster inference
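The comment above points at the standard transformers switch for FlashAttention. A sketch of what that load usually looks like, assuming the flash-attn package and a supported GPU are available; this mirrors the generic `attn_implementation` flag, not necessarily the exact line in the README:

```python
from transformers import AutoModel
import torch

model_name = "OpenBMB/MiniCPM-Embedding-Light"
# flash_attention_2 needs the flash-attn package and a recent CUDA GPU
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
).to("cuda")
```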
@@ -12087,7 +12087,7 @@ import torch
  from sentence_transformers import SentenceTransformer


- model_name = "openbmb/UltraRAG-Embedding"
+ model_name = "openbmb/MiniCPM-Embedding-Light"
  model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})

  # you can use flash_attention_2 for faster inference
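Continuing the sentence-transformers fragment above, a short end-to-end sketch: the "Query: " prefix is prepended by hand and cosine scores come from sentence_transformers.util.cos_sim. The query/passage strings echo the example used elsewhere in this README; the rest is an assumption about how the snippet is typically completed, not a copy of it:

```python
import torch
from sentence_transformers import SentenceTransformer, util

model_name = "openbmb/MiniCPM-Embedding-Light"
model = SentenceTransformer(model_name, trust_remote_code=True,
                            model_kwargs={"torch_dtype": torch.float16})

queries = ["Query: 中国的首都是哪里?"]   # query-side prefix added by hand
passages = ["beijing", "shanghai"]

q_emb = model.encode(queries, normalize_embeddings=True, convert_to_tensor=True)
p_emb = model.encode(passages, normalize_embeddings=True, convert_to_tensor=True)
print(util.cos_sim(q_emb, p_emb))  # 1 x 2 similarity matrix
```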
@@ -12113,7 +12113,7 @@ from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
  import numpy as np

  array = AsyncEngineArray.from_args([
- EngineArgs(model_name_or_path = "OpenBMB/UltraRAG-Embedding", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
+ EngineArgs(model_name_or_path = "OpenBMB/MiniCPM-Embedding-Light", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
  ])
  queries = ["中国的首都是哪里?"] # "What is the capital of China?"
  passages = ["beijing", "shanghai"] # "北京", "上海"
@@ -12139,7 +12139,7 @@ print(scores.tolist()) # [[0.40356746315956116, 0.36183443665504456]]
  from FlagEmbedding import FlagModel


- model = FlagModel("OpenBMB/UltraRAG-Embedding",
+ model = FlagModel("OpenBMB/MiniCPM-Embedding-Light",
  query_instruction_for_retrieval="Query: ",
  pooling_method="mean",
  trust_remote_code=True,
@@ -12185,9 +12185,9 @@ print(scores.tolist()) # [[0.40356746315956116, 0.36183440685272217]]
  | jina-embeddings-v3 | 68.60 | 53.88 |
  | gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
  | MiniCPM-Embedding | 76.76 | 58.56 |
- | UltraRAG-Embedding(Dense) | 72.71 | 55.27 |
- | UltraRAG-Embedding(Dense+Sparse) | 73.13 | 56.31 |
- | UltraRAG-Embedding(Dense+Sparse)+UltraRAG-Reranker | 76.34 | 61.49 |
+ | MiniCPM-Embedding-Light(Dense) | 72.71 | 55.27 |
+ | MiniCPM-Embedding-Light(Dense+Sparse) | 73.13 | 56.31 |
+ | MiniCPM-Embedding-Light(Dense+Sparse)+MiniCPM-Reranker-Light | 76.34 | 61.49 |


  ### 中英跨语言检索结果 CN-EN Cross-lingual Retrieval Results
@@ -12198,15 +12198,15 @@ print(scores.tolist()) # [[0.40356746315956116, 0.36183440685272217]]
  | bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
  | gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
  | MiniCPM-Embedding | 72.95 | 52.65 | 49.95 |
- | UltraRAG-Embedding(Dense) | 68.29 | 41.17 | 45.83 |
- | UltraRAG-Embedding(Dense)+UltraRAG-Reranker | 71.86 | 54.32 | 56.50 |
+ | MiniCPM-Embedding-Light(Dense) | 68.29 | 41.17 | 45.83 |
+ | MiniCPM-Embedding-Light(Dense)+MiniCPM-Reranker-Light | 71.86 | 54.32 | 56.50 |

  ## 许可证 License

  - 本仓库中代码依照 [Apache-2.0 协议](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE)开源。
- - UltraRAG-Embedding 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
- - UltraRAG-Embedding 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。
+ - MiniCPM-Embedding-Light 模型权重的使用则需要遵循 [MiniCPM 模型协议](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md)。
+ - MiniCPM-Embedding-Light 模型权重对学术研究完全开放。如需将模型用于商业用途,请填写[此问卷](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g)。

  * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
- * The usage of UltraRAG-Embedding model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
- * The models and weights of UltraRAG-Embedding are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, UltraRAG-Embedding weights are also available for free commercial use.
+ * The usage of MiniCPM-Embedding-Light model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
+ * The models and weights of MiniCPM-Embedding-Light are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-Embedding-Light weights are also available for free commercial use.
 