maidalun1020 committed
Commit 64a29ef · Parent(s): 24e934d

Update README.md

Files changed (1):
  1. README.md (+182, -13)
README.md CHANGED
 
2. RAG-optimized, adapted to more real-world business scenarios;
3. Supports reranking of long passages.

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/64745e955aba8edfb2ed561a/NyV_6ZrsaqUluUnxHKR_m.jpeg)

-----------------------------------------
<details open="open">
<summary>Click to Open Contents</summary>
 
- <a href="#-model-list" target="_Self">🍎 Model List</a>
- <a href="#-manual" target="_Self">📖 Manual</a>
  - <a href="#installation" target="_Self">Installation</a>
  - <a href="#quick-start" target="_Self">Quick Start (`transformers`, `sentence-transformers`)</a>
  - <a href="#integrations-for-rag-frameworks" target="_Self">Integrations for RAG Frameworks (`langchain`, `llama_index`)</a>
- <a href="#%EF%B8%8F-evaluation" target="_Self">⚙️ Evaluation</a>
  - <a href="#evaluate-semantic-representation-by-mteb" target="_Self">Evaluate Semantic Representation by MTEB</a>
  - <a href="#evaluate-rag-by-llamaindex" target="_Self">Evaluate RAG by LlamaIndex</a>
 
### Installation

First, create a conda environment and activate it.

```bash
conda create --name bce python=3.10 -y
conda activate bce
```

Then install `BCEmbedding` for a minimal installation:

```bash
pip install BCEmbedding==0.1.1
```

Or install from source:

```bash
git clone git@github.com:netease-youdao/BCEmbedding.git
cd BCEmbedding
pip install -v -e .
```
 
### Quick Start

#### 1. Based on `BCEmbedding`

Use `EmbeddingModel`; the `cls` [pooler](./BCEmbedding/models/embedding.py#L24) is the default.

```python
from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)
```

Use `RerankerModel` to calculate relevant scores and rerank:

```python
from BCEmbedding import RerankerModel

# your query and corresponding passages
query = 'input_query'
passages = ['passage_0', 'passage_1']

# construct sentence pairs
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# method 0: calculate scores of sentence pairs
scores = model.compute_score(sentence_pairs)

# method 1: rerank passages
rerank_results = model.rerank(query, passages)
```

NOTE:

- The [`RerankerModel.rerank`](./BCEmbedding/models/reranker.py#L137) method provides the advanced preprocessing that we use in production to construct `sentence_pairs` when the "passages" are very long; see the sketch below.
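
A minimal sketch of that long-passage path, assuming only the `rerank` call shown above (the exact structure of the returned results is not documented here, so we just print it for inspection):

```python
from BCEmbedding import RerankerModel

# hypothetical passages far longer than the reranker's 512-token window
query = 'input_query'
long_passages = ['passage_0 ' * 2000, 'passage_1 ' * 2000]

model = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")

# rerank() applies the production preprocessing to chunk and score long passages
rerank_results = model.rerank(query, long_passages)
print(rerank_results)  # inspect the returned passages/scores structure
```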

#### 2. Based on `transformers`

For `EmbeddingModel`:

```python
from transformers import AutoModel, AutoTokenizer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-embedding-base_v1')
model = AutoModel.from_pretrained('maidalun1020/bce-embedding-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# get embeddings
outputs = model(**inputs_on_device, return_dict=True)
embeddings = outputs.last_hidden_state[:, 0]  # cls pooler
embeddings = embeddings / embeddings.norm(dim=1, keepdim=True)  # normalize
```
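
Since the rows of `embeddings` above are unit-normalized, cosine similarity reduces to a matrix product; a short follow-up sketch (our addition, not part of the original example):

```python
# pairwise cosine similarity between all sentences; valid because rows are unit vectors
similarity = embeddings @ embeddings.T
print(similarity)
```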

For `RerankerModel`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# construct sentence pairs of query and passages
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

# init model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('maidalun1020/bce-reranker-base_v1')
model = AutoModelForSequenceClassification.from_pretrained('maidalun1020/bce-reranker-base_v1')

device = 'cuda'  # if no GPU, set "cpu"
model.to(device)

# get inputs
inputs = tokenizer(sentence_pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")
inputs_on_device = {k: v.to(device) for k, v in inputs.items()}

# calculate scores
scores = model(**inputs_on_device, return_dict=True).logits.view(-1,).float()
scores = torch.sigmoid(scores)
```

#### 3. Based on `sentence_transformers`

For `EmbeddingModel`:

```python
from sentence_transformers import SentenceTransformer

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
## NOTE: the hub model has been updated for sentence-transformers. Clean up
## "`SENTENCE_TRANSFORMERS_HOME`/maidalun1020_bce-embedding-base_v1" or
## "~/.cache/torch/sentence_transformers/maidalun1020_bce-embedding-base_v1"
## first, so that the new version is downloaded.
model = SentenceTransformer("maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences, normalize_embeddings=True)
```
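
As a quick check of these embeddings (our addition), `sentence_transformers.util.cos_sim` can be applied directly to the returned array:

```python
from sentence_transformers import util

# pairwise cosine similarities between the encoded sentences
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```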

For `RerankerModel`:

```python
from sentence_transformers import CrossEncoder

# construct sentence pairs of query and passages
query = 'input_query'
passages = ['passage_0', 'passage_1']
sentence_pairs = [[query, passage] for passage in passages]

# init reranker model
model = CrossEncoder('maidalun1020/bce-reranker-base_v1', max_length=512)

# calculate scores of sentence pairs
scores = model.predict(sentence_pairs)
```
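
To turn the raw scores into a reranked list, plain Python suffices; a small follow-up (our addition):

```python
# pair each passage with its score and sort by descending relevance
reranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f'{score:.4f}\t{passage}')
```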

### Integrations for RAG Frameworks

#### 1. Used in `langchain`

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

query = 'apples'
passages = [
    'I like apples',
    'I like oranges',
    'Apples and oranges are fruits'
]

# init embedding model
model_name = 'maidalun1020/bce-embedding-base_v1'
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'batch_size': 64, 'normalize_embeddings': True, 'show_progress_bar': False}

embed_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# example #1. extract embeddings
query_embedding = embed_model.embed_query(query)
passages_embeddings = embed_model.embed_documents(passages)

# example #2. langchain retriever example
# max inner product equals cosine similarity here, because embeddings are normalized
faiss_vectorstore = FAISS.from_texts(passages, embed_model, distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT)

retriever = faiss_vectorstore.as_retriever(search_type="similarity", search_kwargs={"score_threshold": 0.5, "k": 3})

related_passages = retriever.get_relevant_documents(query)
```
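
The retriever above can also feed the `BCEmbedding` reranker for two-stage retrieval; a minimal sketch of that wiring (our combination, not an official `langchain` integration):

```python
from BCEmbedding import RerankerModel

# stage 2: rerank the passages returned by the FAISS retriever above
reranker = RerankerModel(model_name_or_path="maidalun1020/bce-reranker-base_v1")
candidates = [doc.page_content for doc in related_passages]
rerank_results = reranker.rerank(query, candidates)
```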

#### 2. Used in `llama_index`

```python
import os

from llama_index.embeddings import HuggingFaceEmbedding
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser
from llama_index.llms import OpenAI

query = 'apples'
passages = [
    'I like apples',
    'I like oranges',
    'Apples and oranges are fruits'
]

# init embedding model
model_args = {'model_name': 'maidalun1020/bce-embedding-base_v1', 'max_length': 512, 'embed_batch_size': 64, 'device': 'cuda'}
embed_model = HuggingFaceEmbedding(**model_args)

# example #1. extract embeddings
query_embedding = embed_model.get_query_embedding(query)
passages_embeddings = embed_model.get_text_embedding_batch(passages)

# example #2. rag example
llm = OpenAI(model='gpt-3.5-turbo-0613', api_key=os.environ.get('OPENAI_API_KEY'), api_base=os.environ.get('OPENAI_BASE_URL'))
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

documents = SimpleDirectoryReader(input_files=["BCEmbedding/tools/eval_rag/eval_pdfs/Comp_en_llama2.pdf"]).load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents[0:36])
index = VectorStoreIndex(nodes, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What is llama?")
```

## ⚙️ Evaluation

### Evaluate Semantic Representation by MTEB

We provide evaluation tools for `embedding` and `reranker` models, based on [MTEB](https://github.com/embeddings-benchmark/mteb).
 
#### 1. Embedding Models

Just run the following command to evaluate `your_embedding_model` (e.g. `maidalun1020/bce-embedding-base_v1`) in **bilingual and crosslingual settings** (e.g. `["en", "zh", "en-zh", "zh-en"]`):

运行下面命令评测`your_embedding_model`(比如,`maidalun1020/bce-embedding-base_v1`)。评测任务将会在**双语和跨语种**(比如,`["en", "zh", "en-zh", "zh-en"]`)模式下评测:

```bash
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path maidalun1020/bce-embedding-base_v1 --pooler cls
```
 
The total evaluation tasks contain ***114 datasets*** of **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"**.

评测包含 **"Retrieval", "STS", "PairClassification", "Classification", "Reranking"和"Clustering"** 这六大类任务的 ***114个数据集***。

***NOTE:***
- **All models are evaluated with their recommended pooling method (`pooler`)**.
  - `mean` pooler: "jina-embeddings-v2-base-en", "m3e-base", "m3e-large", "e5-large-v2", "multilingual-e5-base", "multilingual-e5-large" and "gte-large".
  - `cls` pooler: other models.
- The "jina-embeddings-v2-base-en" model should be loaded with `trust_remote_code`.

```bash
python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path {moka-ai/m3e-base | moka-ai/m3e-large} --pooler mean

python BCEmbedding/tools/eval_mteb/eval_embedding_mteb.py --model_name_or_path jinaai/jina-embeddings-v2-base-en --pooler mean --trust_remote_code
```

***注意:***
- 所有模型的评测采用各自推荐的`pooler`。"jina-embeddings-v2-base-en"、"m3e-base"、"m3e-large"、"e5-large-v2"、"multilingual-e5-base"、"multilingual-e5-large"和"gte-large"的`pooler`采用`mean`,其他模型的`pooler`采用`cls`。
- "jina-embeddings-v2-base-en"模型在载入时需要`trust_remote_code`。

#### 2. Reranker Models

Run the following command to evaluate `your_reranker_model` (e.g. `maidalun1020/bce-reranker-base_v1`) in **bilingual and crosslingual settings** (e.g. `["en", "zh", "en-zh", "zh-en"]`):

运行下面命令评测`your_reranker_model`(比如,`maidalun1020/bce-reranker-base_v1`)。评测任务将会在**双语种和跨语种**(比如,`["en", "zh", "en-zh", "zh-en"]`)模式下评测:

```bash
python BCEmbedding/tools/eval_mteb/eval_reranker_mteb.py --model_name_or_path maidalun1020/bce-reranker-base_v1
```
 
| Model | Dimensions | Pooler | Instructions | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | AVG |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| ***bce-embedding-base_v1*** | 768 | `cls` | Free | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |

***NOTE:***
- Our ***bce-embedding-base_v1*** outperforms other open-source embedding models of comparable model size.
- ***114 datasets*** of **"Retrieval", "STS", "PairClassification", "Classification", "Reranking" and "Clustering"** in the `["en", "zh", "en-zh", "zh-en"]` setting.
- The [crosslingual evaluation datasets](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/evaluation/c_mteb/Retrieval.py) we released belong to the `Retrieval` task.
- For more evaluation details, please check the [Embedding Models Evaluation Summary](https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/EvaluationSummary/embedding_eval_summary.md).

***要点:***
- 对比其他开源的相同规模的embedding模型,***bce-embedding-base_v1*** 表现最好,效果比最好的large模型稍差。
- 评测包含 **"Retrieval", "STS", "PairClassification", "Classification", "Reranking"和"Clustering"** 这六大类任务的共 ***114个数据集***。
- 我们开源的[跨语种语义表征评测数据](https://github.com/netease-youdao/BCEmbedding/blob/master/BCEmbedding/evaluation/c_mteb/Retrieval.py)属于`Retrieval`任务。
- 更详细的评测结果详见[Embedding模型指标汇总](https://github.com/netease-youdao/BCEmbedding/blob/master/Docs/EvaluationSummary/embedding_eval_summary.md)。