Files changed (1)
  1. README.md +46 -25
README.md CHANGED
@@ -1,10 +1,18 @@
1
- ---
2
- license: mit
3
- ---
4
-
5
-
6
  <h1 align="center">FlagEmbedding</h1>
7
-
8
 
9
  <h4 align="center">
10
  <p>
@@ -19,18 +27,18 @@ license: mit
19
  <p>
20
  </h4>
21
 
22
- More details please refer to our Github: [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding).
23
-
24
 
25
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
26
 
27
- FlagEmbedding can map any text to a low-dimensional dense vector which can be used for tasks like retrieval, classification, clustering, or semantic search.
28
- And it also can be used in vector databases for LLMs.
 
 
29
 
30
  ************* 🌟**Updates**🌟 *************
31
- - 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf) :fire:
32
  - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
33
- - 09/15/2023: The [masive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
34
  - 09/12/2023: New models:
35
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
36
  - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
@@ -72,29 +80,27 @@ And it also can be used in vector databases for LLMs.
72
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
73
 
74
 
75
- [1\]: If you need to search the relevant passages to a query, we suggest to add the instruction to the query; in other cases, no instruction is needed, just use the original query directly. In all cases, **no instruction** needs to be added to passages.
76
 
77
- [2\]: Different from embedding model, reranker uses question and document as input and directly output similarity instead of embedding. To balance the accuracy and time cost, cross-encoder is widely used to re-rank top-k documents retrieved by other simple models.
78
- For examples, use bge embedding model to retrieve top 100 relevant documents, and then use bge reranker to re-rank the top 100 document to get the final top-3 results.
79
 
80
  All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
81
- If you cannot open the Huggingface Hub, you also can download the models at https://model.baai.ac.cn/models .
82
 
83
 
84
  ## Frequently asked questions
85
 
86
- <details>
87
- <summary>1. How to fine-tune bge embedding model?</summary>
88
 
89
- <!-- ### How to fine-tune bge embedding model? -->
90
  Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
91
  Some suggestions:
92
  - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
 
93
  - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
94
  - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives are also needed to fine-tune the reranker.
95
 
96
-
97
- </details>
98
 
99
  <details>
100
  <summary>2. The similarity score between two dissimilar sentences is higher than 0.5</summary>
@@ -134,7 +140,7 @@ In all cases, the documents/passages do not need to add the instruction.
134
 
135
  ### Usage for Embedding Model
136
 
137
- Here are some examples for using `bge` models with
138
  [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
139
 
140
  #### Using FlagEmbedding
@@ -366,11 +372,11 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
366
 
367
  ### BAAI Embedding
368
 
369
- We pre-train the models using [retromae](https://github.com/staoxiao/RetroMAE) and train them on large-scale pairs data using contrastive learning.
370
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
371
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
372
  Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.
373
- More training details for bge see [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
374
 
375
 
376
 
@@ -381,8 +387,14 @@ which is more accurate than embedding model (i.e., bi-encoder) but more time-con
381
  Therefore, it can be used to re-rank the top-k documents returned by embedding model.
382
  We train the cross-encoder on multilingual pair data.
383
  The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
384
- More details please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker)
 
385
 
386
 
387
  ## Contact
388
  If you have any questions or suggestions related to this project, feel free to open an issue or pull request.
@@ -402,6 +414,15 @@ If you find this repository useful, please consider giving a star :star: and cit
402
  archivePrefix={arXiv},
403
  primaryClass={cs.CL}
404
  }
405
  ```
406
 
407
  ## License
 
 
 
 
 
 
1
  <h1 align="center">FlagEmbedding</h1>
2
+ <p align="center">
3
+ <a href="https://github.com/FlagOpen/FlagEmbedding">
4
+ <img alt="Build" src="https://img.shields.io/badge/Contribution-Welcome-blue">
5
+ </a>
6
+ <a href="https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE">
7
+ <img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green">
8
+ </a>
9
+ <a href="https://huggingface.co/C-MTEB">
10
+ <img alt="Build" src="https://img.shields.io/badge/C_MTEB-🤗-yellow">
11
+ </a>
12
+ <a href="https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding">
13
+ <img alt="Build" src="https://img.shields.io/badge/FlagEmbedding-1.1-red">
14
+ </a>
15
+ </p>
16
 
17
  <h4 align="center">
18
  <p>
 
27
  <p>
28
  </h4>
29
 
 
 
30
 
31
  [English](README.md) | [中文](https://github.com/FlagOpen/FlagEmbedding/blob/master/README_zh.md)
32
 
33
+ <span style="color:#FF69B4;"> **Hiring:** We're seeking experienced NLP researchers and intern students focusing on dense retrieval and retrieval-augmented LLMs. If you're interested, please feel free to reach out to us via email at zhengliu1026@gmail.com.</span>
34
+
35
+ FlagEmbedding can map any text to a low-dimensional dense vector, which can be used for tasks like retrieval, classification, clustering, and semantic search.
36
+ It can also be used in vector databases for LLMs.
37
 
38
  ************* 🌟**Updates**🌟 *************
39
+ - 10/12/2023: Release [LLM-Embedder](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/llm_embedder), a unified embedding model to support diverse retrieval augmentation needs for LLMs. [Paper](https://arxiv.org/pdf/2310.07554.pdf) :fire:
40
  - 09/15/2023: The [technical report](https://arxiv.org/pdf/2309.07597.pdf) of BGE has been released
41
+ - 09/15/2023: The [massive training data](https://data.baai.ac.cn/details/BAAI-MTP) of BGE has been released
42
  - 09/12/2023: New models:
43
  - **New reranker model**: release cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
44
  - **update embedding model**: release `bge-*-v1.5` embedding model to alleviate the issue of the similarity distribution, and enhance its retrieval ability without instruction.
 
80
  | [BAAI/bge-small-zh](https://huggingface.co/BAAI/bge-small-zh) | Chinese | [Inference](#usage-for-embedding-model) [Fine-tune](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) | a small-scale model but with competitive performance | `为这个句子生成表示以用于检索相关文章:` |
81
 
82
 
83
+ [1\]: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed, and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
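
The rule in note [1] can be sketched in a few lines of Python. This is only an illustration: the helper names `prepare_queries` and `prepare_passages` are hypothetical and are not part of FlagEmbedding, and the default instruction string is simply the one listed for `BAAI/bge-small-zh` in the table above.

```python
from typing import List

# Instruction string copied from the model table row for BAAI/bge-small-zh.
# Other models list their own instruction; treat this default as an example only.
QUERY_INSTRUCTION = "为这个句子生成表示以用于检索相关文章:"


def prepare_queries(
    queries: List[str],
    instruction: str = QUERY_INSTRUCTION,
    for_retrieval: bool = True,
) -> List[str]:
    """Prefix the instruction to each query only when searching for passages."""
    if not for_retrieval:
        # Other tasks: use the original query text unchanged.
        return list(queries)
    return [instruction + query for query in queries]


def prepare_passages(passages: List[str]) -> List[str]:
    """Passages never receive an instruction."""
    return list(passages)
```

The prepared strings would then be passed to whichever encoder you use; only the query side changes between retrieval and other tasks.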
84
 
85
+ [2\]: Unlike the embedding model, the reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by simpler models.
86
+ For example, use a bge embedding model to retrieve the top 100 relevant documents, and then use a bge reranker to re-rank those 100 documents to get the final top-3 results.
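
The two-stage flow in note [2] might look roughly like the sketch below. It is deliberately self-contained: `embed` and `toy_rerank_score` are toy stand-ins (a bag-of-words vector and a term-overlap score), not the bge embedding model or bge reranker, and all function names are hypothetical. In practice you would swap in real model calls at those two points.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_rerank_score(query: str, doc: str) -> float:
    """Toy pair scorer; a stand-in for a cross-encoder that reads query and doc together."""
    query_terms = set(query.lower().split())
    doc_terms = doc.lower().split()
    if not doc_terms:
        return 0.0
    return sum(1 for term in doc_terms if term in query_terms) / len(doc_terms)


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    retrieve_k: int = 100,
    final_k: int = 3,
    rerank_score: Callable[[str, str], float] = toy_rerank_score,
) -> List[str]:
    """Stage 1: cheap vector search over all docs. Stage 2: pairwise re-ranking of the candidates."""
    query_vec = embed(query)
    candidates = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)[:retrieve_k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:final_k]
```

The point of the split is cost: the pairwise scorer runs only on `retrieve_k` candidates rather than on the whole corpus, which is why `retrieve_k` (100 in the note) is much larger than `final_k` (3).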
87
 
88
  All models have been uploaded to Huggingface Hub, and you can see them at https://huggingface.co/BAAI.
89
+ If you cannot open the Huggingface Hub, you can also download the models at https://model.baai.ac.cn/models .
90
 
91
 
92
  ## Frequently asked questions
93
 
94
+ **1. How to fine-tune bge embedding model?**
 
95
 
 
96
  Follow this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune) to prepare data and fine-tune your model.
97
  Some suggestions:
98
  - Mine hard negatives following this [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#hard-negatives), which can improve the retrieval performance.
99
+ - In general, a larger `per_device_train_batch_size` brings better performance. You can make room for a larger batch by enabling `--fp16`, `--deepspeed df_config.json` (for df_config.json, refer to [ds_config.json](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/ds_config.json)), `--gradient_checkpointing`, etc.
100
  - If you pre-train bge on your data, the pre-trained model cannot be directly used to calculate similarity, and it must be fine-tuned with contrastive learning before computing similarity.
101
  - If the accuracy of the fine-tuned model is still not high, it is recommended to use/fine-tune the cross-encoder model (bge-reranker) to re-rank top-k results. Hard negatives are also needed to fine-tune the reranker.
102
 
103
+
 
104
 
105
  <details>
106
  <summary>2. The similarity score between two dissimilar sentences is higher than 0.5</summary>
 
140
 
141
  ### Usage for Embedding Model
142
 
143
+ Here are some examples of using `bge` models with
144
  [FlagEmbedding](#using-flagembedding), [Sentence-Transformers](#using-sentence-transformers), [Langchain](#using-langchain), or [Huggingface Transformers](#using-huggingface-transformers).
145
 
146
  #### Using FlagEmbedding
 
372
 
373
  ### BAAI Embedding
374
 
375
+ We pre-train the models using [retromae](https://github.com/staoxiao/RetroMAE) and train them on large-scale pair data using contrastive learning.
376
  **You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
377
  We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
378
  Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.
379
+ For more training details on bge, see [baai_general_embedding](https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/baai_general_embedding/README.md).
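
To give a rough idea of what contrastive learning on pair data means, here is a minimal sketch of a generic InfoNCE-style loss with in-batch negatives. It is an assumption for illustration, not the training code used for bge: the function name, the temperature value of 0.05, and the use of in-batch negatives are all choices made for this example, and the vectors are assumed to be L2-normalized.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(
    queries: List[Sequence[float]],
    passages: List[Sequence[float]],
    temperature: float = 0.05,  # arbitrary illustrative value
) -> float:
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and every other passage in the batch acts as a negative."""
    if not queries or len(queries) != len(passages):
        raise ValueError("queries and passages must be non-empty and the same length")
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, passage) / temperature for passage in passages]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)
```

Under this formulation, a larger batch supplies more negatives per query, which is one plausible reason batch size matters for this kind of training.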
380
 
381
 
382
 
 
387
  Therefore, it can be used to re-rank the top-k documents returned by embedding model.
388
  We train the cross-encoder on multilingual pair data.
389
  The data format is the same as embedding model, so you can fine-tune it easily following our [example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/reranker).
390
+ For more details, please refer to [./FlagEmbedding/reranker/README.md](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/reranker).
391
+
392
 
393
+ ### Our Contributors:
394
+
395
+ <a href="https://github.com/FlagOpen/FlagEmbedding/graphs/contributors">
396
+ <img src="https://contrib.rocks/image?repo=FlagOpen/FlagEmbedding" />
397
+ </a>
398
 
399
  ## Contact
400
  If you have any questions or suggestions related to this project, feel free to open an issue or pull request.
 
414
  archivePrefix={arXiv},
415
  primaryClass={cs.CL}
416
  }
417
+
418
+ @misc{llm_embedder,
419
+ title={Retrieve Anything To Augment Large Language Models},
420
+ author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
421
+ year={2023},
422
+ eprint={2310.07554},
423
+ archivePrefix={arXiv},
424
+ primaryClass={cs.IR}
425
+ }
426
  ```
427
 
428
  ## License