liuqi6777 committed on
Commit
fca9116
·
verified ·
1 Parent(s): a74bd24

Create README.md

  1. README.md +156 -0
README.md ADDED
@@ -0,0 +1,156 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - ColBERT
+ - passage-retrieval
+ datasets:
+ - ms_marco
+ ---
+
+ <br><br>
+
+ <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+
+ <p align="center">
+ <b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
+ </p>
+
+ # Jina-ColBERT
+
+ ### Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both _8k context length_ and _fast and accurate retrieval_.
+
+ [JinaBERT](https://arxiv.org/abs/2310.19923) is a BERT architecture that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence lengths. The Jina-ColBERT model is trained on the MSMARCO passage ranking dataset, following a training procedure very similar to that of ColBERTv2. The only difference is that we use `jina-bert-v2-base-en` as the backbone instead of `bert-base-uncased`.
+
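To make the symmetric ALiBi idea concrete: instead of learned position embeddings, each attention head adds a bias of `-m * |i - j|` to the attention logit between positions `i` and `j`, so distant tokens are penalized equally in both directions. The following is a minimal sketch in plain Python, not JinaBERT's actual code. The function names are hypothetical, and the slope schedule is the geometric sequence described in the ALiBi paper, assuming a power-of-two number of heads.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Geometric slope schedule from the ALiBi paper.

    Assumption: num_heads is a power of two (e.g. 8 heads -> 1/2, 1/4, ..., 1/256).
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to the attention logits of one head: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


slopes = alibi_slopes(8)
bias = symmetric_alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the bias depends only on the distance between positions, the same formula applies to sequences longer than those seen during training, which is what makes the longer context length possible.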
+ For more information about ColBERT, please refer to the [ColBERTv1](https://arxiv.org/abs/2004.12832) and [ColBERTv2](https://arxiv.org/abs/2112.01488v3) papers, and [the original code](https://github.com/stanford-futuredata/ColBERT).
+
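For readers new to ColBERT, its scoring rule (late interaction, often called MaxSim) can be sketched in a few lines: every query token embedding is matched against its most similar document token embedding, and those maxima are summed. The snippet below is an illustrative sketch with hypothetical names and toy two-dimensional vectors. The real implementation works on normalized model embeddings and uses compressed indexes and GPU kernels.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values, already unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` scores higher because it contains an exact match for each query token, while `doc_b` only partially matches both.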
+ ## Usage
+
+ We strongly recommend using this model the same way you would use the original ColBERT.
+
+ ### Installation
+
+ To use this model, you will need to install the latest version of ColBERT from its repository:
+
+ ```bash
+ pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
+ ```
+
+ ### Indexing
+
+ ```python
+ from colbert import Indexer
+ from colbert.infra import Run, RunConfig, ColBERTConfig
+
+ n_gpu: int = 1  # Set your number of available GPUs
+ experiment: str = ""  # Name of the folder where the logs and created indices will be stored
+ index_name: str = ""  # The name of your index, i.e. the name of your vector database
+
+ with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
+     config = ColBERTConfig(doc_maxlen=8192)  # Our model supports 8k context length for indexing long documents
+     indexer = Indexer(
+         checkpoint="jinaai/jina-colbert-v1-en",
+         config=config,
+     )
+     documents = [
+         "ColBERT is an efficient and effective passage retrieval model.",
+         "Jina-ColBERT is a ColBERT-style model based on JinaBERT, so it supports both 8k context length and fast and accurate retrieval.",
+         ...
+     ]
+     indexer.index(name=index_name, collection=documents)
+ ```
+
+ ### Searching
+
+ ```python
+ from colbert import Searcher
+ from colbert.infra import Run, RunConfig, ColBERTConfig
+
+ n_gpu: int = 0
+ experiment: str = ""  # Name of the folder where the logs and created indices will be stored
+ index_name: str = ""  # Name of your previously created index where the documents you want to search are stored
+ k: int = 10  # How many results you want to retrieve
+
+ with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
+     # Although the model supports 8k context length, we suggest avoiding very long queries,
+     # which can significantly increase computation and CUDA memory usage.
+     config = ColBERTConfig(query_maxlen=128)
+     searcher = Searcher(
+         index=index_name,
+         config=config
+     )  # You don't need to specify the checkpoint again; the model name is stored in the index.
+     query = "How to use ColBERT for indexing long documents?"
+     results = searcher.search(query, k=k)
+     # results: a tuple of three parallel lists (passage_ids, passage_ranks, passage_scores), each with up to k entries
+ ```
+
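The value returned by the search call above can be turned into per-passage records as sketched below. This assumes, as in the upstream ColBERT README, that `Searcher.search` returns three parallel lists (passage ids, ranks, scores) that are iterated with `zip(*results)`. The `results` literal and the `to_hits` helper are hypothetical stand-ins so the snippet runs without an index.

```python
# Hypothetical stand-in for the value returned by searcher.search(query, k=3):
# three parallel lists of passage ids, ranks, and scores.
results = ([12, 7, 42], [1, 2, 3], [27.5, 25.1, 19.8])


def to_hits(results):
    """Convert the three parallel lists into one record per retrieved passage."""
    return [
        {"passage_id": pid, "rank": rank, "score": score}
        for pid, rank, score in zip(*results)
    ]


hits = to_hits(results)
for hit in hits:
    print(f"[{hit['rank']}] passage {hit['passage_id']} scored {hit['score']:.1f}")
```

With a real `Searcher`, the passage text for each id should be available through the searcher's collection; check the ColBERT repository for the exact accessor.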
+ ## Evaluation Results
+
+ **TL;DR:** Jina-ColBERT achieves retrieval performance competitive with [ColBERTv2](https://huggingface.co/colbert-ir/colbertv2.0) on all benchmarks, and outperforms ColBERTv2 on datasets where documents are longer.
+
+ ### In-domain benchmarks
+
+ We evaluate the in-domain performance on the dev subset of the MSMARCO passage ranking dataset. We follow the same evaluation settings as the ColBERTv2 paper and reproduce the ColBERTv2 results using the released checkpoint.
+
+ | Model | MRR@10 | Recall@50 | Recall@1k |
+ | --- | :---: | :---: | :---: |
+ | ColBERTv2 | 39.7 | 86.8 | 97.6 |
+ | Jina-ColBERT-v1 | 39.0 | 85.6 | 96.2 |
+
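As a reminder of what the in-domain metrics measure, here is a minimal sketch of MRR@k and Recall@k on toy data. The function names, query ids, and rankings are hypothetical and only illustrate the definitions. The reported numbers come from the standard MSMARCO evaluation, not from this code.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


# Hypothetical toy data: query id -> ranked passage ids, and query id -> relevant ids.
rankings = {"q1": [3, 1, 2], "q2": [5, 6, 7]}
qrels = {"q1": {1}, "q2": {7, 9}}

mrr_at_10 = sum(reciprocal_rank_at_k(rankings[q], qrels[q]) for q in qrels) / len(qrels)
recall_at_3 = sum(recall_at_k(rankings[q], qrels[q], k=3) for q in qrels) / len(qrels)
print(f"MRR@10 = {100 * mrr_at_10:.1f}, Recall@3 = {100 * recall_at_3:.1f}")
```

Both metrics are averaged over queries and, as in the table, usually reported multiplied by 100.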
+ ### Out-of-domain benchmarks
+
+ Following ColBERTv2, we evaluate the out-of-domain performance on 13 public BEIR datasets and use NDCG@10 as the main metric. We follow the same evaluation settings as the ColBERTv2 paper and reproduce the ColBERTv2 results using the released checkpoint.
+
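For reference, NDCG@10 rewards rankings that place highly relevant documents near the top. The sketch below uses the linear-gain formulation (gain equal to the relevance label, discounted by `log2(rank + 1)`); some definitions use `2^rel - 1` as the gain instead. The function names and toy judgments are hypothetical, and this is an illustration rather than the evaluation code behind the numbers reported here.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain; `relevance` maps document id -> graded label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments for one query.
relevance = {"a": 2, "b": 1}
perfect = ndcg_at_k(["a", "b", "c"], relevance)
swapped = ndcg_at_k(["b", "a", "c"], relevance)
print(f"perfect = {perfect:.3f}, swapped = {swapped:.3f}")
```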
+ Note that both ColBERTv2 and Jina-ColBERT-v1 use only the MSMARCO passage ranking dataset for training, so the results below are fully zero-shot.
+
+ | Dataset | ColBERTv2 | Jina-ColBERT-v1 |
+ | --- | :---: | :---: |
+ | ArguAna | 46.5 | 49.4 |
+ | ClimateFEVER | 18.1 | 19.6 |
+ | DBPedia | 45.2 | 41.3 |
+ | FEVER | 78.8 | 79.5 |
+ | FiQA | 35.4 | 36.8 |
+ | HotPotQA | 67.5 | 65.6 |
+ | NFCorpus | 33.7 | 33.8 |
+ | NQ | 56.1 | 54.9 |
+ | Quora | 85.5 | 82.3 |
+ | SCIDOCS | 15.4 | 16.9 |
+ | SciFact | 68.9 | 70.1 |
+ | TREC-COVID | 72.6 | 75.0 |
+ | Webis-touché2020 | 26.0 | 27.0 |
+ | Average | 50.0 | 50.2 |
+
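The "Average" row can be reproduced from the per-dataset scores as an unweighted mean over the 13 datasets. The short check below copies the values from the table; the variable names are only for illustration.

```python
# NDCG@10 per dataset, copied from the table: (ColBERTv2, Jina-ColBERT-v1).
beir_ndcg10 = {
    "ArguAna": (46.5, 49.4),
    "ClimateFEVER": (18.1, 19.6),
    "DBPedia": (45.2, 41.3),
    "FEVER": (78.8, 79.5),
    "FiQA": (35.4, 36.8),
    "HotPotQA": (67.5, 65.6),
    "NFCorpus": (33.7, 33.8),
    "NQ": (56.1, 54.9),
    "Quora": (85.5, 82.3),
    "SCIDOCS": (15.4, 16.9),
    "SciFact": (68.9, 70.1),
    "TREC-COVID": (72.6, 75.0),
    "Webis-touché2020": (26.0, 27.0),
}

n = len(beir_ndcg10)
colbertv2_avg = sum(v[0] for v in beir_ndcg10.values()) / n
jina_avg = sum(v[1] for v in beir_ndcg10.values()) / n
jina_wins = sum(1 for v in beir_ndcg10.values() if v[1] > v[0])

print(f"ColBERTv2: {colbertv2_avg:.1f}, Jina-ColBERT-v1: {jina_avg:.1f}")
print(f"Jina-ColBERT-v1 scores higher on {jina_wins} of {n} datasets")
```

The means round to 50.0 and 50.2, matching the table, and Jina-ColBERT-v1 is ahead on 9 of the 13 datasets.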
+ ### Long context datasets
+
+ We also evaluate the zero-shot performance on datasets where documents are longer, and compare with some long-context embedding models.
+
+ | Model | Avg. NDCG@10 | Model max context length | Used context length |
+ | --- | :---: | :---: | :---: |
+ | ColBERTv2 | 74.3 | 512 | 512 |
+ | Jina-ColBERT-v1 | 75.5 | 8192 | 512 |
+ | Jina-ColBERT-v1 | 83.7 | 8192 | 8192* |
+ | Jina-embeddings-v2-base-en | 85.4 | 8192 | 8192 |
+
+ \* denotes that we used a context length of 8192 for documents, while the query length is still 512.
+
+ **To summarize, Jina-ColBERT achieves performance comparable to ColBERTv2 on all benchmarks, and outperforms ColBERTv2 on datasets where documents are longer.**
+
+ ## Plans
+
+ - We will evaluate the performance of Jina-ColBERT as a reranker in a retrieval pipeline and add usage examples.
+ - We plan to improve the performance of Jina-ColBERT by fine-tuning on more datasets in the future!
+
+ ## Other Models
+
+ Additionally, we provide the following embedding models, which you can also use for retrieval.
+
+ - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
+ - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.
+ - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.
+ - [`jina-embeddings-v2-base-es`](): 161 million parameters, Spanish-English bilingual model (soon).
+
+ ## Contact
+
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.