Commit 43cade0 · liuqi6777 · verified · 1 Parent(s): fca9116

Update README.md

  1. README.md +11 -7
README.md CHANGED
@@ -30,7 +30,7 @@ For more information about ColBERT, please refer to the [ColBERTv1](https://arxi
 
  ## Usage
 
- We strongly recommend following the same usage as original ColBERT to use this model.
+ We strongly recommend following the same usage as the original ColBERT to use this model.
 
  ### Installation
 
@@ -51,7 +51,9 @@ experiment: str = "" # Name of the folder where the logs and created indices wi
  index_name: str = "" # The name of your index, i.e. the name of your vector database
 
  with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
-     config = ColBERTConfig(doc_maxlen=8192) # Our model supports 8k context length for indexing long documents
+     config = ColBERTConfig(
+         doc_maxlen=8192 # Our model supports 8k context length for indexing long documents
+     )
      indexer = Indexer(
          checkpoint="jinaai/jina-colbert-v1-en",
          config=config,
@@ -76,11 +78,13 @@ index_name: str = "" # Name of your previously created index where the document
  k: int = 10 # how many results you want to retrieve
 
  with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
-     config = ColBERTConfig(query_maxlen=128) # Although the model supports 8k context length, we suggest not to use a very long query, as it may cause significant computational complexity and CUDA memory usage.
+     config = ColBERTConfig(
+         query_maxlen=128 # Although the model supports 8k context length, we suggest not to use a very long query, as it may cause significant computational complexity and CUDA memory usage.
+     )
      searcher = Searcher(
          index=index_name,
          config=config
-     ) # You don't need to specify checkpoint again, the model name is stored in the index.
+     ) # You don't need to specify the checkpoint again, the model name is stored in the index.
      query = "How to use ColBERT for indexing long documents?"
      results = searcher.search(query, k=k)
      # results: tuple of tuples of length k containing ((passage_id, passage_rank, passage_score), ...)
@@ -92,7 +96,7 @@ with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
 
  ### In-domain benchmarks
 
- We evaluate the in-domain performance on the dev subset of MSMARCO passage ranking dataset. We follow the same evaluation settings in ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
+ We evaluate the in-domain performance on the dev subset of MSMARCO passage ranking dataset. We follow the same evaluation settings in the ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
 
  | Model | MRR@10 | Recall@50 | Recall@1k |
  | --- | :---: | :---: | :---: |
@@ -101,7 +105,7 @@ We evaluate the in-domain performance on the dev subset of MSMARCO passage ranki
 
  ### Out-of-domain benchmarks
 
- Following ColBERTv2, we evaluate the out-of-domain performance on 13 public BEIR datasets and use NDCG@10 as the main metric. We follow the same evaluation settings in ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
+ Following ColBERTv2, we evaluate the out-of-domain performance on 13 public BEIR datasets and use NDCG@10 as the main metric. We follow the same evaluation settings in the ColBERTv2 paper and rerun the results of ColBERTv2 using the released checkpoint.
 
  Note that both ColBERTv2 and Jina-ColBERT-v1 only employ MSMARCO passage ranking dataset for training, so below results are the fully zero-shot performance.
 
@@ -124,7 +128,7 @@ Note that both ColBERTv2 and Jina-ColBERT-v1 only employ MSMARCO passage ranking
 
  ### Long context datasets
 
- We also evaluate the zero-shot performance on datasets in where documents have longer context length and compare with some long-context embedding models.
+ We also evaluate the zero-shot performance on datasets where documents have longer context length and compare with some long-context embedding models. Here we use the [LoCo benchmark](https://www.together.ai/blog/long-context-retrieval-models-with-monarch-mixer), which contains 5 datasets with long context length.
 
  | Model | Avg. NDCG@10 | Model max context length | Used context length |
  | --- | :---: | :---: | :---: |