Text Classification
sentence-transformers
Safetensors
xmod
passage-reranking
antoinelouis committed on
Commit
1fcfa5d
1 Parent(s): 7a1b8ec

Create README.md

  1. README.md +241 -0
README.md ADDED
@@ -0,0 +1,241 @@
1
+ ---
2
+ pipeline_tag: sentence-similarity
3
+ datasets:
4
+ - ms_marco
5
+ - sentence-transformers/msmarco-hard-negatives
6
+ metrics:
7
+ - recall
8
+ tags:
9
+ - feature-extraction
10
+ - sentence-similarity
11
+ library_name: sentence-transformers
12
+ inference: false
13
+ language:
14
+ - multilingual
15
+ - af
16
+ - am
17
+ - ar
18
+ - az
19
+ - be
20
+ - bg
21
+ - bn
22
+ - ca
23
+ - cs
24
+ - cy
25
+ - da
26
+ - de
27
+ - el
28
+ - en
29
+ - eo
30
+ - es
31
+ - et
32
+ - eu
33
+ - fa
34
+ - fi
35
+ - fr
36
+ - ga
37
+ - gl
38
+ - gu
39
+ - ha
40
+ - he
41
+ - hi
42
+ - hr
43
+ - hu
44
+ - hy
45
+ - id
46
+ - is
47
+ - it
48
+ - ja
49
+ - ka
50
+ - kk
51
+ - km
52
+ - kn
53
+ - ko
54
+ - ku
55
+ - ky
56
+ - la
57
+ - lo
58
+ - lt
59
+ - lv
60
+ - mk
61
+ - ml
62
+ - mn
63
+ - mr
64
+ - ms
65
+ - my
66
+ - ne
67
+ - nl
68
+ - no
69
+ - or
70
+ - pa
71
+ - pl
72
+ - ps
73
+ - pt
74
+ - ro
75
+ - ru
76
+ - sa
77
+ - si
78
+ - sk
79
+ - sl
80
+ - so
81
+ - sq
82
+ - sr
83
+ - sv
84
+ - sw
85
+ - ta
86
+ - te
87
+ - th
88
+ - tl
89
+ - tr
90
+ - uk
91
+ - ur
92
+ - uz
93
+ - vi
94
+ - zh
95
+ ---
96
+
97
+ <h1 align="center">Mono-XM</h1>
98
+
99
+
100
+ <h4 align="center">
101
+ <p>
102
+ <a href="#usage">🛠️ Usage</a> |
103
+ <a href="#evaluation">📊 Evaluation</a> |
104
+ <a href="#training">🤖 Training</a> |
105
+ <a href="#citation">🔗 Citation</a> |
106
+ <a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a>
107
+ </p>
108
+ </h4>
109
+
110
+
111
+ This is a [sentence-transformers](https://www.sbert.net/examples/applications/cross-encoder/README.html) model. It performs cross-attention between a question-passage
112
+ pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search: given a query, score it jointly with each of a set of candidate
113
+ passages -- e.g., retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions.
114
+ The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning
115
+ in a high-resource language, like English, and perform zero-shot transfer to other languages.
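
The retrieve-then-rerank flow described above can be sketched in a few lines. This is only an illustration: `rerank` and `toy_overlap_score` are hypothetical names, and the toy scorer is a stand-in for the real model so that the snippet runs without downloading any weights.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


candidates = ["berlin is in germany", "paris is the capital of france"]
for passage, score in rerank("capital of france", candidates, toy_overlap_score):
    print(f"{score:.2f}  {passage}")
```

With the actual model, `score_fn` would presumably be the `predict` method of the `CrossEncoder` loaded in the usage examples of this README, after the language adapters have been activated.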
116
+
117
+ ## Usage
118
+
119
+ Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-transformers).
120
+
121
+ #### Using Sentence-Transformers
122
+
123
+ Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
124
+
125
+ ```python
126
+ from sentence_transformers import CrossEncoder
127
+
128
+ pairs = [
129
+ ('Première question', 'Ceci est un paragraphe pertinent.'),
130
+ ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
131
+ ]
132
+ language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages
133
+
134
+ model = CrossEncoder('antoinelouis/mono-xm')
135
+ model.model.set_default_language(language_code)  # Activate the language-specific adapters
136
+
137
+ scores = model.predict(pairs)
138
+ print(scores)
139
+ ```
140
+
141
+ #### Using FlagEmbedding
142
+
143
+ Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
144
+
145
+ ```python
146
+ from FlagEmbedding import FlagReranker
147
+
148
+ pairs = [
149
+ ('Première question', 'Ceci est un paragraphe pertinent.'),
150
+ ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
151
+ ]
152
+ language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages
153
+
154
+ model = FlagReranker('antoinelouis/mono-xm')
155
+ model.model.set_default_language(language_code)  # Activate the language-specific adapters
156
+
157
+ scores = model.compute_score(pairs)
158
+ print(scores)
159
+ ```
160
+
161
+ #### Using Transformers
162
+
163
+ Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
164
+
165
+ ```python
166
+ import torch
167
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
168
+
169
+ pairs = [
170
+ ('Première question', 'Ceci est un paragraphe pertinent.'),
171
+ ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
172
+ ]
173
+ language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages
174
+
175
+ tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm')
176
+ model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm')
177
+ model.set_default_language(language_code)  # Activate the language-specific adapters
178
+
179
+ features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
180
+ with torch.no_grad():
181
+     scores = model(**features).logits
182
+ print(scores)
183
+ ```
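
The Transformers snippet above prints raw logits, whereas the model description speaks of relevance scores between 0 and 1. Below is a minimal sketch of the conversion, assuming the classification head emits a single logit per pair (an assumption, not something this README confirms). `sigmoid` and `rank_by_logit` are hypothetical helper names, written in plain Python so the example runs without the model.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid, mapping any real logit into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rank_by_logit(passages: Sequence[str], logits: Sequence[float]) -> List[Tuple[str, float]]:
    """Convert logits to scores and sort passages by decreasing score."""
    scored = [(passage, sigmoid(logit)) for passage, logit in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up logits for illustration only, not real model outputs.
print(rank_by_logit(["passage A", "passage B"], [-1.0, 2.0]))
```

On PyTorch tensors, the same mapping would presumably be `torch.sigmoid(scores)`. Since the sigmoid is monotonic, it changes the scale of the scores but not the ranking.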
184
+
185
+ ***
186
+
187
+ ## Evaluation
188
+
189
+ - **mMARCO**:
190
+ We evaluate the model on the small development sets of [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco), which consist of 6,980 queries for a corpus of 8.8M candidate passages in 14 languages. Below, we compare its multilingual performance with other retrieval models on the dataset's official metric, i.e., mean reciprocal rank at cut-off 10 (MRR@10).
191
+
192
+ | | model | Type | #Samples | #Params | en | es | fr | it | pt | id | de | ru | zh | ja | nl | vi | hi | ar | Avg. |
193
+ |---:|:----------------------------------------------------------------------------------------------------------------------------------------|:--------------|:--------:|:-------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
194
+ | 1 | BM25 ([Pyserini](https://github.com/castorini/pyserini)) | lexical | - | - | 18.4 | 15.8 | 15.5 | 15.3 | 15.2 | 14.9 | 13.6 | 12.4 | 11.6 | 14.1 | 14.0 | 13.6 | 13.4 | 11.1 | 14.2 |
195
+ | 2 | mono-mT5 ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897)) | cross-encoder | 12.8M | 390M | 36.6 | 31.4 | 30.2 | 30.3 | 30.2 | 29.8 | 28.9 | 26.3 | 24.9 | 26.7 | 29.2 | 25.6 | 26.6 | 23.5 | 28.6 |
196
+ | 3 | mono-mMiniLM ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897)) | cross-encoder | 80.0M | 107M | 36.6 | 30.9 | 29.6 | 29.1 | 28.9 | 29.3 | 27.8 | 25.1 | 24.9 | 26.3 | 27.6 | 24.7 | 26.2 | 21.9 | 27.8 |
197
+ | 4 | [DPR-X](https://huggingface.co/eugene-yang/dpr-xlmr-large-mtt-neuclir) ([Yang et al., 2022](https://doi.org/10.48550/arXiv.2204.11989)) | single-vector | 25.6M | 550M | 24.5 | 19.6 | 18.9 | 18.3 | 19.0 | 16.9 | 18.2 | 17.7 | 14.8 | 15.4 | 18.5 | 15.1 | 15.4 | 12.9 | 17.5 |
198
+ | 5 | [mE5-base](https://huggingface.co/intfloat/multilingual-e5-base) ([Wang et al., 2024](https://doi.org/10.48550/arXiv.2402.05672)) | single-vector | 5.1B | 278M | 35.0 | 28.9 | 30.3 | 28.0 | 27.5 | 26.1 | 27.1 | 24.5 | 22.9 | 25.0 | 27.3 | 23.9 | 24.2 | 20.5 | 26.5 |
199
+ | 6 | mColBERT ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897)) | multi-vector | 25.6M | 180M | 35.2 | 30.1 | 28.9 | 29.2 | 29.2 | 27.5 | 28.1 | 25.0 | 24.6 | 23.6 | 27.3 | 18.0 | 23.2 | 20.9 | 26.5 |
200
+ | | | | | | | | | | | | | | | | | | | | |
201
+ | 7 | [DPR-XM](https://huggingface.co/antoinelouis/dpr-xm) (ours) | single-vector | 25.6M | 277M | 32.7 | 23.6 | 23.5 | 22.3 | 22.7 | 22.0 | 22.1 | 19.9 | 18.1 | 18.7 | 22.9 | 18.0 | 16.0 | 15.1 | 21.3 |
202
+ | 8 | [ColBERT-XM](https://huggingface.co/antoinelouis/colbert-xm) (ours) | multi-vector | 6.4M | 277M | 37.2 | 28.5 | 26.9 | 26.5 | 27.6 | 26.3 | 27.0 | 25.1 | 24.6 | 24.1 | 27.5 | 22.6 | 23.8 | 19.5 | 26.2 |
203
+ | 9 | **Mono-XM** (ours) | cross-encoder | 1.0M | 277M | | | | | | | | | | | | | | | |
204
+
205
+ NB: Evaluation of Mono-XM is not performed by considering the entire corpus but by reranking for each query a set of passages containing one or several positive passages and
206
+ a maximum of 200 negative passages obtained with BM25.
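
For reference, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears). The following is a minimal sketch rather than the official evaluation script; `mrr_at_k` and the toy identifiers are hypothetical.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], positives: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top-k results."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        relevant = positives.get(query_id, set())
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings) if rankings else 0.0


# Toy data: q1 hits at rank 2, q2 at rank 1, q3 only at rank 11 (outside the cut-off).
rankings = {
    "q1": ["p3", "p1"],
    "q2": ["p9", "p8", "p7"],
    "q3": ["x"] * 10 + ["p5"],
}
positives = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p5"}}
print(mrr_at_k(rankings, positives))  # (1/2 + 1 + 0) / 3
```

The values in the table above appear to be expressed as percentages, i.e., this quantity multiplied by 100.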
207
+
208
+ ***
209
+
210
+ ## Training
211
+
212
+ #### Data
213
+
214
+ We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains
215
+ 8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (q, p) pairs, one quarter of which are positive
216
+ (i.e., 250k query-positive pairs for 750k query-negative pairs).
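
The exact sampling procedure is not described here, so the following is only a sketch of how such a labelled pair set could be assembled. `sample_pairs`, its arguments, and the seed are hypothetical; only the 1M total and the 250k/750k split come from the text above.

```python
import random
from typing import Dict, List, Tuple


def sample_pairs(
    positives: Dict[str, List[str]],
    negatives: Dict[str, List[str]],
    num_pairs: int,
    positive_fraction: float = 0.25,
    seed: int = 42,
) -> List[Tuple[str, str, int]]:
    """Return shuffled (query, passage, label) triples with the requested share of positives."""
    rng = random.Random(seed)
    num_pos = int(num_pairs * positive_fraction)
    num_neg = num_pairs - num_pos
    pos_pool = [(q, p, 1) for q, passages in positives.items() for p in passages]
    neg_pool = [(q, p, 0) for q, passages in negatives.items() for p in passages]
    # random.sample raises ValueError if a pool is smaller than the requested count.
    pairs = rng.sample(pos_pool, num_pos) + rng.sample(neg_pool, num_neg)
    rng.shuffle(pairs)
    return pairs


# Toy data: 2 positives and 6 BM25-style negatives give 8 pairs, a quarter of them positive.
toy_pos = {"q1": ["p1"], "q2": ["p2"]}
toy_neg = {"q1": ["n1", "n2", "n3"], "q2": ["n4", "n5", "n6"]}
toy_pairs = sample_pairs(toy_pos, toy_neg, num_pairs=8)
print(sum(label for _, _, label in toy_pairs), "positives out of", len(toy_pairs))
```

At the scale described above, `num_pairs=1_000_000` with the default fraction yields the stated 250k positive and 750k negative pairs.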
217
+
218
+ #### Implementation
219
+
220
+ The model is initialized from the [xmod-base](https://huggingface.co/facebook/xmod-base) checkpoint and optimized via the binary cross-entropy loss
221
+ (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with
222
+ a batch size of 32, a peak learning rate of 2e-5 with warm-up over the first 10% of training steps and linear scheduling. We set the maximum sequence
223
+ length for the concatenated question-passage pairs to 512 tokens.
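
To make these hyperparameters concrete, here is a small sketch of the step count, the learning-rate schedule, and the per-pair loss. It rests on several assumptions that the text does not state: one optimizer step per batch (no gradient accumulation), the 1M training pairs from the Data section, and a linear decay to zero after warm-up. All function names are hypothetical.

```python
import math


def total_training_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps, assuming one step per batch."""
    return math.ceil(num_samples / batch_size) * epochs


def linear_warmup_decay_lr(step: int, total_steps: int, peak_lr: float = 2e-5,
                           warmup_ratio: float = 0.1) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero (assumed end value)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


def bce_with_logit(logit: float, label: float) -> float:
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


steps = total_training_steps(num_samples=1_000_000, batch_size=32, epochs=5)
print(steps)                                 # 31,250 steps per epoch times 5 epochs
print(linear_warmup_decay_lr(0, steps))      # start of warm-up
print(linear_warmup_decay_lr(15625, steps))  # end of warm-up, peak learning rate
print(bce_with_logit(0.0, 1.0))              # an uninformative logit costs log(2)
```

Under these assumptions training runs for 156,250 steps, of which the first 15,625 are warm-up.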
224
+
225
+ ***
226
+
227
+ ## Citation
228
+
229
+ ```bibtex
230
+ @article{louis2024modular,
231
+ author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
232
+ title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
233
+ journal = {CoRR},
234
+ volume = {abs/2402.15059},
235
+ year = {2024},
236
+ url = {https://arxiv.org/abs/2402.15059},
237
+ doi = {10.48550/arXiv.2402.15059},
238
+ eprinttype = {arXiv},
239
+ eprint = {2402.15059},
240
+ }
241
+ ```