---
pipeline_tag: text-classification
datasets:
- ms_marco
- sentence-transformers/msmarco-hard-negatives
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: facebook/xmod-base
inference: false
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- uz
- vi
- zh
---
<h1 align="center">Mono-XM</h1>
<h4 align="center">
<p>
<a href="#usage">🛠️ Usage</a> |
<a href="#evaluation">📊 Evaluation</a> |
<a href="#train">🤖 Training</a> |
<a href="#citation">🔗 Citation</a> |
<a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a>
</p>
</h4>
This is a **multilingual** cross-encoder model. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1.
The model should be used as a reranker for semantic search: given a query, encode it together with each candidate
passage -- e.g., passages retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions.
The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning
in a high-resource language, such as English, and to perform zero-shot transfer to other languages.
## Usage
Here are some examples for using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Huggingface Transformers](#using-huggingface-transformers).
#### Using Sentence-Transformers
Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
```python
from sentence_transformers import CrossEncoder
pairs = [
('Première question', 'Ceci est un paragraphe pertinent.'),
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages
model = CrossEncoder('antoinelouis/mono-xm')
model.model.set_default_language(language_code) #Activate the language-specific adapters
scores = model.predict(pairs)
print(scores)
```
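In a typical reranking workflow, a single query is scored against all of its candidate passages and the candidates are then sorted by decreasing score. A minimal sketch (the query and passages below are made-up placeholders, not actual retrieval output):
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('antoinelouis/mono-xm')
model.model.set_default_language("fr_FR")  # Activate the language-specific adapters

# Hypothetical query and candidate passages (e.g., retrieved with BM25 or a bi-encoder).
query = "Quelle est la capitale de la Belgique ?"
candidates = [
    "Bruxelles est la capitale de la Belgique.",
    "Paris est connue pour la tour Eiffel.",
    "La Belgique compte trois langues officielles.",
]

# Score every (query, passage) pair, then sort the passages by decreasing relevance.
scores = model.predict([(query, passage) for passage in candidates])
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}\t{passage}")
```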
#### Using FlagEmbedding
Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
```python
from FlagEmbedding import FlagReranker
pairs = [
('Première question', 'Ceci est un paragraphe pertinent.'),
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages
model = FlagReranker('antoinelouis/mono-xm')
model.model.set_default_language(language_code) #Activate the language-specific adapters
scores = model.compute_score(pairs)
print(scores)
```
#### Using HuggingFace Transformers
Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
pairs = [
('Première question', 'Ceci est un paragraphe pertinent.'),
('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm')
model.set_default_language(language_code) #Activate the language-specific adapters
features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
scores = model(**features).logits
print(scores)
```
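Note that `model(**features).logits` returns raw logits. Assuming the classification head has a single output unit (typical for rerankers trained with a binary cross-entropy objective, but worth verifying via `model.config.num_labels`), they can be mapped to relevance scores in [0, 1] with a sigmoid:
```python
# Continuing the snippet above. Assumption: single-logit head (check model.config.num_labels).
relevance_scores = torch.sigmoid(scores).squeeze(-1)
print(relevance_scores)
```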
***
## Evaluation
[to come...]
***
## Training
#### Data
We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains
8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (query, passage) pairs in which one pair out of four
is positive (i.e., 250K query-positive pairs and 750K query-negative pairs), as sketched below.
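The sampling procedure can be summarized as follows. This is an illustrative sketch only, with hypothetical `qrels` and `bm25_negatives` mappings; it is not the exact script used for training:
```python
import random

# Hypothetical inputs: qrels maps a query id to its relevant passage ids, and
# bm25_negatives maps a query id to BM25-retrieved passage ids judged non-relevant.
def sample_pairs(qrels, bm25_negatives, num_pairs=1_000_000, positive_fraction=0.25, seed=42):
    rng = random.Random(seed)
    num_pos = int(num_pairs * positive_fraction)  # 250K query-positive pairs
    num_neg = num_pairs - num_pos                 # 750K query-negative pairs
    positives = [(qid, pid, 1) for qid, pids in qrels.items() for pid in pids]
    negatives = [(qid, pid, 0) for qid, pids in bm25_negatives.items() for pid in pids]
    pairs = rng.sample(positives, num_pos) + rng.sample(negatives, num_neg)
    rng.shuffle(pairs)
    return pairs  # List of (query_id, passage_id, label) triples
```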
#### Implementation
The model is initialized from the [xmod-base](https://huggingface.co/facebook/xmod-base) checkpoint and optimized via the binary cross-entropy loss
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with
a batch size of 32 and a peak learning rate of 2e-5, with warm-up over the first 10% of training steps followed by linear decay. We set the maximum sequence
length for the concatenated question-passage pairs to 512 tokens.
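This setup can be approximated with the Sentence-Transformers `CrossEncoder` training API. The snippet below is a simplified sketch with toy training samples, not the exact training script; with `num_labels=1`, the `CrossEncoder` defaults to a binary cross-entropy objective (monoBERT-style) and a linear warm-up/decay schedule:
```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Toy (query, passage, label) samples; label 1.0 = relevant, 0.0 = not relevant.
train_samples = [
    InputExample(texts=["what is the capital of Belgium?", "Brussels is the capital of Belgium."], label=1.0),
    InputExample(texts=["what is the capital of Belgium?", "The Eiffel Tower is in Paris."], label=0.0),
]

# num_labels=1 -> binary cross-entropy loss; max_length=512 for the concatenated pair.
model = CrossEncoder('facebook/xmod-base', num_labels=1, max_length=512)
model.model.set_default_language("en_XX")  # Fine-tune on English only (zero-shot transfer later)

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)
num_steps = len(train_dataloader) * 5
model.fit(
    train_dataloader=train_dataloader,
    epochs=5,
    warmup_steps=int(0.1 * num_steps),  # Warm-up over the first 10% of training steps
    optimizer_params={"lr": 2e-5},      # AdamW with a peak learning rate of 2e-5
    output_path="output/mono-xm",
)
```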
***
## Citation
```bibtex
@article{louis2024modular,
author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
journal = {CoRR},
volume = {abs/2402.15059},
year = {2024},
url = {https://arxiv.org/abs/2402.15059},
doi = {10.48550/arXiv.2402.15059},
eprinttype = {arXiv},
eprint = {2402.15059},
}
```