---
pipeline_tag: text-classification
datasets:
- ms_marco
- sentence-transformers/msmarco-hard-negatives
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: facebook/xmod-base
inference: false
language: 
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- uz
- vi
- zh
---

<h1 align="center">Mono-XM</h1>


<h4 align="center">
  <p>
      <a href="#usage">🛠️ Usage</a>  |
      <a href="#evaluation">📊 Evaluation</a> |
      <a href="#training">🤖 Training</a> |
      <a href="#citation">🔗 Citation</a> |
      <a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a>
  </p>
</h4>


This is a **multilingual** cross-encoder model. It performs cross-attention between a question-passage 
pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search: given a query, score each candidate 
passage (e.g., retrieved with BM25 or a bi-encoder) jointly with the query, then sort the passages in decreasing order of relevance according to the model's predictions. 
The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning 
in a high-resource language, like English, and perform zero-shot transfer to other languages.

## Usage

Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-transformers).

#### Using Sentence-Transformers

Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:

```python
from sentence_transformers import CrossEncoder

pairs = [
  ('Première question', 'Ceci est un paragraphe pertinent.'),
  ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_XX"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

model = CrossEncoder('antoinelouis/mono-xm')
model.model.set_default_language(language_code)  # Activate the language-specific adapters

scores = model.predict(pairs)
print(scores)
```
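The reranking workflow described in the introduction (score every query-passage pair, then sort by decreasing score) can be sketched as follows. This is a minimal illustration, not part of the released code: `rerank` and `toy_score_fn` are hypothetical names, and the toy scorer is only a word-overlap stand-in so the snippet runs without downloading the model.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair and return passages sorted by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = [float(s) for s in score_fn(pairs)]
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def toy_score_fn(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(1, len(query_words)))
    return scores


candidates = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "Bananas are yellow",
]
ranked = rerank("capital of France", candidates, toy_score_fn)
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```

With the cross-encoder loaded as in the example above, passing `model.predict` as `score_fn` should be enough, since it accepts a list of pairs and returns one score per pair.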

#### Using FlagEmbedding

Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:

```python
from FlagEmbedding import FlagReranker

pairs = [
  ('Première question', 'Ceci est un paragraphe pertinent.'),
  ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_XX"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

model = FlagReranker('antoinelouis/mono-xm')
model.model.set_default_language(language_code)  # Activate the language-specific adapters

scores = model.compute_score(pairs)
print(scores)
```

#### Using Transformers

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

pairs = [
  ('Première question', 'Ceci est un paragraphe pertinent.'),
  ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_XX"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm')
model.set_default_language(language_code)  # Activate the language-specific adapters

features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    scores = model(**features).logits
print(scores)
```
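Note that the Transformers snippet returns raw logits rather than scores between 0 and 1. Because the model is trained with a binary cross-entropy loss, applying a logistic sigmoid to each logit should recover a probability-like relevance score. The sketch below shows the mapping in plain Python; `logit_to_probability` is a hypothetical helper and the logit values are made up for illustration.

```python
import math
from typing import List


def logit_to_probability(logit: float) -> float:
    """Numerically stable logistic sigmoid: maps a raw logit to the (0, 1) range."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Made-up logits standing in for `model(**features).logits.squeeze(-1).tolist()`
raw_logits: List[float] = [4.2, -3.1]
probabilities = [logit_to_probability(x) for x in raw_logits]
print(probabilities)
```

When working directly with tensors, `torch.sigmoid(scores)` performs the same transformation in one call.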

***

## Evaluation

[to come...]

***

## Training

#### Data

We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains 
8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (q, p) pairs with a 1:3 positive-to-negative ratio 
(i.e., 250K query-positive pairs and 750K query-negative pairs).
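The exact sampling procedure is not described here, so the following is only one plausible way to build such a labeled set: draw a fixed share of positives and fill the rest with negatives. The function name `sample_training_pairs`, its parameters, and the toy data are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def sample_training_pairs(
    positives: List[Pair],
    negatives: List[Pair],
    num_pairs: int,
    positive_fraction: float = 0.25,
    seed: int = 42,
) -> List[Tuple[str, str, int]]:
    """Sample labeled (query, passage, label) triples with a fixed share of positives."""
    rng = random.Random(seed)
    num_pos = round(num_pairs * positive_fraction)
    num_neg = num_pairs - num_pos
    sampled_pos = rng.sample(positives, num_pos)  # without replacement
    sampled_neg = rng.sample(negatives, num_neg)
    labeled = [(q, p, 1) for q, p in sampled_pos] + [(q, p, 0) for q, p in sampled_neg]
    rng.shuffle(labeled)
    return labeled


# Toy stand-ins for query-positive pairs and BM25 negatives
toy_positives = [(f"query {i}", f"relevant passage {i}") for i in range(10)]
toy_negatives = [(f"query {i % 10}", f"irrelevant passage {i}") for i in range(30)]

train_pairs = sample_training_pairs(toy_positives, toy_negatives, num_pairs=8)
print(sum(label for _, _, label in train_pairs), "positives out of", len(train_pairs))
```

With `num_pairs=1_000_000` and the default `positive_fraction`, this would yield 250,000 positives and 750,000 negatives, matching the counts stated above.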

#### Implementation

The model is initialized from the [xmod-base](https://huggingface.co/facebook/xmod-base) checkpoint and optimized with the binary cross-entropy loss 
(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with 
a batch size of 32 and a peak learning rate of 2e-5, with warmup over the first 10% of training steps and linear scheduling. We set the maximum sequence 
length for the concatenated question-passage pairs to 512 tokens.
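To make the learning-rate schedule concrete, here is a small sketch of a linear warmup followed by linear decay. It assumes the rate decays to zero at the last step, that every epoch covers all 1M sampled pairs, and that no gradient accumulation is used; none of these details are stated above, and `lr_at_step` is a hypothetical helper rather than the training code.

```python
def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate with linear warmup to peak_lr, then linear decay to zero (assumed)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Step count derived from the stated setup under the assumptions above
num_pairs = 1_000_000
batch_size = 32
epochs = 5
steps_per_epoch = num_pairs // batch_size
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * 0.1)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps))
```

Under these assumptions, training would run for 156,250 optimizer steps, with the peak rate reached after 15,625 of them.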

***

## Citation

```bibtex
@article{louis2024modular,
  author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
  title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
  journal = {CoRR},
  volume = {abs/2402.15059},
  year = {2024},
  url = {https://arxiv.org/abs/2402.15059},
  doi = {10.48550/arXiv.2402.15059},
  eprinttype = {arXiv},
  eprint = {2402.15059},
}
```