+---
+pipeline_tag: sentence-similarity
+datasets:
+- ms_marco
+- sentence-transformers/msmarco-hard-negatives
+metrics:
+- recall
+tags:
+- feature-extraction
+- sentence-similarity
+library_name: sentence-transformers
+inference: false
+language:
+- multilingual
+- af
+- am
+- ar
+- az
+- be
+- bg
+- bn
+- ca
+- cs
+- cy
+- da
+- de
+- el
+- en
+- eo
+- es
+- et
+- eu
+- fa
+- fi
+- fr
+- ga
+- gl
+- gu
+- ha
+- he
+- hi
+- hr
+- hu
+- hy
+- id
+- is
+- it
+- ja
+- ka
+- kk
+- km
+- kn
+- ko
+- ku
+- ky
+- la
+- lo
+- lt
+- lv
+- mk
+- ml
+- mn
+- mr
+- ms
+- my
+- ne
+- nl
+- no
+- or
+- pa
+- pl
+- ps
+- pt
+- ro
+- ru
+- sa
+- si
+- sk
+- sl
+- so
+- sq
+- sr
+- sv
+- sw
+- ta
+- te
+- th
+- tl
+- tr
+- uk
+- ur
+- uz
+- vi
+- zh
+---
+<h1 align="center">Mono-XM</h1>
+<h4 align="center">
+  <p>
+      <a href="#usage">🛠️ Usage</a> |
+      <a href="#evaluation">📊 Evaluation</a> |
+      <a href="#training">🤖 Training</a> |
+      <a href="#citation">🔗 Citation</a> |
+      <a href="https://github.com/ant-louis/xm-retrievers">💻 Code</a>
+  </p>
+</h4>
+This is a [sentence-transformers](https://www.sbert.net/examples/applications/cross-encoder/README.html) cross-encoder model. It performs cross-attention over a question-passage
+pair and outputs a relevance score between 0 and 1. The model is intended as a reranker for semantic search: given a query, pair it with each candidate
+passage (e.g., retrieved with BM25 or a bi-encoder), score every pair, then sort the passages in decreasing order of predicted relevance.
+The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to learn from monolingual fine-tuning
+in a high-resource language, like English, and perform zero-shot transfer to other languages.
+## Usage
+Here are some examples of using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [Hugging Face Transformers](#using-transformers).
+#### Using Sentence-Transformers
+Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
+```python
+from sentence_transformers import CrossEncoder
+pairs = [
+  ('Première question', 'Ceci est un paragraphe pertinent.'),
+  ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
+]
+language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages
+model = CrossEncoder('antoinelouis/mono-xm')
+model.model.set_default_language(language_code) #Activate the language-specific adapters
+scores = model.predict(pairs)
+print(scores)
+```
+#### Using FlagEmbedding
+Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
+```python
+from FlagEmbedding import FlagReranker
+pairs = [
+  ('Première question', 'Ceci est un paragraphe pertinent.'),
+  ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
+]
+language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages
+model = FlagReranker('antoinelouis/mono-xm')
+model.model.set_default_language(language_code) #Activate the language-specific adapters
+scores = model.compute_score(pairs)
+print(scores)
+```
+#### Using Transformers
+Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+pairs = [
+  ('Première question', 'Ceci est un paragraphe pertinent.'),
+  ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
+]
+language_code = "fr_FR" #Find all codes here: https://huggingface.co/facebook/xmod-base#languages
+tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm')
+model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm')
+model.set_default_language(language_code) #Activate the language-specific adapters
+features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
+with torch.no_grad():
+    scores = model(**features).logits  # Raw logits; apply torch.sigmoid(scores) if you need values in (0, 1)
+print(scores)
+```
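The reranking workflow described in the introduction (score each query-passage pair, then sort in decreasing order of relevance) can be sketched as below. This is a minimal illustration, not part of the released code: `rerank`, `sigmoid`, and `toy_score_fn` are hypothetical names, and the toy scorer only stands in for the model so the snippet runs without downloading any weights. In practice, `score_fn` could be something like `model.predict` from the Sentence-Transformers example above.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]


def sigmoid(logit: float) -> float:
    """Map a raw logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Pair]], Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by decreasing score."""
    pairs = [(query, passage) for passage in passages]
    scores = [float(score) for score in score_fn(pairs)]
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


def toy_score_fn(pairs: List[Pair]) -> List[float]:
    """Hypothetical stand-in for the model: scores pairs by word overlap."""
    scores = []
    for question, passage in pairs:
        overlap = len(set(question.lower().split()) & set(passage.lower().split()))
        scores.append(sigmoid(overlap - 1.0))
    return scores


candidates = [
    "Bananas are yellow.",
    "France borders Spain.",
    "Paris is the capital of France.",
]
ranked = rerank("capital of france", candidates, toy_score_fn)
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```

Passing the scorer as a callable keeps the sorting logic independent of the library used to produce the scores.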
+***
+## Evaluation
+- **mMARCO**:
+We evaluate the model on the small development sets of [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco), which cover 14 languages and each consist of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its multilingual performance with other retrieval models on the dataset's official metric, i.e., mean reciprocal rank at cut-off 10 (MRR@10).
+|    | model                                                                                                                                   |          Type | #Samples | #Params |   en |   es |   fr |   it |   pt |   id |   de |   ru |   zh |   ja |   nl |   vi |   hi |   ar | Avg. |
+|---:|:----------------------------------------------------------------------------------------------------------------------------------------|:--------------|:--------:|:-------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|
+|  1 | BM25 ([Pyserini](https://github.com/castorini/pyserini))                                                                                |       lexical |        - |       - | 18.4 | 15.8 | 15.5 | 15.3 | 15.2 | 14.9 | 13.6 | 12.4 | 11.6 | 14.1 | 14.0 | 13.6 | 13.4 | 11.1 | 14.2 |
+|  2 | mono-mT5 ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897))                                                          | cross-encoder |    12.8M |    390M | 36.6 | 31.4 | 30.2 | 30.3 | 30.2 | 29.8 | 28.9 | 26.3 | 24.9 | 26.7 | 29.2 | 25.6 | 26.6 | 23.5 | 28.6 |
+|  3 | mono-mMiniLM ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897))                                                      | cross-encoder |    80.0M |    107M | 36.6 | 30.9 | 29.6 | 29.1 | 28.9 | 29.3 | 27.8 | 25.1 | 24.9 | 26.3 | 27.6 | 24.7 | 26.2 | 21.9 | 27.8 |
+|  4 | [DPR-X](https://huggingface.co/eugene-yang/dpr-xlmr-large-mtt-neuclir) ([Yang et al., 2022](https://doi.org/10.48550/arXiv.2204.11989)) | single-vector |    25.6M |    550M | 24.5 | 19.6 | 18.9 | 18.3 | 19.0 | 16.9 | 18.2 | 17.7 | 14.8 | 15.4 | 18.5 | 15.1 | 15.4 | 12.9 | 17.5 |
+|  5 | [mE5-base](https://huggingface.co/intfloat/multilingual-e5-base) ([Wang et al., 2024](https://doi.org/10.48550/arXiv.2402.05672))       | single-vector |     5.1B |    278M | 35.0 | 28.9 | 30.3 | 28.0 | 27.5 | 26.1 | 27.1 | 24.5 | 22.9 | 25.0 | 27.3 | 23.9 | 24.2 | 20.5 | 26.5 |
+|  6 | mColBERT ([Bonifacio et al., 2021](https://doi.org/10.48550/arXiv.2108.13897))                                                          |  multi-vector |    25.6M |    180M | 35.2 | 30.1 | 28.9 | 29.2 | 29.2 | 27.5 | 28.1 | 25.0 | 24.6 | 23.6 | 27.3 | 18.0 | 23.2 | 20.9 | 26.5 |
+|    |                                                                                                                                         |               |          |         |      |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
+|  7 | [DPR-XM](https://huggingface.co/antoinelouis/dpr-xm) (ours)                                                                             | single-vector |    25.6M |    277M | 32.7 | 23.6 | 23.5 | 22.3 | 22.7 | 22.0 | 22.1 | 19.9 | 18.1 | 18.7 | 22.9 | 18.0 | 16.0 | 15.1 | 21.3 |
+|  8 | [ColBERT-XM](https://huggingface.co/antoinelouis/colbert-xm) (ours)                                                                     |  multi-vector |     6.4M |    277M | 37.2 | 28.5 | 26.9 | 26.5 | 27.6 | 26.3 | 27.0 | 25.1 | 24.6 | 24.1 | 27.5 | 22.6 | 23.8 | 19.5 | 26.2 |
+|  9 | **Mono-XM** (ours)                                                                                                                      | cross-encoder |     1.0M |    277M |  |      |      |      |      |      |      |      |      |      |      |      |      |      |      |
+NB: Mono-XM is not evaluated against the entire corpus. Instead, for each query, it reranks a set of passages containing one or several positive passages and
+a maximum of 200 negative passages obtained with BM25.
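For reference, MRR@10 can be computed along the following lines. This is a minimal sketch rather than the official evaluation script: `mrr_at_k` is a hypothetical helper, the query and passage IDs are made up, and it averages over the queries present in `rankings`. The table above appears to report the metric as a percentage, i.e., multiplied by 100.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked_passage_ids in rankings.items():
        relevant = qrels.get(query_id, set())
        for rank, passage_id in enumerate(ranked_passage_ids[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # Only the first relevant hit counts
    return total / len(rankings)


# Toy data: q1 finds its positive at rank 2, q2 finds none in the top 10.
rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8"]}
qrels = {"q1": {"p1"}, "q2": {"p7"}}
score = mrr_at_k(rankings, qrels, k=10)
print(f"MRR@10 = {100 * score:.1f}")
```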
+***
+## Training
+#### Data
+We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains
+8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (q, p) pairs with a 1:3 positive-to-negative ratio
+(i.e., 250k query-positive pairs and 750k query-negative pairs).
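One way to obtain such a mix of positive and negative pairs is sketched below. The exact sampling procedure is not described here, so this is only an assumption: `build_pairs` is a hypothetical helper that draws one positive and three BM25 negatives per sampled query, and the toy dictionaries stand in for the real MS MARCO files.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str, int]  # (query_id, passage_id, label)


def build_pairs(
    positives: Dict[str, List[str]],
    negatives: Dict[str, List[str]],
    num_queries: int,
    negs_per_pos: int = 3,
    seed: int = 42,
) -> List[Example]:
    """Sample one positive and `negs_per_pos` negatives for each chosen query."""
    rng = random.Random(seed)
    # Keep only queries that have a positive and enough negatives.
    eligible = sorted(
        q for q in positives
        if positives[q] and len(negatives.get(q, [])) >= negs_per_pos
    )
    chosen = rng.sample(eligible, min(num_queries, len(eligible)))
    examples: List[Example] = []
    for query_id in chosen:
        examples.append((query_id, rng.choice(positives[query_id]), 1))
        for passage_id in rng.sample(negatives[query_id], negs_per_pos):
            examples.append((query_id, passage_id, 0))
    rng.shuffle(examples)
    return examples


# Toy data standing in for MS MARCO positives and BM25 negatives.
toy_positives = {f"q{i}": [f"pos{i}"] for i in range(10)}
toy_negatives = {f"q{i}": [f"neg{i}_{j}" for j in range(5)] for i in range(10)}
examples = build_pairs(toy_positives, toy_negatives, num_queries=4)
num_pos = sum(label for _, _, label in examples)
print(len(examples), num_pos, len(examples) - num_pos)
```

Under this assumption, sampling 250k queries would yield the 250k positive and 750k negative pairs mentioned above.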
+#### Implementation
+The model is initialized from the [xmod-base](https://huggingface.co/facebook/xmod-base) checkpoint and optimized via the binary cross-entropy loss
+(as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with
+a batch size of 32 and a peak learning rate of 2e-5, with warmup over the first 10\% of training steps and linear scheduling. We set the maximum sequence
+length for the concatenated question-passage pairs to 512 tokens.
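To make the schedule concrete, the sketch below works out the number of optimizer steps and the learning rate at a given step. It assumes 1M training pairs, no gradient accumulation, and a linear decay to zero after warmup; none of these details are confirmed beyond the hyperparameters listed above, and `total_training_steps` and `lr_at_step` are hypothetical helpers.

```python
import math


def total_training_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer steps, assuming one step per batch."""
    return math.ceil(num_samples / batch_size) * epochs


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 2e-5,
    warmup_ratio: float = 0.1,
) -> float:
    """Linear warmup to `peak_lr`, then linear decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


steps = total_training_steps(num_samples=1_000_000, batch_size=32, epochs=5)
warmup = int(steps * 0.1)
print(steps, warmup)  # Total steps and warmup steps under the assumptions above
for step in (0, warmup, steps // 2, steps):
    print(step, lr_at_step(step, steps))
```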
+***
+## Citation
+```bibtex
+@article{louis2024modular,
+  author = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
+  title = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
+  journal = {CoRR},
+  volume = {abs/2402.15059},
+  year = {2024},
+  url = {https://arxiv.org/abs/2402.15059},
+  doi = {10.48550/arXiv.2402.15059},
+  eprinttype = {arXiv},
+  eprint = {2402.15059},
+}
+```