---
pipeline_tag: text-classification
datasets:
- ms_marco
- sentence-transformers/msmarco-hard-negatives
metrics:
- recall
tags:
- passage-reranking
library_name: sentence-transformers
base_model: facebook/xmod-base
inference: false
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- si
- sk
- sl
- so
- sq
- sr
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- uz
- vi
- zh
---
🛠️ Usage | 📊 Evaluation | 🤖 Training | 🔗 Citation | 💻 Code
This is a **multilingual** cross-encoder model. It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model should be used as a reranker for semantic search: given a query, encode it together with some candidate passages -- e.g., retrieved with BM25 or a bi-encoder -- then sort the passages in decreasing order of relevance according to the model's predictions. The model uses an [XMOD](https://huggingface.co/facebook/xmod-base) backbone, which allows it to be fine-tuned monolingually in a high-resource language such as English and to perform zero-shot transfer to other languages.

## Usage

Here are some examples for using the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-huggingface-transformers).

#### Using Sentence-Transformers

Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:

```python
from sentence_transformers import CrossEncoder

pairs = [
    ('Première question', 'Ceci est un paragraphe pertinent.'),
    ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

model = CrossEncoder('antoinelouis/mono-xm')
model.model.set_default_language(language_code)  # Activate the language-specific adapters

scores = model.predict(pairs)
print(scores)
```

#### Using FlagEmbedding

Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:

```python
from FlagEmbedding import FlagReranker

pairs = [
    ('Première question', 'Ceci est un paragraphe pertinent.'),
    ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

model = FlagReranker('antoinelouis/mono-xm')
model.model.set_default_language(language_code)  # Activate the language-specific adapters

scores = model.compute_score(pairs)
print(scores)
```

#### Using HuggingFace Transformers

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

pairs = [
    ('Première question', 'Ceci est un paragraphe pertinent.'),
    ('Voici une autre requête', 'Et voilà un paragraphe non pertinent.'),
]
language_code = "fr_FR"  # Find all codes here: https://huggingface.co/facebook/xmod-base#languages

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/mono-xm')
model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/mono-xm')
model.set_default_language(language_code)  # Activate the language-specific adapters

features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    scores = model(**features).logits
print(scores)
```

***

## Evaluation

[to come...]

***

## Training

#### Data

We use the English training samples from the [MS MARCO passage ranking](https://ir-datasets.com/msmarco-passage.html#msmarco-passage/train) dataset, which contains 8.8M passages and 539K training queries. We use the BM25 negatives provided by the official dataset and sample 1M (query, passage) pairs, one in four of which is positive (i.e., 250k query-positive pairs and 750k query-negative pairs).
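For illustration, here is a minimal sketch of how such pairs could be assembled, assuming hypothetical `queries`, `positives`, and `bm25_negatives` mappings already loaded from MS MARCO (these names are placeholders, not part of the released training code):

```python
import random
from sentence_transformers import InputExample

def build_training_pairs(queries, positives, bm25_negatives, negatives_per_query=3, seed=42):
    """Pair each query with its relevant passage and sampled BM25 negatives.

    queries:        dict mapping query id -> query text (hypothetical input)
    positives:      dict mapping query id -> one relevant passage
    bm25_negatives: dict mapping query id -> list of BM25-retrieved non-relevant passages
    """
    random.seed(seed)
    samples = []
    for qid, query in queries.items():
        # One positive pair (label 1) per query...
        samples.append(InputExample(texts=[query, positives[qid]], label=1.0))
        # ...and three negative pairs (label 0), so positives make up one quarter of all
        # pairs (250k positive vs. 750k negative pairs when scaled to 1M samples).
        for passage in random.sample(bm25_negatives[qid], k=negatives_per_query):
            samples.append(InputExample(texts=[query, passage], label=0.0))
    return samples
```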
#### Implementation

The model is initialized from the [xmod-base](https://huggingface.co/facebook/xmod-base) checkpoint and optimized via the binary cross-entropy loss (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 5 epochs using the AdamW optimizer with a batch size of 32 and a peak learning rate of 2e-5, warmed up over the first 10% of training steps and then decayed linearly. We set the maximum sequence length of the concatenated question-passage pairs to 512 tokens. A simplified fine-tuning sketch is given after the citation below.

***

## Citation

```bibtex
@article{louis2024modular,
  author     = {Louis, Antoine and Saxena, Vageesh and van Dijck, Gijs and Spanakis, Gerasimos},
  title      = {ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval},
  journal    = {CoRR},
  volume     = {abs/2402.15059},
  year       = {2024},
  url        = {https://arxiv.org/abs/2402.15059},
  doi        = {10.48550/arXiv.2402.15059},
  eprinttype = {arXiv},
  eprint     = {2402.15059},
}
```
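The sketch below approximates the setup described in the Implementation section with [Sentence-Transformers](https://www.sbert.net): the training samples and output path are placeholders, and the script is a simplified illustration rather than the exact code used to train this checkpoint.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Hypothetical training pairs: replace with the 1M MS MARCO pairs described above.
train_samples = [
    InputExample(texts=['what is a cross-encoder?', 'A cross-encoder jointly encodes a query-passage pair.'], label=1.0),
    InputExample(texts=['what is a cross-encoder?', 'The Eiffel Tower is located in Paris.'], label=0.0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)

# Initialize from the XMOD backbone with a single-logit classification head.
model = CrossEncoder('facebook/xmod-base', num_labels=1, max_length=512)
model.model.set_default_language('en_XX')  # Fine-tune with the English adapters activated

num_epochs = 5
warmup_steps = int(0.1 * num_epochs * len(train_dataloader))  # Warm up over the first 10% of steps

# With num_labels=1, CrossEncoder.fit uses a binary cross-entropy loss by default,
# the AdamW optimizer, and a linear learning-rate schedule after warm-up.
model.fit(
    train_dataloader=train_dataloader,
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    optimizer_params={'lr': 2e-5},
    output_path='output/mono-xm',  # Placeholder output directory
)
```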