LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
Abstract
Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly improves multilingual performance across various embedding tasks, particularly for medium- and low-resource languages, without requiring explicit multilingual training data.
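To make the connector idea concrete, here is a minimal PyTorch sketch of the architecture the abstract describes: a multilingual encoder feeding an LLM-based embedder through a small trainable bridge. This is not the paper's implementation; the linear-projection connector, mean pooling, the hidden dimensions (768 and 4096), and freezing both backbones are illustrative assumptions, and all names (`Connector`, `LusiferStyleEmbedder`, `enc_dim`, `llm_dim`) are hypothetical.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Minimal trainable bridge (assumed: a single linear projection)
    mapping multilingual-encoder states into the LLM's embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(enc_hidden)

class LusiferStyleEmbedder(nn.Module):
    """Hypothetical composition: frozen multilingual encoder + frozen
    LLM embedder, joined by the only trainable part, the connector."""
    def __init__(self, multilingual_encoder, llm_embedder,
                 enc_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.encoder = multilingual_encoder   # language-universal learner
        self.llm = llm_embedder               # embedding-optimized LLM
        self.connector = Connector(enc_dim, llm_dim)
        # Assumption: backbones stay frozen; only the connector trains.
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Encode text with the multilingual encoder.
        enc_out = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Project encoder states into the LLM's representation space
        # and feed them to the LLM as soft input embeddings.
        soft_tokens = self.connector(enc_out)
        llm_out = self.llm(inputs_embeds=soft_tokens,
                           attention_mask=attention_mask).last_hidden_state
        # Mean-pool final hidden states into one dense vector
        # (pooling choice is an assumption, not specified in the abstract).
        mask = attention_mask.unsqueeze(-1).float()
        return (llm_out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```

Under these assumptions, training only the small connector keeps adaptation cheap and leaves both pretrained backbones intact, which is consistent with the abstract's claim of zero-shot multilingual transfer without explicit multilingual training data.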
Community
The following papers, similar to this one, were recommended by the Semantic Scholar API (via the automated Librarian Bot):
- LLMs are Also Effective Embedding Models: An In-depth Overview (2024)
- LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Tasks (2024)
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images (2024)
- BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment (2024)
- MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost (2024)
- Break the ID-Language Barrier: An Adaption Framework for Sequential Recommendation (2024)
- A Practical Guide to Fine-tuning Language Models with Limited Data (2024)