license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
sigridjineth/colbert-small-korean-20241212
sigridjineth/colbert-small-korean-20241212 is a Korean multi-vector reranker model, fine-tuned from answerai-colbert-small-v1 using a recipe inspired by JaColBERTv2.5. It is intended for retrieval and reranking over Korean-language content as one stage of a retrieval pipeline.
Compared with the other ColBERT-based models tested (colbert-ir/colbertv2.0 and answerdotai/answerai-colbert-small-v1), sigridjineth/colbert-small-korean-20241212 scores highest at top_k=3 on both Mean Reciprocal Rank (MRR: 0.5278 versus 0.4459 and 0.4240) and Normalized Discounted Cumulative Gain (NDCG: 0.5769 versus 0.5158 and 0.4992).
Model Comparison
The AutoRAG Benchmark serves as both the evaluation dataset and the toolkit for reporting these metrics.
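| Model | top_k | F1 | MRR | NDCG |
|---|---|---|---|---|
| colbert-ir/colbertv2.0 | 1 | 0.2456 | 0.2456 | 0.2456 |
| colbert-ir/colbertv2.0 | 3 | 0.3596 | 0.4459 | 0.5158 |
| colbert-ir/colbertv2.0 | 5 | 0.3596 | 0.4459 | 0.5158 |
| answerdotai/answerai-colbert-small-v1 | 1 | 0.2193 | 0.2193 | 0.2193 |
| answerdotai/answerai-colbert-small-v1 | 3 | 0.3596 | 0.4240 | 0.4992 |
| answerdotai/answerai-colbert-small-v1 | 5 | 0.3596 | 0.4240 | 0.4992 |
| sigridjineth/colbert-small-korean-20241212 | 1 | 0.3772 | 0.3772 | 0.3772 |
| sigridjineth/colbert-small-korean-20241212 | 3 | 0.3596 | 0.5278 | 0.5769 |
| sigridjineth/colbert-small-korean-20241212 | 5 | 0.3596 | 0.5278 | 0.5769 |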
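For readers unfamiliar with the metric columns, the following is a minimal sketch of F1, reciprocal rank, and NDCG at a cutoff `k`, assuming binary relevance and unique document ids. The function names and the toy ids are illustrative, and AutoRAG's own implementations may differ in detail. The reported MRR is the mean of the per-query reciprocal rank.

```python
import math

def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of precision and recall over the top k results."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(relevant_ids)
    return 2 * precision * recall / (precision + recall)

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query (made-up ids): the single relevant passage is ranked second.
ranked = ["d7", "d2", "d9"]
relevant = {"d2"}
f1_3 = f1_at_k(ranked, relevant, 3)
rr_3 = reciprocal_rank_at_k(ranked, relevant, 3)
ndcg_3 = ndcg_at_k(ranked, relevant, 3)
print(f1_3, rr_3, ndcg_3)
```

Under these definitions, a query with exactly one relevant passage gives the same value for all three metrics at `k=1` (1 on a hit, 0 on a miss), which would be consistent with the identical columns in the top_k=1 rows.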
Usage
Installation
The model can be loaded with RAGatouille, the Stanford colbert-ai package, or rerankers. Install whichever of these you plan to use:
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade "rerankers[transformers]"
Using rerankers
from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')

# Candidate passages: "This film was directed by Hayao Miyazaki..." and
# "Walt Disney is an American director and..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자...']
# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명은 누가 감독했나요?'
ranked_docs = ranker.rank(query=query, docs=docs)
Using RAGatouille
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")

# Passages to index: "This film was directed by Hayao Miyazaki..." and
# "Walt Disney is an American director and..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자...']
RAG.index(docs, index_name="korean_cinema")

# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명은 누가 감독했나요?'
results = RAG.search(query)
Using Stanford ColBERT
Indexing:
from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)
indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config,
)

# Passages to index: "This film was directed by Hayao Miyazaki..." and
# "Walt Disney is an American director and..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자...']
indexer.index(name=INDEX_NAME, collection=docs)
Querying:
from colbert import Searcher
from colbert.infra import ColBERTConfig

config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명은 누가 감독했나요?'
results = searcher.search(query, k=10)
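`Searcher.search` returns three parallel lists (passage ids, ranks, scores) rather than a list of hits. The sketch below pairs them with the passage text, assuming the index was built from an in-memory list so that each passage id is a position in that list. `to_records` is a hypothetical helper, and the mock tuple stands in for a real search result.

```python
def to_records(results, collection):
    """Turn (passage_ids, ranks, scores) into one dict per hit."""
    passage_ids, ranks, scores = results
    return [
        {"rank": rank, "score": score, "passage_id": pid, "text": collection[pid]}
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]

# Stand-in for `searcher.search(query, k=2)`; ids and scores are made up.
mock_results = ([1, 0], [1, 2], [21.5, 17.0])
passages = ["passage zero", "passage one"]

records = to_records(mock_results, passages)
for record in records:
    print(record["rank"], record["score"], record["text"])
```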
Extracting Vectors:
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
# Query: "Which voice actors took part in the English dub of Howl's Moving Castle?"
embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
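The extracted vectors are per-token embeddings, which ColBERT-style models compare with late interaction (MaxSim): each query token takes its best cosine similarity against the document's tokens, and those maxima are summed. The sketch below shows that scoring rule on made-up 2-dimensional vectors in plain Python. Real embeddings come back as tensors with a much larger dimension, and `maxsim_score` is an illustrative name, not part of colbert-ai.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max cosine similarity to any doc token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy token embeddings (made up): two query tokens, two candidate documents.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # has an exact match for each query token
doc_b = [[1.0, 1.0]]              # one token, halfway between both query tokens

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)
```

Because every query token contributes its own maximum, a document that covers each query token well (`doc_a`) outscores one that matches all of them only partially (`doc_b`).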
Referencing
If you use this model or other JaColBERTv2.5-based models, please cite:
@article{clavie2024jacolbertv2,
title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
author={Clavi{\'e}, Benjamin},
journal={arXiv preprint arXiv:2407.20750},
year={2024}
}