metadata
license: apache-2.0
datasets:
  - sigridjineth/korean_nli_dataset_reranker_v0
language:
  - ko
base_model:
  - answerdotai/answerai-colbert-small-v1
tags:
  - colbert
  - korean

sigridjineth/colbert-small-korean-20241212

sigridjineth/colbert-small-korean-20241212 is a Korean multi-vector (ColBERT-style) reranker, fine-tuned from answerai-colbert-small-v1 using a recipe inspired by JaColBERTv2.5. It is intended for ranking Korean-language passages inside a retrieval pipeline.

Compared with the other ColBERT-based models tested (colbert-ir/colbertv2.0 and answerdotai/answerai-colbert-small-v1), sigridjineth/colbert-small-korean-20241212 scores highest on Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) at top_k=3 (MRR 0.5278, NDCG 0.5769), and highest on all three metrics at top_k=1 (0.3772). F1 at top_k=3 and top_k=5 is identical (0.3596) for all three models.

Model Comparison

The AutoRAG Benchmark serves as both the evaluation dataset and the toolkit for reporting these metrics.

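| Model | top_k | F1 | MRR | NDCG |
|---|---|---|---|---|
| colbert-ir/colbertv2.0 | 1 | 0.2456 | 0.2456 | 0.2456 |
| colbert-ir/colbertv2.0 | 3 | 0.3596 | 0.4459 | 0.5158 |
| colbert-ir/colbertv2.0 | 5 | 0.3596 | 0.4459 | 0.5158 |
| answerdotai/answerai-colbert-small-v1 | 1 | 0.2193 | 0.2193 | 0.2193 |
| answerdotai/answerai-colbert-small-v1 | 3 | 0.3596 | 0.4240 | 0.4992 |
| answerdotai/answerai-colbert-small-v1 | 5 | 0.3596 | 0.4240 | 0.4992 |
| sigridjineth/colbert-small-korean-20241212 | 1 | 0.3772 | 0.3772 | 0.3772 |
| sigridjineth/colbert-small-korean-20241212 | 3 | 0.3596 | 0.5278 | 0.5769 |
| sigridjineth/colbert-small-korean-20241212 | 5 | 0.3596 | 0.5278 | 0.5769 |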
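
For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute F1, MRR, and NDCG at a cutoff for a single query with binary relevance. It is an illustration under assumptions, not AutoRAG's implementation: the function names, the toy ids, and the binary-relevance simplification are ours, and AutoRAG may define or average these metrics differently.

```python
import math

def f1_at_k(ranked_ids, relevant_ids, k):
    """F1 of the top-k list against the set of relevant ids."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(relevant_ids)
    return 2 * precision * recall / (precision + recall)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant id within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranking = ["d7", "d2", "d9"]   # hypothetical reranker output, best first
relevant = {"d2"}              # hypothetical gold label for this query
for k in (1, 3):
    print(k, f1_at_k(ranking, relevant, k),
          mrr_at_k(ranking, relevant, k),
          ndcg_at_k(ranking, relevant, k))
```

Under these definitions, when each query has exactly one relevant passage, all three metrics at a cutoff of 1 reduce to the fraction of queries whose top result is correct, which is consistent with the identical values in the top_k = 1 rows.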

Usage

Installation

This model works with RAGatouille, Stanford ColBERT (colbert-ai), and rerankers. Install or upgrade whichever of these you plan to use:

pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade "rerankers[transformers]"
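
To confirm which of these packages ended up in the active environment, a small check with the standard library can help. This is a minimal sketch; the helper name `installed_versions` is ours, and the distribution names are taken from the pip commands above.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(["ragatouille", "colbert-ai", "rerankers"]).items():
    print(f"{name}: {version or 'not installed'}")
```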

Using rerankers

from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
# Korean sample passages: "This film was directed by Hayao Miyazaki..." and "Walt Disney is an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
ranked_docs = ranker.rank(query=query, docs=docs)
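
The object returned by `rank` can then be flattened for inspection. The sketch below assumes the result object exposes a `.results` list whose items carry `.rank`, `.score`, and `.document.text`, which matches the rerankers versions we are aware of but should be checked against the installed release. The helper name `to_rows` and the stand-in object with made-up scores are ours, so the snippet can be tried without downloading the model.

```python
from types import SimpleNamespace

def to_rows(ranked):
    """Flatten a rerankers-style result object into (rank, score, text) tuples."""
    return [(r.rank, r.score, r.document.text) for r in ranked.results]

# Stand-in with the assumed shape; with the real library, pass `ranked_docs` instead.
fake = SimpleNamespace(results=[
    SimpleNamespace(rank=1, score=0.91, document=SimpleNamespace(text="doc A")),
    SimpleNamespace(rank=2, score=0.12, document=SimpleNamespace(text="doc B")),
])

for rank, score, text in to_rows(fake):
    print(rank, score, text)
```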

Using RAGatouille

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']

RAG.index(docs, index_name="korean_cinema")

query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = RAG.search(query)
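
As far as we know, `RAG.search` returns a list of dictionaries, one per hit, with keys such as `content`, `score`, and `rank`. Treat those key names as an assumption to verify against your RAGatouille version. Under that assumption, a hypothetical helper like `top_contents` pulls out the best passages; the sample list below is a stand-in, not real model output.

```python
def top_contents(results, n=3):
    """Return the text of the n highest-scoring hits from a list of result dicts."""
    ordered = sorted(results, key=lambda hit: hit["score"], reverse=True)
    return [hit["content"] for hit in ordered[:n]]

# Stand-in for the list returned by RAG.search(query).
sample = [
    {"content": "passage A", "score": 12.5, "rank": 2},
    {"content": "passage B", "score": 18.0, "rank": 1},
]
print(top_contents(sample, n=1))
```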

Using Stanford ColBERT

Indexing:

from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)

indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config
)

docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
indexer.index(name=INDEX_NAME, collection=docs)

Querying:

from colbert import Searcher
from colbert.infra import ColBERTConfig

config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = searcher.search(query, k=10)
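
In Stanford ColBERT, `searcher.search` returns three parallel lists (passage ids, ranks, scores) rather than the passages themselves. The sketch below pairs them back up with the indexed texts. It assumes the passage ids are positions in the list that was indexed; the helper name `pair_hits` and the stand-in values are ours.

```python
def pair_hits(search_output, collection):
    """Turn (passage_ids, ranks, scores) into a list of (rank, score, passage text)."""
    passage_ids, ranks, scores = search_output
    return [(rank, score, collection[pid])
            for pid, rank, score in zip(passage_ids, ranks, scores)]

collection = ["passage zero", "passage one", "passage two"]   # stand-in for the indexed docs
fake_output = ([2, 0], [1, 2], [21.3, 17.8])                   # stand-in for searcher.search(...)

for rank, score, text in pair_hits(fake_output, collection):
    print(rank, score, text)
```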

Extracting Vectors:

from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
# Query: "Who were the voice actors in the English dub of Howl's Moving Castle?"
embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
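
The extracted query representation is one vector per query token rather than a single pooled vector. ColBERT-style models score a document with late interaction (MaxSim): each query token takes its best cosine match among the document tokens, and those maxima are summed. The pure-Python sketch below illustrates the scoring rule on toy 2-d vectors. The function names and the toy vectors are ours, and real use would operate on the tensors produced by the checkpoint.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """For each query token, take the best-matching document token similarity,
    then sum those maxima over all query tokens."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(sum(a * b for a, b in zip(qv, dv)) for dv in d) for qv in q)

query_vecs = [[1.0, 0.0], [0.0, 1.0]]            # toy token embeddings for a 2-token query
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # has a close match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]                # matches only the first query token

print(maxsim_score(query_vecs, doc_a), maxsim_score(query_vecs, doc_b))
```

Here `doc_a` scores higher than `doc_b` because every query token finds a close match in it, which is the behaviour a reranker relies on when ordering candidates.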

Referencing

If you use this model or other JaColBERTv2.5-based models, please cite:

@article{clavie2024jacolbertv2,
  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
  author={Clavi{\'e}, Benjamin},
  journal={arXiv preprint arXiv:2407.20750},
  year={2024}
}