---
license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
---

# sigridjineth/colbert-small-korean-20241212

`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector (late-interaction) reranker, fine-tuned from `answerdotai/answerai-colbert-small-v1` with a training recipe inspired by JaColBERTv2.5. It is intended to deliver strong retrieval quality on Korean-language content when used as the reranking stage of a retrieval pipeline.

Compared to the other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`), `sigridjineth/colbert-small-korean-20241212` is particularly strong at `top_k=3`, surpassing both in Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

## Model Comparison
The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.

| Model                                     | top_k | F1     | MRR     | NDCG    |
|-------------------------------------------|-------|--------|---------|---------|
| colbert-ir/colbertv2.0                    | 1     | 0.2456 | 0.2456  | 0.2456  |
|                                           | 3     | 0.3596 | 0.4459  | 0.5158  |
|                                           | 5     | 0.3596 | 0.4459  | 0.5158  |
| answerdotai/answerai-colbert-small-v1     | 1     | 0.2193 | 0.2193  | 0.2193  |
|                                           | 3     | 0.3596 | 0.4240  | 0.4992  |
|                                           | 5     | 0.3596 | 0.4240  | 0.4992  |
| sigridjineth/colbert-small-korean-20241212| 1     | 0.3772 | 0.3772  | 0.3772  |
|                                           | 3     | 0.3596 | **0.5278** | **0.5769** |
|                                           | 5     | 0.3596 | 0.5278  | 0.5769  |
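For reference, the two ranking metrics reported above can be sketched as follows. This is a generic illustration of the metric definitions, not the AutoRAG benchmark's implementation:

```python
import math

def mrr(first_relevant_ranks):
    """Mean Reciprocal Rank: mean of 1/rank of the first relevant hit, one rank per query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def ndcg_at_k(relevances, k):
    """NDCG@k for one ranked list of graded relevance labels (in retrieved order)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

With binary relevance, a relevant document at rank 1 yields NDCG@k of 1.0, which is why all three metrics coincide at `top_k=1` in the table above.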

## Usage

### Installation

This model integrates seamlessly with the latest ColBERT implementations and related RAG libraries:

```bash
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade rerankers[transformers]
```

### Using rerankers

```python
from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
# Candidate passages (Korean): "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and..."
docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
# Query (Korean): "Who directed Spirited Away?"
query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
ranked_docs = ranker.rank(query=query, docs=docs)
```

### Using RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']

RAG.index(docs, index_name="korean_cinema")

query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
results = RAG.search(query)
```

### Using Stanford ColBERT

**Indexing:**
```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)

indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config
)

docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
indexer.index(name=INDEX_NAME, collection=docs)
```

**Querying:**
```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"  # must match the name used at indexing time
config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
results = searcher.search(query, k=10)
```

**Extracting Vectors:**
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
embedded_query = ckpt.queryFromText(["ν•˜μšΈμ˜ μ›€μ§μ΄λŠ” μ„± μ˜μ–΄ 더빙에 μ°Έμ—¬ν•œ μ„±μš°λŠ” λˆ„κ΅¬μΈκ°€?"], bsize=16)
```
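The token-level vectors extracted above are what ColBERT's late-interaction (MaxSim) scoring operates on. As an illustration only (not part of this model's API), here is a minimal NumPy sketch of MaxSim over two token-embedding matrices; the 96-dimensional output size is an assumption based on the base model family:

```python
import numpy as np

def maxsim_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token vector, take its
    maximum cosine similarity over all document token vectors, then sum."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy stand-ins for the query/document embeddings a checkpoint would return.
rng = np.random.default_rng(0)
q_emb = rng.standard_normal((32, 96))   # query tokens x embedding dim
d_emb = rng.standard_normal((180, 96))  # document tokens x embedding dim
score = maxsim_score(q_emb, d_emb)
```

Because each per-token maximum cosine similarity is at most 1, the score is bounded by the number of query tokens, which is why longer queries produce larger raw scores.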

## Referencing

If you use this model or other JaColBERTv2.5-based models, please cite:

```bibtex
@article{clavie2024jacolbertv2,
  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
  author={Clavi{\'e}, Benjamin},
  journal={arXiv preprint arXiv:2407.20750},
  year={2024}
}
```