---
license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
---

# sigridjineth/colbert-small-korean-20241212

`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector (late-interaction) reranker, fine-tuned from `answerdotai/answerai-colbert-small-v1` with a training recipe inspired by JaColBERTv2.5. It is intended to deliver strong retrieval quality on Korean-language content when used as the reranking stage of a retrieval pipeline.

Compared to the other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`), `sigridjineth/colbert-small-korean-20241212` is particularly strong at `top_k=3`, surpassing both in Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

## Model Comparison
The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.

| Model                                     | top_k | F1     | MRR     | NDCG    |
|-------------------------------------------|-------|--------|---------|---------|
| colbert-ir/colbertv2.0                    | 1     | 0.2456 | 0.2456  | 0.2456  |
|                                           | 3     | 0.3596 | 0.4459  | 0.5158  |
|                                           | 5     | 0.3596 | 0.4459  | 0.5158  |
| answerdotai/answerai-colbert-small-v1     | 1     | 0.2193 | 0.2193  | 0.2193  |
|                                           | 3     | 0.3596 | 0.4240  | 0.4992  |
|                                           | 5     | 0.3596 | 0.4240  | 0.4992  |
| sigridjineth/colbert-small-korean-20241212| 1     | 0.3772 | 0.3772  | 0.3772  |
|                                           | 3     | 0.3596 | **0.5278** | **0.5769** |
|                                           | 5     | 0.3596 | 0.5278  | 0.5769  |
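For reference, the two ranking metrics reported above can be sketched as follows. This is a generic illustration of the metric definitions, not the AutoRAG benchmark's implementation:

```python
import math

def mrr(first_relevant_ranks):
    """Mean Reciprocal Rank: mean of 1/rank of the first relevant hit, one rank per query."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def ndcg_at_k(relevances, k):
    """NDCG@k for one ranked list of graded relevance labels (in retrieved order)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

With binary relevance, a relevant document at rank 1 yields NDCG@k of 1.0, which is why all three metrics coincide at `top_k=1` in the table above.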

## Usage

### Installation

This model integrates seamlessly with the latest ColBERT implementations and related RAG libraries:

```bash
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade rerankers[transformers]
```

### Using rerankers

```python
from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
# Candidate passages (Korean): "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and..."
docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
# Query (Korean): "Who directed Spirited Away?"
query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
ranked_docs = ranker.rank(query=query, docs=docs)
```

### Using RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']

RAG.index(docs, index_name="korean_cinema")

query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
results = RAG.search(query)
```

### Using Stanford ColBERT

**Indexing:**
```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)

indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config
)

docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
indexer.index(name=INDEX_NAME, collection=docs)
```

**Querying:**
```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"  # must match the name used at indexing time
config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
results = searcher.search(query, k=10)
```

**Extracting Vectors:**
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
embedded_query = ckpt.queryFromText(["ν•˜μšΈμ˜ μ›€μ§μ΄λŠ” μ„± μ˜μ–΄ 더빙에 μ°Έμ—¬ν•œ μ„±μš°λŠ” λˆ„κ΅¬μΈκ°€?"], bsize=16)
```
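The token-level vectors extracted above are what ColBERT's late-interaction (MaxSim) scoring operates on. As an illustration only (not part of this model's API), here is a minimal NumPy sketch of MaxSim over two token-embedding matrices; the 96-dimensional output size is an assumption based on the base model family:

```python
import numpy as np

def maxsim_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """ColBERT late-interaction score: for each query token vector, take its
    maximum cosine similarity over all document token vectors, then sum."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy stand-ins for the query/document embeddings a checkpoint would return.
rng = np.random.default_rng(0)
q_emb = rng.standard_normal((32, 96))   # query tokens x embedding dim
d_emb = rng.standard_normal((180, 96))  # document tokens x embedding dim
score = maxsim_score(q_emb, d_emb)
```

Because each per-token maximum cosine similarity is at most 1, the score is bounded by the number of query tokens, which is why longer queries produce larger raw scores.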

## Referencing

If you use this model or other JaColBERTv2.5-based models, please cite:

```bibtex
@article{clavie2024jacolbertv2,
  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
  author={Clavi{\'e}, Benjamin},
  journal={arXiv preprint arXiv:2407.20750},
  year={2024}
}
```