---
license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
---
# sigridjineth/colbert-small-korean-20241212
# sigridjineth/colbert-small-korean-20241212
`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector reranker model, fine-tuned from `answerai-colbert-small-v1` using a recipe inspired by JaColBERTv2.5. The model aims to deliver effective retrieval on Korean-language content, achieving strong ranking metrics when integrated into a retrieval pipeline.
Compared with the other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`), `sigridjineth/colbert-small-korean-20241212` is particularly strong at `top_k=3`, surpassing both in Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).
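As a multi-vector (ColBERT-style) reranker, the model scores a query against a document with late interaction: every token gets its own embedding, and each query token is matched to its most similar document token (MaxSim), with the per-token maxima summed. A minimal NumPy sketch of that scoring rule, using toy unit vectors rather than the model's actual embeddings:

```python
import numpy as np

def maxsim_score(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one document.

    q_vecs: (num_query_tokens, dim) L2-normalized token embeddings
    d_vecs: (num_doc_tokens, dim) L2-normalized token embeddings
    """
    sim = q_vecs @ d_vecs.T              # cosine similarity for every token pair
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

# Toy example: 2 query tokens, 3 document tokens, dim 4
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.5, 0.0, 0.8660254]])
print(maxsim_score(q, d))  # 1.0 + 0.5 = 1.5
```

Because each query token keeps its own vector, a single off-topic document token cannot drag the score down the way it can with pooled single-vector embeddings.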
## Model Comparison
The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.
| Model | top_k | F1 | MRR | NDCG |
|-------------------------------------------|-------|--------|---------|---------|
| colbert-ir/colbertv2.0 | 1 | 0.2456 | 0.2456 | 0.2456 |
| | 3 | 0.3596 | 0.4459 | 0.5158 |
| | 5 | 0.3596 | 0.4459 | 0.5158 |
| answerdotai/answerai-colbert-small-v1     | 1     | 0.2193 | 0.2193  | 0.2193  |
| | 3 | 0.3596 | 0.4240 | 0.4992 |
| | 5 | 0.3596 | 0.4240 | 0.4992 |
| sigridjineth/colbert-small-korean-20241212| 1 | 0.3772 | 0.3772 | 0.3772 |
| | 3 | 0.3596 | **0.5278** | **0.5769** |
| | 5 | 0.3596 | 0.5278 | 0.5769 |
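For reference, MRR averages the reciprocal rank of the first relevant document within the top-k, and NDCG discounts relevant hits by the log of their rank. A hedged sketch of both metrics for the common single-relevant-document case (an illustration of the definitions, not the AutoRAG implementation itself):

```python
import math

def mrr_at_k(ranks, k):
    """Mean reciprocal rank. `ranks` holds the 1-based position of the first
    relevant document per query, or None when it is absent from the results."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """NDCG with one relevant document per query: DCG = 1/log2(rank + 1),
    and the ideal DCG is 1, so NDCG is the mean discounted gain."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r is not None and r <= k) / len(ranks)

# Toy example: 4 queries whose relevant doc appears at ranks 1, 2, 3, and never
ranks = [1, 2, 3, None]
print(round(mrr_at_k(ranks, 3), 4))   # (1 + 0.5 + 0.3333 + 0) / 4 = 0.4583
print(round(ndcg_at_k(ranks, 3), 4))  # (1 + 0.6309 + 0.5 + 0) / 4 = 0.5327
```

This also explains why the three models converge at `top_k=5` F1 while differing in MRR/NDCG: the same relevant documents are retrieved, but this model places them nearer the top.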
## Usage
### Installation
This model integrates seamlessly with the latest ColBERT implementations and related RAG libraries:
```bash
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade rerankers[transformers]
```
### Using rerankers
```python
from rerankers import Reranker
ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이었...']
query = '센과 치히로의 행방불명은 누가 감독했나요?'  # "Who directed Spirited Away?"
ranked_docs = ranker.rank(query=query, docs=docs)
```
### Using RAGatouille
```python
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이었...']
RAG.index(docs, index_name="korean_cinema")
query = '센과 치히로의 행방불명은 누가 감독했나요?'  # "Who directed Spirited Away?"
results = RAG.search(query)
```
### Using Stanford ColBERT
**Indexing:**
```python
from colbert import Indexer
from colbert.infra import ColBERTConfig
INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)
indexer = Indexer(
checkpoint="sigridjineth/colbert-small-korean-20241212",
config=config
)
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이었...']
indexer.index(name=INDEX_NAME, collection=docs)
```
**Querying:**
```python
from colbert import Searcher
from colbert.infra import ColBERTConfig
config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)
query = '센과 치히로의 행방불명은 누가 감독했나요?'  # "Who directed Spirited Away?"
results = searcher.search(query, k=10)  # returns (passage_ids, ranks, scores)
```
**Extracting Vectors:**
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig
ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
```
## Referencing
If you use this model or other JaColBERTv2.5-based models, please cite:
```bibtex
@article{clavie2024jacolbertv2,
title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
author={Clavi{\'e}, Benjamin},
journal={arXiv preprint arXiv:2407.20750},
year={2024}
}
```