---
license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
---
# sigridjineth/colbert-small-korean-20241212
`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector (ColBERT-style) reranker, fine-tuned from `answerai-colbert-small-v1` using a recipe inspired by JaColBERTv2.5. It is intended for ranking Korean-language passages inside a retrieval pipeline.
Compared with the other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`), `sigridjineth/colbert-small-korean-20241212` scores highest on every metric at `top_k=1` and on Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) at `top_k=3`. F1 at `top_k=3` and `top_k=5` is identical across all three models.
## Model Comparison
The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.
| Model | top_k | F1 | MRR | NDCG |
|-------------------------------------------|-------|--------|---------|---------|
| colbert-ir/colbertv2.0 | 1 | 0.2456 | 0.2456 | 0.2456 |
| | 3 | 0.3596 | 0.4459 | 0.5158 |
| | 5 | 0.3596 | 0.4459 | 0.5158 |
| answerdotai/answerai-colbert-small-v1 | 1 | 0.2193 | 0.2193 | 0.2193 |
| | 3 | 0.3596 | 0.4240 | 0.4992 |
| | 5 | 0.3596 | 0.4240 | 0.4992 |
| sigridjineth/colbert-small-korean-20241212| 1 | 0.3772 | 0.3772 | 0.3772 |
| | 3 | 0.3596 | **0.5278** | **0.5769** |
| | 5 | 0.3596 | 0.5278 | 0.5769 |
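
To make the columns easier to read, here is a minimal sketch of how F1, MRR, and NDCG at a cutoff `k` can be computed for a single query. It assumes binary relevance and uses a hypothetical helper name (`metrics_at_k`); the AutoRAG toolkit's own implementation may differ in details, so treat this as an illustration rather than a reproduction of the benchmark.

```python
import math

def metrics_at_k(ranked_ids, relevant_ids, k):
    """Return (F1, reciprocal rank, NDCG) at cutoff k for one query.

    Assumes binary relevance: a passage is either relevant or not.
    """
    top = ranked_ids[:k]
    hits = [1 if doc_id in relevant_ids else 0 for doc_id in top]

    precision = sum(hits) / len(top) if top else 0.0
    recall = sum(hits) / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Reciprocal rank of the first relevant passage (0 if none in the top k).
    rr = next((1.0 / (i + 1) for i, h in enumerate(hits) if h), 0.0)

    # DCG with a log2 discount, normalised by the best achievable DCG at k.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    ndcg = dcg / ideal if ideal else 0.0
    return f1, rr, ndcg

# One relevant passage ("d2"), ranked second out of three.
f1, rr, ndcg = metrics_at_k(["d1", "d2", "d3"], {"d2"}, k=3)
print(f"F1={f1:.4f} MRR={rr:.4f} NDCG={ndcg:.4f}")
```

Under these assumptions, with one relevant passage and `k=1`, all three metrics reduce to "was the top result correct?", which is consistent with the identical F1, MRR, and NDCG values in each `top_k=1` row.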
## Usage
### Installation
The model can be used with RAGatouille, Stanford's `colbert-ai`, and `rerankers`. Install whichever of these you plan to use:
```bash
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade "rerankers[transformers]"
```
### Using rerankers
```python
from rerankers import Reranker

# Load the checkpoint as a ColBERT-style (late-interaction) reranker.
ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
# "This film was directed by Hayao Miyazaki...", "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
# "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
ranked_docs = ranker.rank(query=query, docs=docs)
```
### Using RAGatouille
```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
# "This film was directed by Hayao Miyazaki...", "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
# Build an index over the documents, then search it.
RAG.index(docs, index_name="korean_cinema")
# "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = RAG.search(query)
```
### Using Stanford ColBERT
**Indexing:**
```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)
indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config,
)
# "This film was directed by Hayao Miyazaki...", "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
indexer.index(name=INDEX_NAME, collection=docs)
```
**Querying:**
```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

# INDEX_NAME is the name used in the indexing step.
config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)
# "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = searcher.search(query, k=10)
```
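
As a rough sketch of what to do with the search output: in Stanford ColBERT, `searcher.search` returns three parallel lists (passage ids, ranks, scores), and when the collection is a Python list the passage ids are positions in that list. The snippet below uses a stand-in collection and a made-up `results` tuple so it runs without an index; `attach_passages` is a hypothetical helper name, and the scores are placeholders rather than real model output.

```python
# Stand-in collection; in practice this is the list passed to indexer.index().
docs = ["passage 0 text", "passage 1 text", "passage 2 text"]

# Placeholder for the value returned by searcher.search(query, k=3):
# (passage ids, ranks, scores). These numbers are invented for illustration.
results = ([2, 0, 1], [1, 2, 3], [21.4, 17.9, 9.3])

def attach_passages(results, collection):
    """Pair each (pid, rank, score) triple with the passage text it refers to."""
    return [
        {"rank": rank, "pid": pid, "score": score, "text": collection[pid]}
        for pid, rank, score in zip(*results)
    ]

hits = attach_passages(results, docs)
for hit in hits:
    print(f"{hit['rank']}. [{hit['score']:.1f}] {hit['text']}")
```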
**Extracting Vectors:**
```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
# "Who were the voice actors in the English dub of Howl's Moving Castle?"
embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
```
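
For readers who want to see what a multi-vector model does with such per-token vectors, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring in NumPy. Random unit vectors stand in for real query and document embeddings, and the token counts, the dimension, and the helper names (`maxsim_score`, `l2_normalize`) are assumptions chosen for illustration, not properties of this checkpoint.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim_score(query_vecs, doc_vecs):
    """For each query token, take its best-matching document token, then sum."""
    sim = query_vecs @ doc_vecs.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
dim = 96  # illustrative embedding size
query = l2_normalize(rng.normal(size=(32, dim)))
doc_a = l2_normalize(rng.normal(size=(40, dim)))
# doc_b is doc_a plus 8 token vectors copied from the query,
# so it matches the query at least as well as doc_a does.
doc_b = np.vstack([doc_a, query[:8]])

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

Because each query token only contributes its single best match, a document is rewarded for covering many query tokens well rather than for overall similarity to one pooled vector.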
## Referencing
If you use this model or other JaColBERTv2.5-based models, please cite:
```bibtex
@article{clavie2024jacolbertv2,
title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
author={Clavi{\'e}, Benjamin},
journal={arXiv preprint arXiv:2407.20750},
year={2024}
}
```