---
license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
---

# sigridjineth/colbert-small-korean-20241212

`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector reranker fine-tuned from `answerdotai/answerai-colbert-small-v1` with a recipe inspired by JaColBERTv2.5. It is intended for ranking Korean-language passages inside a retrieval pipeline.

On the benchmark below, it outperforms the two other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`) in Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) at `top_k=3`, and in all three reported metrics at `top_k=1`. F1 at `top_k=3` and `top_k=5` is identical (0.3596) for all three models.

## Model Comparison

The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.

| Model                                      | top_k | F1     | MRR        | NDCG       |
|--------------------------------------------|-------|--------|------------|------------|
| colbert-ir/colbertv2.0                     | 1     | 0.2456 | 0.2456     | 0.2456     |
|                                            | 3     | 0.3596 | 0.4459     | 0.5158     |
|                                            | 5     | 0.3596 | 0.4459     | 0.5158     |
| answerdotai/answerai-colbert-small-v1      | 1     | 0.2193 | 0.2193     | 0.2193     |
|                                            | 3     | 0.3596 | 0.4240     | 0.4992     |
|                                            | 5     | 0.3596 | 0.4240     | 0.4992     |
| sigridjineth/colbert-small-korean-20241212 | 1     | 0.3772 | 0.3772     | 0.3772     |
|                                            | 3     | 0.3596 | **0.5278** | **0.5769** |
|                                            | 5     | 0.3596 | 0.5278     | 0.5769     |

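For readers less familiar with the two ranking metrics reported here, the following is a minimal sketch of how MRR and NDCG at a cutoff are commonly computed under binary relevance. It is not AutoRAG's implementation; the function names (`reciprocal_rank`, `ndcg`) and the toy document IDs are illustrative assumptions.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids, k):
    """Return 1/rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values):
    return sum(values) / len(values)


# Toy data (hypothetical): two queries, each with one relevant document.
runs = [
    (["d2", "d1", "d3"], {"d1"}),  # relevant document ranked second
    (["d5", "d6", "d4"], {"d5"}),  # relevant document ranked first
]

mrr_at_3 = mean([reciprocal_rank(ranked, rel, k=3) for ranked, rel in runs])
ndcg_at_3 = mean([ndcg(ranked, rel, k=3) for ranked, rel in runs])
print(f"MRR@3={mrr_at_3:.4f} NDCG@3={ndcg_at_3:.4f}")
```

Under these definitions, MRR and NDCG coincide at a cutoff of 1, since both reduce to whether the top-ranked document is relevant. That matches the identical MRR and NDCG values in the `top_k=1` rows of the table.
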
## Usage

### Installation

The model can be used with RAGatouille, the Stanford ColBERT library (`colbert-ai`), and `rerankers`. Install whichever of these you plan to use:

```bash
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
# Quote the extra so shells such as zsh do not expand the brackets
pip install --upgrade "rerankers[transformers]"
```

### Using rerankers

```python
from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')

# Candidate passages: "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
ranked_docs = ranker.rank(query=query, docs=docs)
```

### Using RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")

# Passages to index: "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']

RAG.index(docs, index_name="korean_cinema")

# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = RAG.search(query)
```

### Using Stanford ColBERT

**Indexing:**

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)

indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config
)

# Passages to index: "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
indexer.index(name=INDEX_NAME, collection=docs)
```

**Querying:**

```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

# Must match the index name used in the indexing step
INDEX_NAME = "KO_MOVIES_INDEX"

config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = searcher.search(query, k=10)
```

**Extracting Vectors:**

```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
# Query: "Which voice actors took part in the English dub of Howl's Moving Castle?"
embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
```

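ColBERT-style models produce one vector per token, and score a query against a document by late interaction (MaxSim): each query token vector is matched to its most similar document token vector, and those maxima are summed. The sketch below illustrates that scoring rule in plain Python. The tiny 2-dimensional vectors are made up for illustration and stand in for real embeddings, and `maxsim_score` is a hypothetical helper, not part of `colbert-ai`.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query token vectors, of the best dot product with any document token vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (illustrative values, not model output).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.2], [0.3, 0.1]]

scores = {
    "doc_a": maxsim_score(query_vecs, doc_a),
    "doc_b": maxsim_score(query_vecs, doc_b),
}
# Rank documents from highest to lowest MaxSim score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because every query token is matched independently, a document scores well only if it contains a good match for each part of the query, which is what distinguishes multi-vector scoring from comparing a single pooled vector per text.
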
## Referencing

If you use this model or other JaColBERTv2.5-based models, please cite:

```bibtex
@article{clavie2024jacolbertv2,
  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
  author={Clavi{\'e}, Benjamin},
  journal={arXiv preprint arXiv:2407.20750},
  year={2024}
}
```