	---
	license: apache-2.0
	datasets:
	- sigridjineth/korean_nli_dataset_reranker_v0
	language:
	- ko
	base_model:
	- answerdotai/answerai-colbert-small-v1
	tags:
	- colbert
	- korean
	---

	# sigridjineth/colbert-small-korean-20241212

	`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector reranker model, fine-tuned from `answerdotai/answerai-colbert-small-v1` using a recipe inspired by JaColBERTv2.5. It is intended for ranking Korean-language passages as part of a retrieval pipeline.

	Compared to the other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`), `sigridjineth/colbert-small-korean-20241212` achieves the highest Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) at every `top_k` reported below. At `top_k=3`, for example, its MRR is 0.5278, against 0.4459 and 0.4240 for the other two models.

	## Model Comparison
	The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.

	| Model                                      | top_k | F1     | MRR    | NDCG   |
	|--------------------------------------------|-------|--------|--------|--------|
	| colbert-ir/colbertv2.0                     | 1     | 0.2456 | 0.2456 | 0.2456 |
	|                                            | 3     | 0.3596 | 0.4459 | 0.5158 |
	|                                            | 5     | 0.3596 | 0.4459 | 0.5158 |
	| answerdotai/answerai-colbert-small-v1      | 1     | 0.2193 | 0.2193 | 0.2193 |
	|                                            | 3     | 0.3596 | 0.4240 | 0.4992 |
	|                                            | 5     | 0.3596 | 0.4240 | 0.4992 |
	| sigridjineth/colbert-small-korean-20241212 | 1     | 0.3772 | 0.3772 | 0.3772 |
	|                                            | 3     | 0.3596 | 0.5278 | 0.5769 |
	|                                            | 5     | 0.3596 | 0.5278 | 0.5769 |
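
For readers unfamiliar with the ranking metrics in the table, here is a minimal sketch of the reciprocal rank and binary-relevance NDCG at a cutoff `k`. The function names and the toy document ids are illustrative assumptions, and AutoRAG's own implementation may differ in details such as graded relevance.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids, k):
    """Return 1/rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (hypothetical ids): the only relevant document is ranked second.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
rr_value = reciprocal_rank(ranked, relevant, k=3)
ndcg_value = ndcg(ranked, relevant, k=3)
# MRR is the mean of reciprocal_rank over all evaluation queries.
```

In this toy case the reciprocal rank is 0.5 and NDCG@3 is 1/log2(3), roughly 0.63.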

	## Usage

	### Installation

	This model can be used with RAGatouille, the Stanford ColBERT library (`colbert-ai`), and `rerankers`. Install whichever of these you need:

	```bash
	pip install --upgrade ragatouille
	pip install --upgrade colbert-ai
	pip install --upgrade "rerankers[transformers]"
	```

	### Using rerankers

	```python
	from rerankers import Reranker

	ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
	docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
	query = '센과 치히로의 행방불명을 누가 감독했나요?'
	ranked_docs = ranker.rank(query=query, docs=docs)
	```

	### Using RAGatouille

	```python
	from ragatouille import RAGPretrainedModel

	RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
	docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']

	RAG.index(docs, index_name="korean_cinema")

	query = '센과 치히로의 행방불명을 누가 감독했나요?'
	results = RAG.search(query)
	```

	### Using Stanford ColBERT

	Indexing:
	```python
	from colbert import Indexer
	from colbert.infra import ColBERTConfig

	INDEX_NAME = "KO_MOVIES_INDEX"
	config = ColBERTConfig(doc_maxlen=512, nbits=2)

	indexer = Indexer(
	    checkpoint="sigridjineth/colbert-small-korean-20241212",
	    config=config,
	)

	docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
	indexer.index(name=INDEX_NAME, collection=docs)
	```

	Querying:
	```python
	from colbert import Searcher
	from colbert.infra import ColBERTConfig

	INDEX_NAME = "KO_MOVIES_INDEX"  # must match the name used when indexing
	config = ColBERTConfig(query_maxlen=32)
	searcher = Searcher(index=INDEX_NAME, config=config)

	query = '센과 치히로의 행방불명을 누가 감독했나요?'
	results = searcher.search(query, k=10)
	```
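
The sketch below shows one way to turn search output into readable hits. It assumes that `searcher.search` returns three parallel lists (passage ids, ranks, scores), as in recent `colbert-ai` releases, and that passage ids are positions in the indexed collection. The tuple and the placeholder documents here are hypothetical stand-ins, not real model output.

```python
# Placeholder collection, in the same order as it was indexed.
docs = ["doc about Miyazaki", "doc about Disney"]

# Hypothetical stand-in for the value returned by searcher.search(query, k=2).
results = ([0, 1], [1, 2], [27.5, 19.25])

passage_ids, ranks, scores = results
hits = [
    {"rank": rank, "score": score, "text": docs[pid]}
    for pid, rank, score in zip(passage_ids, ranks, scores)
]

# Print one line per hit: rank, score, and passage text.
for hit in hits:
    print(f"{hit['rank']}. ({hit['score']:.2f}) {hit['text']}")
```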

	Extracting Vectors:
	```python
	from colbert.modeling.checkpoint import Checkpoint
	from colbert.infra import ColBERTConfig

	ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
	embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
	```
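
ColBERT-style models score a query against a document with late interaction (MaxSim): each query token vector is matched to its most similar document token vector, and those maxima are summed. The following is a minimal sketch of that scoring rule using plain Python lists instead of the tensors the checkpoint returns. The function name `maxsim_score` and the toy 2-dimensional vectors are assumptions made for illustration.

```python
def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the max dot product with any document token."""
    total = 0.0
    for q in query_vectors:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vectors)
    return total

# Toy token vectors (hypothetical values, not model output).
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # contains a close match per query token
doc_b = [[0.5, 0.5]]                          # only a partial match

score_a = maxsim_score(query_vectors, doc_a)
score_b = maxsim_score(query_vectors, doc_b)
```

Here `doc_a` scores higher than `doc_b` because every query token finds an exact match among its token vectors.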

	## Referencing

	If you use this model or other JaColBERTv2.5-based models, please cite:

	```bibtex
	@article{clavie2024jacolbertv2,
	  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
	  author={Clavi{\'e}, Benjamin},
	  journal={arXiv preprint arXiv:2407.20750},
	  year={2024}
	}
	```