---
license: apache-2.0
datasets:
- sigridjineth/korean_nli_dataset_reranker_v0
language:
- ko
base_model:
- answerdotai/answerai-colbert-small-v1
tags:
- colbert
- korean
---

# sigridjineth/colbert-small-korean-20241212

`sigridjineth/colbert-small-korean-20241212` is a Korean multi-vector reranker model, fine-tuned from `answerai-colbert-small-v1` using a recipe inspired by JaColBERTv2.5. It aims to deliver effective retrieval on Korean-language content and to score well on ranking metrics when integrated into a retrieval pipeline.

Compared with the other ColBERT-based models tested (`colbert-ir/colbertv2.0` and `answerdotai/answerai-colbert-small-v1`), `sigridjineth/colbert-small-korean-20241212` is particularly strong at `top_k=3`, where it achieves the highest Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

## Model Comparison

The [AutoRAG Benchmark](https://github.com/Marker-Inc-Korea/AutoRAG-example-korean-embedding-benchmark) serves as both the evaluation dataset and the toolkit for reporting these metrics.

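
For readers unfamiliar with the metrics reported in this section, the sketch below shows one common way to compute F1, MRR, and NDCG at a cutoff `k` with binary relevance. It uses textbook definitions and is not taken from AutoRAG, whose implementation may differ in detail. The helper names (`f1_at_k`, `reciprocal_rank`, `ndcg_at_k`, `mean_over_queries`) and the document IDs are hypothetical and exist only for illustration.

```python
import math
from typing import Callable, Dict, List, Set


def f1_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k for one query."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant document in the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_over_queries(
    metric: Callable[[List[str], Set[str], int], float],
    rankings: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over every query that has gold labels."""
    return sum(metric(rankings[q], gold[q], k) for q in gold) / len(gold)


# Hypothetical reranker output and gold labels for two queries.
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4", "d6"]}
gold = {"q1": {"d1"}, "q2": {"d5"}}

print("MRR@3 :", mean_over_queries(reciprocal_rank, rankings, gold, k=3))  # 0.75
print("NDCG@3:", mean_over_queries(ndcg_at_k, rankings, gold, k=3))
print("F1@3  :", mean_over_queries(f1_at_k, rankings, gold, k=3))
```

Under these definitions, when each query has exactly one relevant document, all three metrics at `k=1` reduce to the fraction of queries whose top result is relevant.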

| Model                                       | top_k | F1     | MRR        | NDCG       |
|---------------------------------------------|-------|--------|------------|------------|
| colbert-ir/colbertv2.0                      | 1     | 0.2456 | 0.2456     | 0.2456     |
|                                             | 3     | 0.3596 | 0.4459     | 0.5158     |
|                                             | 5     | 0.3596 | 0.4459     | 0.5158     |
| answerdotai/answerai-colbert-small-v1       | 1     | 0.2193 | 0.2193     | 0.2193     |
|                                             | 3     | 0.3596 | 0.4240     | 0.4992     |
|                                             | 5     | 0.3596 | 0.4240     | 0.4992     |
| sigridjineth/colbert-small-korean-20241212  | 1     | 0.3772 | 0.3772     | 0.3772     |
|                                             | 3     | 0.3596 | **0.5278** | **0.5769** |
|                                             | 5     | 0.3596 | 0.5278     | 0.5769     |

## Usage

### Installation

The model works with current ColBERT implementations and related RAG libraries:

```bash
pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade "rerankers[transformers]"
```

### Using rerankers

```python
from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')

# Candidate passages: "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'

ranked_docs = ranker.rank(query=query, docs=docs)
```

### Using RAGatouille

```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")

# Same example passages as above.
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
RAG.index(docs, index_name="korean_cinema")

# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = RAG.search(query)
```

### Using Stanford ColBERT

**Indexing:**

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"

config = ColBERTConfig(doc_maxlen=512, nbits=2)
indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config
)

# Passages: "This film was directed by Hayao Miyazaki...",
# "Walt Disney was an American director and ..."
docs = ['이 영화는 미야자키 하야오가 감독하였습니다...', '월트 디즈니는 미국의 감독이자 ...']
indexer.index(name=INDEX_NAME, collection=docs)
```

**Querying:**

```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

# INDEX_NAME is the index name defined in the indexing step above.
config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

# Query: "Who directed Spirited Away?"
query = '센과 치히로의 행방불명을 누가 감독했나요?'
results = searcher.search(query, k=10)
```

**Extracting Vectors:**

```python
from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())

# Query: "Which voice actors took part in the English dub of Howl's Moving Castle?"
embedded_query = ckpt.queryFromText(["하울의 움직이는 성 영어 더빙에 참여한 성우는 누구인가?"], bsize=16)
```

## Referencing

If you use this model or other JaColBERTv2.5-based models, please cite:

```bibtex
@article{clavie2024jacolbertv2,
  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
  author={Clavi{\'e}, Benjamin},
  journal={arXiv preprint arXiv:2407.20750},
  year={2024}
}
```