yjoonjang committed (verified)
Commit 62de5e3 · Parent(s): 07fe920

Update README.md

Files changed (1): README.md (+118, -13)
@@ -6,12 +6,19 @@ tags:
  - generated_from_trainer
  - dataset_size:1879136
  - loss:CachedGISTEmbedLoss
-
+ license: mit
+ metrics:
+ - recall
+ - precision
+ - f1
+ base_model:
+ - BAAI/bge-m3
+ library_name: sentence-transformers
  ---

  # 🔎 KURE-v1

- Introducing Korea University Retrieval Embedding model, KURE-v1: a model with advanced retrieval abilities.
+ Introducing the Korea University Retrieval Embedding model, KURE-v1.
  It has shown remarkable performance in Korean text retrieval, specifically outperforming most multilingual embedding models.
  To our knowledge, it is one of the best publicly available Korean retrieval models.

@@ -19,7 +26,13 @@ For details, visit the [KURE repository](https://github.com/nlpai-lab/KURE)

  ---

- ### Model Description
+ ## Model Versions
+ | Model Name | Dimension | Sequence Length | Introduction |
+ |:----:|:---:|:---:|:---:|
+ | [KURE-v1](https://huggingface.co/nlpai-lab/KURE-v1) | 1024 | 8192 | Fine-tuned [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) with Korean data via [CachedGISTEmbedLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedgistembedloss) |
+ | [KoE5](https://huggingface.co/nlpai-lab/KoE5) | 1024 | 512 | Fine-tuned [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) with [ko-triplet-v1.0](https://huggingface.co/datasets/nlpai-lab/ko-triplet-v1.0) via [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) |
+
+ ## Model Description

  This is the model card of a 🤗 transformers model that has been pushed to the Hub.

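The next hunk's context line, `print(similarities)`, is the tail of a usage snippet that the README already contains but this diff does not show in full. As a stand-in for readers of this page, here is a minimal sketch of that kind of usage, assuming the standard sentence-transformers API; the query and documents are invented placeholders, and the README's own example remains the authoritative one.

```python
# Minimal retrieval sketch for KURE-v1 via sentence-transformers. Illustrative only:
# the query and documents are invented placeholders.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nlpai-lab/KURE-v1")

query = "한국의 수도는 어디인가?"            # "What is the capital of Korea?"
documents = [
    "서울은 대한민국의 수도이다.",           # "Seoul is the capital of South Korea."
    "부산은 대한민국 제2의 도시이다.",       # "Busan is South Korea's second city."
]

# Each embedding is 1024-dimensional, per the Model Versions table above.
query_embeddings = model.encode([query])
document_embeddings = model.encode(documents)

# model.similarity() (sentence-transformers >= 3.0) returns a score matrix;
# higher scores mean more relevant documents.
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```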
@@ -67,8 +80,8 @@ print(similarities)
  ### Training Data

  #### KURE-v1
- - Korean query-document-hard_negative (5 per query) data pairs
- - approx. 2,000,000 examples
+ - Korean query-document-hard_negative(5) data
+ - 2,000,000 examples

  ### Training Procedure
  loss: CachedGISTEmbedLoss
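CachedGISTEmbedLoss uses a guide model to screen out false in-batch negatives while caching activations so that large effective batch sizes fit in memory. A minimal sketch of how such a setup is typically wired in sentence-transformers follows; the guide model choice and `mini_batch_size` here are assumptions for illustration, not the authors' recorded training configuration.

```python
# Sketch of a CachedGISTEmbedLoss setup in sentence-transformers (illustrative;
# guide model and mini_batch_size are assumptions, not the authors' config).
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("BAAI/bge-m3")  # trainable base model, per the Model Versions table
guide = SentenceTransformer("BAAI/bge-m3")  # guide model that flags false in-batch negatives

# GIST masks in-batch negatives the guide model deems too similar to the positive;
# the cached variant processes each batch in mini-batches so large effective
# batch sizes fit in GPU memory.
loss = CachedGISTEmbedLoss(model, guide, mini_batch_size=32)

# Training samples are (query, positive document, hard negatives...) tuples,
# matching the query-document-hard_negative(5) data described above.
```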
@@ -82,22 +95,114 @@ loss: CachedGISTEmbedLoss
  ### Metrics
  - Recall, Precision, NDCG, F1
  ### Benchmark Datasets
- - Ko-StrategyQA: Korean ODQA multi-hop retrieval dataset (translated from StrategyQA)
- - AutoRAGRetrieval: Korean document retrieval dataset built by parsing PDFs from five domains: finance, public sector, medicine, law, and commerce
- - MIRACLRetrieval: Korean document retrieval dataset based on Wikipedia
- - PublicHealthQA: Korean document retrieval dataset for the medical and public health domains
- - BelebeleRetrieval: Korean document retrieval dataset based on FLORES-200
- - MrTidyRetrieval: Korean document retrieval dataset based on Wikipedia
- - MultiLongDocRetrieval: Korean long-document retrieval dataset across various domains
- - XPQARetrieval: Korean document retrieval dataset across various domains
+ - [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): Korean ODQA multi-hop retrieval dataset (translated from StrategyQA)
+ - [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): Korean document retrieval dataset built by parsing PDFs from five domains: finance, public sector, medicine, law, and commerce
+ - [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl): Korean document retrieval dataset based on Wikipedia
+ - [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): Korean document retrieval dataset for the medical and public health domains
+ - [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele): Korean document retrieval dataset based on FLORES-200
+ - [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): Korean document retrieval dataset based on Wikipedia
+ - [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR): Korean long-document retrieval dataset across various domains
+ - [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa): Korean document retrieval dataset across various domains

  ## Results

+ Below are the average results of all models across all benchmark datasets.
+ Detailed results are available on the [KURE GitHub](https://github.com/nlpai-lab/KURE/tree/main/eval/results).
+ ### Top-k 1
+ | Model | Average Recall_top1 | Average Precision_top1 | Average NDCG_top1 | Average F1_top1 |
+ |-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
+ | **nlpai-lab/KURE-v1** | **0.52640** | **0.60551** | **0.60551** | **0.55784** |
+ | dragonkue/BGE-m3-ko | 0.52361 | 0.60394 | 0.60394 | 0.55535 |
+ | BAAI/bge-m3 | 0.51778 | 0.59846 | 0.59846 | 0.54998 |
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.51246 | 0.59384 | 0.59384 | 0.54489 |
+ | nlpai-lab/KoE5 | 0.50157 | 0.57790 | 0.57790 | 0.53178 |
+ | intfloat/multilingual-e5-large | 0.50052 | 0.57727 | 0.57727 | 0.53122 |
+ | jinaai/jina-embeddings-v3 | 0.48287 | 0.56068 | 0.56068 | 0.51361 |
+ | BAAI/bge-multilingual-gemma2 | 0.47904 | 0.55472 | 0.55472 | 0.50916 |
+ | intfloat/multilingual-e5-large-instruct | 0.47842 | 0.55435 | 0.55435 | 0.50826 |
+ | intfloat/multilingual-e5-base | 0.46950 | 0.54490 | 0.54490 | 0.49947 |
+ | intfloat/e5-mistral-7b-instruct | 0.46772 | 0.54394 | 0.54394 | 0.49781 |
+ | Alibaba-NLP/gte-multilingual-base | 0.46469 | 0.53744 | 0.53744 | 0.49353 |
+ | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46633 | 0.53625 | 0.53625 | 0.49429 |
+ | openai/text-embedding-3-large | 0.44884 | 0.51688 | 0.51688 | 0.47572 |
+ | Salesforce/SFR-Embedding-2_R | 0.43748 | 0.50815 | 0.50815 | 0.46504 |
+ | upskyy/bge-m3-korean | 0.43125 | 0.50245 | 0.50245 | 0.45945 |
+ | jhgan/ko-sroberta-multitask | 0.33788 | 0.38497 | 0.38497 | 0.35678 |
+
+ ### Top-k 3
+ | Model | Average Recall_top3 | Average Precision_top3 | Average NDCG_top3 | Average F1_top3 |
+ |-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
+ | **nlpai-lab/KURE-v1** | **0.68678** | **0.28711** | **0.65538** | **0.39835** |
+ | dragonkue/BGE-m3-ko | 0.67834 | 0.28385 | 0.64950 | 0.39378 |
+ | BAAI/bge-m3 | 0.67526 | 0.28374 | 0.64556 | 0.39291 |
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.67128 | 0.28193 | 0.64042 | 0.39072 |
+ | intfloat/multilingual-e5-large | 0.65807 | 0.27777 | 0.62822 | 0.38423 |
+ | nlpai-lab/KoE5 | 0.65174 | 0.27329 | 0.62369 | 0.37882 |
+ | BAAI/bge-multilingual-gemma2 | 0.64415 | 0.27416 | 0.61105 | 0.37782 |
+ | jinaai/jina-embeddings-v3 | 0.64116 | 0.27165 | 0.60954 | 0.37511 |
+ | intfloat/multilingual-e5-large-instruct | 0.64353 | 0.27040 | 0.60790 | 0.37453 |
+ | Alibaba-NLP/gte-multilingual-base | 0.63744 | 0.26404 | 0.59695 | 0.36764 |
+ | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.63163 | 0.25937 | 0.59237 | 0.36263 |
+ | intfloat/multilingual-e5-base | 0.62099 | 0.26144 | 0.59179 | 0.36203 |
+ | intfloat/e5-mistral-7b-instruct | 0.62087 | 0.26144 | 0.58917 | 0.36188 |
+ | openai/text-embedding-3-large | 0.61035 | 0.25356 | 0.57329 | 0.35270 |
+ | Salesforce/SFR-Embedding-2_R | 0.60001 | 0.25253 | 0.56346 | 0.34952 |
+ | upskyy/bge-m3-korean | 0.59215 | 0.25076 | 0.55722 | 0.34623 |
+ | jhgan/ko-sroberta-multitask | 0.46930 | 0.18994 | 0.43293 | 0.26696 |
+
+ ### Top-k 5
+ | Model | Average Recall_top5 | Average Precision_top5 | Average NDCG_top5 | Average F1_top5 |
+ |-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
+ | **nlpai-lab/KURE-v1** | **0.73851** | **0.19130** | **0.67479** | **0.29903** |
+ | dragonkue/BGE-m3-ko | 0.72517 | 0.18799 | 0.66692 | 0.29401 |
+ | BAAI/bge-m3 | 0.72954 | 0.18975 | 0.66615 | 0.29632 |
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.72962 | 0.18875 | 0.66236 | 0.29542 |
+ | nlpai-lab/KoE5 | 0.70820 | 0.18287 | 0.64499 | 0.28628 |
+ | intfloat/multilingual-e5-large | 0.70124 | 0.18316 | 0.64402 | 0.28588 |
+ | BAAI/bge-multilingual-gemma2 | 0.70258 | 0.18556 | 0.63338 | 0.28851 |
+ | jinaai/jina-embeddings-v3 | 0.69933 | 0.18256 | 0.63133 | 0.28505 |
+ | intfloat/multilingual-e5-large-instruct | 0.69018 | 0.17838 | 0.62486 | 0.27933 |
+ | Alibaba-NLP/gte-multilingual-base | 0.69365 | 0.17789 | 0.61896 | 0.27879 |
+ | intfloat/multilingual-e5-base | 0.67250 | 0.17406 | 0.61119 | 0.27247 |
+ | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.67447 | 0.17114 | 0.60952 | 0.26943 |
+ | intfloat/e5-mistral-7b-instruct | 0.67449 | 0.17484 | 0.60935 | 0.27349 |
+ | openai/text-embedding-3-large | 0.66365 | 0.17004 | 0.59389 | 0.26677 |
+ | Salesforce/SFR-Embedding-2_R | 0.65622 | 0.17018 | 0.58494 | 0.26612 |
+ | upskyy/bge-m3-korean | 0.65477 | 0.17015 | 0.58073 | 0.26589 |
+ | jhgan/ko-sroberta-multitask | 0.53136 | 0.13264 | 0.45879 | 0.20976 |
+
+ ### Top-k 10
+ | Model | Average Recall_top10 | Average Precision_top10 | Average NDCG_top10 | Average F1_top10 |
+ |-----------------------------------------|----------------------|------------------------|-------------------|-----------------|
+ | **nlpai-lab/KURE-v1** | **0.79682** | **0.10624** | **0.69473** | **0.18524** |
+ | dragonkue/BGE-m3-ko | 0.78450 | 0.10492 | 0.68748 | 0.18288 |
+ | BAAI/bge-m3 | 0.79195 | 0.10592 | 0.68723 | 0.18456 |
+ | Snowflake/snowflake-arctic-embed-l-v2.0 | 0.78669 | 0.10462 | 0.68189 | 0.18260 |
+ | intfloat/multilingual-e5-large | 0.75902 | 0.10147 | 0.66370 | 0.17693 |
+ | nlpai-lab/KoE5 | 0.75296 | 0.09937 | 0.66012 | 0.17369 |
+ | BAAI/bge-multilingual-gemma2 | 0.76153 | 0.10364 | 0.65330 | 0.18003 |
+ | jinaai/jina-embeddings-v3 | 0.76277 | 0.10240 | 0.65290 | 0.17843 |
+ | intfloat/multilingual-e5-large-instruct | 0.74851 | 0.09888 | 0.64451 | 0.17283 |
+ | Alibaba-NLP/gte-multilingual-base | 0.75631 | 0.09938 | 0.64025 | 0.17363 |
+ | Alibaba-NLP/gte-Qwen2-7B-instruct | 0.74092 | 0.09607 | 0.63258 | 0.16847 |
+ | intfloat/multilingual-e5-base | 0.73512 | 0.09717 | 0.63216 | 0.16977 |
+ | intfloat/e5-mistral-7b-instruct | 0.73795 | 0.09777 | 0.63076 | 0.17078 |
+ | openai/text-embedding-3-large | 0.72946 | 0.09571 | 0.61670 | 0.16739 |
+ | Salesforce/SFR-Embedding-2_R | 0.71662 | 0.09546 | 0.60589 | 0.16651 |
+ | upskyy/bge-m3-korean | 0.71895 | 0.09583 | 0.60258 | 0.16712 |
+ | jhgan/ko-sroberta-multitask | 0.61225 | 0.07826 | 0.48687 | 0.13757 |
+ <br/>

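The tables above report Recall, Precision, NDCG, and F1 averaged over all benchmark datasets at cutoffs 1, 3, 5, and 10. The official evaluation code lives in the KURE GitHub repository linked above; the following definition-level sketch, assuming binary relevance, only illustrates what the column headers mean.

```python
# Definition-level sketch of the cutoff metrics reported above, assuming binary
# relevance. Illustration only; the official evaluation code is in the KURE repo.
import math

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Share of all relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def f1_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0
```

These definitions also explain the shape of the tables: recall rises with k while precision falls, so the F1 columns peak at the smallest cutoff.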
  ## Citation

  If you find our paper or models helpful, please consider citing us as follows:
  ```text
+ @misc{KURE,
+ author = {Youngjoon Jang and Junyoung Son and Taemin Lee},
+ year = {2024},
+ url = {https://github.com/nlpai-lab/KURE}
+ }
+
  @misc{KoE5,
  author = {NLP & AI Lab and Human-Inspired AI research},
  title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},