KISTI-AIDATA
committed on
Update README.md
README.md
CHANGED
@@ -1,3 +1,110 @@
---
license: cc-by-nc-3.0
---

# Pre-trained BERT Model for the Science and Technology Domain (KorSci BERT)
The KorSci BERT language model is one of the outputs of a joint research project between the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). It is based on the architecture of the original [Google BERT base](https://github.com/google-research/bert) model and was pre-trained on a Korean paper and patent corpus totaling 97 GB (about 380 million sentences).

## Train dataset
|Type|Corpus size|Sentences|Avg. sentence length|
|--|--|--|--|
|Papers|15 GB|72,735,757|122.11|
|Patents|82 GB|316,239,927|120.91|
|Total|97 GB|388,975,684|121.13|
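
The figures in the last row can be cross-checked from the two corpus rows. A minimal sketch, using only the numbers in the table above (the variable names are illustrative): the total sentence count is the sum of the two rows, and the overall average sentence length is the sentence-weighted mean of the per-corpus averages.

```python
# Figures copied from the table above: (sentence count, average sentence length).
rows = {
    "papers": (72_735_757, 122.11),
    "patents": (316_239_927, 120.91),
}

total_sentences = sum(count for count, _ in rows.values())

# Weight each corpus average by its number of sentences.
weighted_avg_length = (
    sum(count * avg for count, avg in rows.values()) / total_sentences
)

print(f"{total_sentences:,}")         # 388,975,684
print(round(weighted_avg_length, 2))  # 121.13
```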

## Model architecture
- attention_probs_dropout_prob: 0.1
- directionality: "bidi"
- hidden_act: "gelu"
- hidden_dropout_prob: 0.1
- hidden_size: 768
- initializer_range: 0.02
- intermediate_size: 3072
- max_position_embeddings: 512
- num_attention_heads: 12
- num_hidden_layers: 12
- pooler_fc_size: 768
- pooler_num_attention_heads: 12
- pooler_num_fc_layers: 3
- pooler_size_per_head: 128
- pooler_type: "first_token_transform"
- type_vocab_size: 2
- vocab_size: 15330
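
As a rough sketch, the size-related hyperparameters above can be written as a Python dictionary and used to estimate the size of the encoder. The count assumes the standard BERT layout (token, position and segment embeddings, Transformer layers with biases and layer normalization, and a pooler) and ignores the pre-training output heads. The function name is hypothetical, and the result is an estimate, not a figure reported by the authors.

```python
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 15330,
}


def estimate_encoder_params(cfg):
    """Estimate trainable parameters of a standard BERT encoder plus pooler."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # gamma and beta

    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm
    )
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (h * h + h) + layer_norm
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h

    per_layer = attention + feed_forward
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler


print(f"{estimate_encoder_params(config):,}")
```

Under these assumptions the estimate comes to roughly 97.8 million parameters, of which about 11.8 million are the 15,330 x 768 token embeddings.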

## Vocabulary
- Total: 15,330 words
- Includes the special tokens [PAD], [UNK], [CLS], [SEP], [MASK]
- File name: vocab_kisti.txt
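
A minimal sketch of reading such a vocabulary file, assuming `vocab_kisti.txt` follows the usual BERT convention of one token per line, with the token ID given by the zero-based line index. `load_vocab` and `missing_special_tokens` are illustrative helpers, not the functions shipped with the tokenizer.

```python
import collections


def load_vocab(vocab_file):
    """Map each token to its zero-based line index (assumed BERT vocab format)."""
    vocab = collections.OrderedDict()
    with open(vocab_file, encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            token = line.rstrip("\n")
            vocab[token] = index
    return vocab


def missing_special_tokens(vocab):
    """Return the special tokens listed above that are absent from the vocab."""
    special = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    return [token for token in special if token not in vocab]
```

If the file matches the description above, `load_vocab("./vocab_kisti.txt")` would be expected to return 15,330 entries and `missing_special_tokens` an empty list.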

## Language model
- Model file: model.ckpt-262500 (TensorFlow checkpoint file)

## Pre-training
- Trained for 1,600,000 steps at sequence length 128 plus 500,000 steps at sequence length 512
- Trained on the 380 million sentences of the paper + patent corpus (97 GB)
- Distributed training on 8 NVIDIA V100 32 GB GPUs with the [Horovod Lib](https://github.com/horovod/horovod)
- Uses NVIDIA [Automatic Mixed Precision](https://developer.nvidia.com/automatic-mixed-precision)

## Downstream task evaluation
The language model was evaluated by fine-tuning it on two tasks: classification under the science and technology standard classification, and patent classification under the Cooperative Patent Classification ([CPC](https://www.kipo.go.kr/kpo/HtmlApp?c=4021&catmenu=m06_07_01)). The results are shown below.
|Type|Classes|Train|Test|Metric|Train result|Test result|
|--|--|--|--|--|--|--|
|Science and technology standard classification|86|130,515|14,502|Accuracy|68.21|70.31|
|Patent CPC classification|144|390,540|16,315|Accuracy|86.87|76.25|


# Tokenizer for the Science and Technology Domain (KorSci Tokenizer)

This tokenizer is one of the outputs of the joint research project between the Korea Institute of Science and Technology Information (KISTI) and the Korea Institute of Patent Information (KIPI). It merges the [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/), extended with a user dictionary of about 6 million nouns and compound nouns built from the corpus used for the pre-trained model above, with the original [BERT WordPiece Tokenizer](https://github.com/google-research/bert).
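
The two-stage idea can be sketched as follows: a morphological analyzer first splits the text into morphemes, and each morpheme is then split into WordPiece units by greedy longest-match-first lookup against the vocabulary. The code below is only an illustration under stated assumptions: the morpheme list is written by hand to stand in for Mecab-ko output, the vocabulary is a toy one, and `wordpiece` is a hypothetical helper rather than the function in `tokenization_kisti`.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption, not the contents of vocab_kisti.txt).
toy_vocab = {"[UNK]", "tokeniz", "##er", "##s", "patent"}

# Hand-written stand-in for the morphemes a Mecab-style analyzer would return.
morphemes = ["patent", "tokenizers", "xyz"]

tokens = [piece for m in morphemes for piece in wordpiece(m, toy_vocab)]
print(tokens)  # ['patent', 'tokeniz', '##er', '##s', '[UNK]']
```

Running the morphological split first means that dictionary nouns tend to reach the WordPiece stage as whole units, so they are only broken into `##` pieces when the vocabulary lacks them.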

## Model download
http://doi.org/10.23057/46

## Requirements

### Install Eunjeon Mecab and add the user dictionaries
Installation URL: https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/

- mecab-ko > 0.996-ko-0.9.2
- mecab-ko-dic > 2.1.1
- mecab-python > 0.996-ko-0.9.2

### Paper & patent user dictionaries
- Paper user dictionary: pap_all_mecab_dic.csv (1,001,328 words)
- Patent user dictionary: pat_all_mecab_dic.csv (5,000,000 words)

### Install konlpy
`pip install konlpy`

- konlpy > 0.5.2

## Usage
```python
import tokenization_kisti as tokenization

vocab_file = "./vocab_kisti.txt"

# Build the combined tokenizer; do_lower_case=False keeps the original casing.
tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file,
    do_lower_case=False,
    tokenizer_type="Mecab"
)

# Example sentence: a Korean utility-model abstract about a synthetic detergent pouch.
example = "본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다."
tokens = tokenizer.tokenize(example)
encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)
decoded_tokens = tokenizer.convert_ids_to_tokens(encoded_tokens)

print("Input example ===>", example)
print("Tokenized example ===>", tokens)
print("Converted example to IDs ===>", encoded_tokens)
print("Converted IDs to example ===>", decoded_tokens)
```

```
============ Result ================
Input example ===> 본 고안은 주로 일회용 합성세제액을 집어넣어 밀봉하는 세제액포의 내부를 원호상으로 열중착하되 세제액이 배출되는 절단부 쪽으로 내벽을 협소하게 형성하여서 내부에 들어있는 세제액을 잘짜질 수 있도록 하는 합성세제 액포에 관한 것이다.
Tokenized example ===> ['본', '고안', '은', '주로', '일회용', '합성', '##세', '##제', '##액', '을', '집', '##어', '##넣', '어', '밀봉', '하', '는', '세제', '##액', '##포', '의', '내부', '를', '원호', '상', '으로', '열', '##중', '착', '##하', '되', '세제', '##액', '이', '배출', '되', '는', '절단부', '쪽', '으로', '내벽', '을', '협', '##소', '하', '게', '형성', '하', '여서', '내부', '에', '들', '어', '있', '는', '세제', '##액', '을', '잘', '짜', '질', '수', '있', '도록', '하', '는', '합성', '##세', '##제', '액', '##포', '에', '관한', '것', '이', '다', '.']
Converted example to IDs ===> [59, 619, 30, 2336, 8268, 819, 14100, 13986, 14198, 15, 732, 13994, 14615, 39, 1964, 12, 11, 6174, 14198, 14061, 9, 366, 16, 7267, 18, 32, 307, 14072, 891, 13967, 27, 6174, 14198, 14, 698, 27, 11, 12920, 1972, 32, 4482, 15, 2228, 14053, 12, 65, 117, 12, 4477, 366, 10, 56, 39, 26, 11, 6174, 14198, 15, 1637, 13709, 398, 25, 26, 140, 12, 11, 819, 14100, 13986, 377, 14061, 10, 487, 55, 14, 17, 13]
Converted IDs to example ===> ['본', '고안', '은', '주로', '일회용', '합성', '##세', '##제', '##액', '을', '집', '##어', '##넣', '어', '밀봉', '하', '는', '세제', '##액', '##포', '의', '내부', '를', '원호', '상', '으로', '열', '##중', '착', '##하', '되', '세제', '##액', '이', '배출', '되', '는', '절단부', '쪽', '으로', '내벽', '을', '협', '##소', '하', '게', '형성', '하', '여서', '내부', '에', '들', '어', '있', '는', '세제', '##액', '을', '잘', '짜', '질', '수', '있', '도록', '하', '는', '합성', '##세', '##제', '액', '##포', '에', '관한', '것', '이', '다', '.']
```
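
Before token IDs are passed to a BERT model, the token sequence is normally wrapped in `[CLS]` and `[SEP]` and padded to a fixed length. The sketch below shows that step for a single sentence, following the common BERT input convention. `build_inputs` and the toy vocabulary with its made-up IDs are hypothetical and only illustrate the shape of the result.

```python
def build_inputs(tokens, vocab, max_seq_length):
    """Build input_ids, input_mask and segment_ids for one sentence."""
    # Reserve two positions for [CLS] and [SEP].
    tokens = tokens[: max_seq_length - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]

    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens
    segment_ids = [0] * len(input_ids)  # single sentence: segment 0 only

    # Pad all three lists up to the fixed sequence length.
    padding = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    input_mask += [0] * padding
    segment_ids += [0] * padding
    return input_ids, input_mask, segment_ids


# Toy vocabulary with made-up IDs (assumption, not taken from vocab_kisti.txt).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "detergent": 4, "##s": 5}

ids, mask, segments = build_inputs(["detergent", "##s", "pouch"], toy_vocab, 8)
print(ids)   # [2, 4, 5, 1, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```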


### Fine-tuning with KorSci-Bert
- See the fine-tuning instructions for [Google Bert](https://github.com/google-research/bert)
- Sentence (and sentence-pair) classification tasks: use the "run_classifier.py" code
- MRC (Machine Reading Comprehension) tasks: use the "run_squad.py" code
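
As a hedged sketch, a sentence-classification run with Google's `run_classifier.py` could be assembled as below. The flag names are those defined by `run_classifier.py` in the Google BERT repository; the directory paths, the `bert_config.json` file name, the task name and the hyperparameter values are placeholders. `run_classifier.py` imports Google's own `tokenization` module, so using the KorSci tokenizer would presumably require swapping in `tokenization_kisti` and registering a data processor for the task. Both are assumptions, not steps documented here.

```python
import shlex


def build_finetune_command(model_dir, data_dir, output_dir, task_name):
    """Assemble a run_classifier.py command line as a list of arguments."""
    return [
        "python", "run_classifier.py",
        f"--task_name={task_name}",
        "--do_train=true",
        "--do_eval=true",
        f"--data_dir={data_dir}",
        f"--vocab_file={model_dir}/vocab_kisti.txt",
        f"--bert_config_file={model_dir}/bert_config.json",  # assumed file name
        f"--init_checkpoint={model_dir}/model.ckpt-262500",
        "--do_lower_case=false",
        "--max_seq_length=128",    # placeholder value
        "--train_batch_size=32",   # placeholder value
        "--learning_rate=2e-5",    # placeholder value
        "--num_train_epochs=3.0",  # placeholder value
        f"--output_dir={output_dir}",
    ]


# Hypothetical directories and task name, for illustration only.
command = build_finetune_command("./korsci_bert", "./data", "./output", "my_task")
print(shlex.join(command))
```

Building the command as a list keeps each flag separate, so it can be handed to `subprocess.run(command)` without shell quoting issues.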