---
language: 
  - ja
  - zh
  - ko
license: cc-by-sa-4.0
datasets:
- wikipedia
mask_token: "[MASK]"
widget:
- text: "早稲田大学で自然言語処理を[MASK]ぶ。"
- text: "李白是[MASK]朝人。"
- text: "불고기[MASK] 먹겠습니다."
---

### Model description
This model was trained on the Chinese (ZH), Japanese (JA), and Korean (KO) Wikipedia dumps for 5 epochs.

### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
```
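To sanity-check the masked-LM head, you can wrap the loaded model in a `fill-mask` pipeline, for example with one of the widget sentences above (a minimal sketch; the predictions shown depend on the trained weights):
```python
from transformers import pipeline

# Reuse the tokenizer and model loaded above to build a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# "I [MASK] natural language processing at Waseda University." (one of the widget examples)
for prediction in fill_mask("早稲田大学で自然言語処理を[MASK]ぶ。"):
    print(prediction["token_str"], round(prediction["score"], 3))
```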
You don't need any text segmentation before fine-tuning on downstream tasks. (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning; see the tools below.)

### Morphological analysis tools
- ZH: [LTP](https://github.com/HIT-SCIR/ltp)
- JA: [Juman++](https://github.com/ku-nlp/jumanpp)
- KO: [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class); a minimal segmentation sketch follows below.
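As an illustration of the kind of pre-segmentation meant above, here is a minimal sketch using KoNLPy's Kkma class for Korean; joining morphemes with spaces is just one reasonable convention, not necessarily the exact preprocessing used for this model, and LTP / Juman++ would play the analogous role for Chinese / Japanese:
```python
from konlpy.tag import Kkma

kkma = Kkma()

# Split a raw Korean sentence ("I will eat bulgogi.") into morphemes, then
# rejoin them with spaces so word boundaries become visible to whole-word masking.
morphemes = kkma.morphs("불고기를 먹겠습니다.")
print(" ".join(morphemes))
```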

### Tokenization
We use character-based tokenization with a whole-word-masking strategy.
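In practice this means each CJK character becomes (roughly) one token, while masking at training time is applied over whole words. A minimal sketch (the exact sub-token strings depend on the vocabulary, so treat the output as illustrative):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")

# Character-based tokenization: each CJK character maps to (roughly) one token.
print(tokenizer.tokenize("早稲田大学で自然言語処理を学ぶ。"))
```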

### Model size
- vocab_size: 15015
- num_hidden_layers: 4
- hidden_size: 512
- num_attention_heads: 8
- param_num: 25M
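
These numbers can be read back from the released checkpoint (a quick sanity-check sketch; the printed parameter count is the exact figure behind the rounded 25M):
```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")

# Configuration values listed above.
print(config.vocab_size, config.num_hidden_layers,
      config.hidden_size, config.num_attention_heads)

# Total parameter count (roughly 25M).
print(f"{model.num_parameters():,} parameters")
```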