|
--- |
|
language: |
|
- ja |
|
- zh |
|
- ko |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- wikipedia |
|
mask_token: "[MASK]" |
|
widget: |
|
- text: "早稲田大学で自然言語処理を[MASK]ぶ。" |
|
- text: "李白是[MASK]朝人。" |
|
- text: "불고기[MASK] 먹겠습니다." |
|
--- |
|
|
|
### Model description |
|
- This model was trained on Chinese (**ZH**), Japanese (**JA**), and Korean (**KO**) Wikipedia for 5 epochs.
|
|
|
### How to use |
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small") |
|
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small") |
|
``` |
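
As a quick check, you can run the model through the `fill-mask` pipeline on the widget examples from the header above (the predicted tokens you get back may of course vary):

```python
from transformers import pipeline

# Mask filling with the same checkpoint; the mask token is "[MASK]".
fill_mask = pipeline("fill-mask", model="conan1024hao/cjkbert-small")

# Widget examples from the header of this card.
print(fill_mask("早稲田大学で自然言語処理を[MASK]ぶ。"))
print(fill_mask("李白是[MASK]朝人。"))
```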
|
- You don't need to segment the text before fine-tuning on downstream tasks; raw sentences can be passed to the tokenizer as-is (see the sketch below).

- (You may still obtain better results if you apply morphological analysis to the data before fine-tuning.)
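
A minimal sketch of feeding raw, unsegmented text to the tokenizer; the sequence-classification head and `num_labels=2` are hypothetical, only to make the example self-contained:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
# Hypothetical binary classification head on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "conan1024hao/cjkbert-small", num_labels=2
)

# Raw sentences, no word segmentation beforehand.
batch = tokenizer(
    ["早稲田大学で自然言語処理を学ぶ。", "불고기를 먹겠습니다."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**batch).logits
```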
|
|
|
### Morphological analysis tools |
|
- ZH: we use [LTP](https://github.com/HIT-SCIR/ltp).

- JA: we use [Juman++](https://github.com/ku-nlp/jumanpp).

- KO: we use [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class).
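
If you do want to pre-segment your data with these tools, the calls look roughly like the following. This is a sketch that assumes the `pyknp` Python wrapper for Juman++ and a working KoNLPy install; LTP offers an analogous word-segmentation API for Chinese:

```python
from konlpy.tag import Kkma  # KO: KoNLPy (Kkma class)
from pyknp import Juman      # JA: pyknp, a Python wrapper for Juman++

# Japanese: run Juman++ and whitespace-join the surface forms.
jumanpp = Juman()
result = jumanpp.analysis("早稲田大学で自然言語処理を学ぶ。")
ja_segmented = " ".join(m.midasi for m in result.mrph_list())

# Korean: run Kkma and whitespace-join the morphemes.
kkma = Kkma()
ko_segmented = " ".join(kkma.morphs("불고기를 먹겠습니다."))

print(ja_segmented)
print(ko_segmented)
```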
|
|
|
### Tokenization |
|
- We use character-based tokenization with a **whole-word-masking** strategy.
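
Concretely, the tokenizer splits input into single characters; whole-word masking is a pretraining-time strategy, so at inference or fine-tuning time you simply see the character pieces. A minimal check:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")

# Expect character-level pieces (possibly with BERT-style "##" continuation
# markers) rather than multi-character word pieces.
print(tokenizer.tokenize("早稲田大学で自然言語処理を学ぶ。"))
```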
|
|
|
### Model size |
|
- vocab_size: 15015 |
|
- num_hidden_layers: 4 |
|
- hidden_size: 512 |
|
- num_attention_heads: 8 |
|
- param_num: 25M |
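
These values can be read back from the released checkpoint, e.g.:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("conan1024hao/cjkbert-small")
print(config.vocab_size, config.num_hidden_layers,
      config.hidden_size, config.num_attention_heads)

model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
print(sum(p.numel() for p in model.parameters()))  # total parameter count
```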