conan1024hao committed · fea7bb4 · 1 Parent(s): 14a3793
Update README.md

README.md CHANGED
@@ -22,7 +22,12 @@ from transformers import AutoTokenizer, AutoModelForMaskedLM
 tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
 model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
 ```
-
+You don't need any text segmentation before fine-tuning on downstream tasks. (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning.)
+
+### Morphological analysis tools
+- ZH: For Chinese, we use [LTP](https://github.com/HIT-SCIR/ltp).
+- JA: For Japanese, we use [Juman++](https://github.com/ku-nlp/jumanpp).
+- KO: For Korean, we use [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class).
 
 ### Tokenization
 We use character-based tokenization with a whole-word-masking strategy.
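
To make the character-based tokenization concrete, here is a small sketch built on the loading snippet above. The example sentence, the use of `tokenizer.mask_token`, and the `fill-mask` pipeline call are illustrative assumptions and not part of the model card itself.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")

# With character-based tokenization, each CJK character should come back
# as its own token (the sentence is made up for illustration).
print(tokenizer.tokenize("早稲田大学で自然言語処理を学ぶ"))

# The masked-LM head can be exercised through the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"早稲田大学で自然言語処理を{tokenizer.mask_token}ぶ"))
```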
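And a sketch of the optional pre-segmentation step mentioned in the added text, using KoNLPy's Kkma class for Korean (LTP and Juman++ fill the same role for Chinese and Japanese). The sample sentence and the whitespace-joining convention are assumptions for illustration only.

```python
from konlpy.tag import Kkma

kkma = Kkma()
text = "한국어 문장을 형태소 단위로 나눕니다."  # sample sentence (illustrative)

# Split the sentence into morphemes, then rejoin with spaces so the
# fine-tuning data is morphologically segmented plain text.
segmented = " ".join(kkma.morphs(text))
print(segmented)
```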