conan1024hao committed
Commit fea7bb4 · 1 Parent(s): 14a3793

Update README.md

Files changed (1)
  1. README.md +6 -1
README.md CHANGED
@@ -22,7 +22,12 @@ from transformers import AutoTokenizer, AutoModelForMaskedLM
  tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
  model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
  ```
- You don't need any text segmentation when you fine-tune downstream tasks. (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning.)
+ You don't need any text segmentation when fine-tuning on downstream tasks. (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning.)
+
+ ### Morphological analysis tools
+ - ZH: For Chinese, we use [LTP](https://github.com/HIT-SCIR/ltp).
+ - JA: For Japanese, we use [Juman++](https://github.com/ku-nlp/jumanpp).
+ - KO: For Korean, we use [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class).

  ### Tokenization
  We use character-based tokenization with whole-word-masking strategy.
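
For reference, a minimal sketch of how the pieces described in the updated README might be used together: character-level mask filling with the published checkpoint, plus an optional morphological-analysis pass (shown here with KoNLPy's Kkma class; LTP and Juman++ play the same role for Chinese and Japanese). The `fill-mask` pipeline wrapper and the example sentences are illustrative assumptions, not part of the commit.

```python
# Minimal sketch, assuming `transformers` and `konlpy` are installed.
# Example sentences are illustrative and not taken from the repository.
from transformers import pipeline

# Character-level masked-LM inference with the published checkpoint;
# with character-based tokenization, each [MASK] covers a single character.
unmasker = pipeline("fill-mask", model="conan1024hao/cjkbert-small")
print(unmasker("東京[MASK]学に合格した。"))

# Optional pre-processing before fine-tuning: morphological analysis.
# Korean example via KoNLPy's Kkma class (the README lists LTP for ZH and Juman++ for JA).
from konlpy.tag import Kkma

kkma = Kkma()
print(kkma.morphs("한국어 형태소 분석 예시입니다."))  # -> list of morpheme strings
```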