---
language:
- ja
- zh
- ko
license: cc-by-sa-4.0
datasets:
- wikipedia
mask_token: "[MASK]"
widget:
- text: "早稲田大学で自然言語処理を[MASK]ぶ。"
- text: "李白是[MASK]朝人。"
- text: "불고기[MASK] 먹겠습니다."
---
### Model description
- This model was trained on the **ZH, JA, KO** Wikipedia corpora for 5 epochs.
### How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and masked-LM model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
```
- You do not need to segment the text before fine-tuning on downstream tasks.
- (Though you may obtain better results if you apply morphological analysis to the data before fine-tuning; see the tools listed below.)
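As a quick sanity check, you can query the model with the standard `fill-mask` pipeline. The sentence below is one of the widget examples above ("I [MASK] natural language processing at Waseda University."):
```python
from transformers import pipeline

# Standard transformers fill-mask pipeline; downloads the model on first use
fill_mask = pipeline("fill-mask", model="conan1024hao/cjkbert-small")

# Print each candidate token and its score
for prediction in fill_mask("早稲田大学で自然言語処理を[MASK]ぶ。"):
    print(prediction["token_str"], prediction["score"])
```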
### Morphological analysis tools
- ZH: [LTP](https://github.com/HIT-SCIR/ltp)
- JA: [Juman++](https://github.com/ku-nlp/jumanpp)
- KO: [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class)
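If you do want to pre-segment your data before fine-tuning, a minimal sketch for Korean using the Kkma class named above might look like the following. Joining the morphemes with spaces is our illustrative choice, not a step prescribed by this model card:
```python
# Minimal sketch: morpheme segmentation of Korean text with KoNLPy's Kkma.
from konlpy.tag import Kkma

kkma = Kkma()
text = "불고기를 먹겠습니다."  # "I will eat bulgogi."
# morphs() returns the list of morphemes; we rejoin them with spaces
print(" ".join(kkma.morphs(text)))
```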
### Tokenization
- We use character-based tokenization with a **whole-word-masking** strategy.
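Concretely, character-based tokenization means each CJK character becomes its own token. A small illustration, using one of the widget examples above (the expected split is our assumption based on the description, not verified output):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
# Each CJK character should come out as a separate token, e.g.
# ['李', '白', '是', '唐', '朝', '人', '。'] (assumed, per the description above)
print(tokenizer.tokenize("李白是唐朝人。"))
```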
### Model size
- vocab_size: 15015
- num_hidden_layers: 4
- hidden_size: 512
- num_attention_heads: 8
- param_num: 25M
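For reference, the numbers above can be plugged into a standard `BertConfig`. All remaining hyperparameters (e.g. the intermediate size) are library defaults and our assumption, so the parameter count only roughly matches the stated 25M:
```python
from transformers import BertConfig, BertForMaskedLM

# Hypothetical reconstruction from the numbers above; every hyperparameter
# not listed in the model card is a BertConfig default (an assumption).
config = BertConfig(
    vocab_size=15015,
    num_hidden_layers=4,
    hidden_size=512,
    num_attention_heads=8,
)
model = BertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # roughly 25M
```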