Update README.md
README.md
CHANGED
@@ -1,43 +1,52 @@
----
-language:
-- tr
-tags:
-- roberta
-license: cc-by-nc-sa-4.0
-datasets:
-- oscar
----
-
-# RoBERTa Turkish medium Morph-level 16k (uncased)
-
-Pretrained model on Turkish language using a masked language modeling (MLM) objective. The model is uncased.
-The pretrained corpus is OSCAR's Turkish split, but it is further filtered and cleaned.
-
-Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is Morph-level, which means that text is split according to a Turkish morphological analyzer (Zemberek). Vocabulary size is 16.7k.
-
-## Note that this model needs a preprocessing step before running, because the tokenizer file is not a morphological analyzer. That is, the test dataset cannot be split into morphemes with the tokenizer file. The user needs to process any test dataset by a Turkish morphological analyzer (Zemberek in this case) before running evaluation.
-
-The details can be found at this paper:
-https://arxiv.org
-
-The following code can be used for model loading and tokenization, example max length (514) can be changed:
-```
-model = AutoModel.from_pretrained([model_path])
-#for sequence classification:
-#model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
-
-tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
-tokenizer.mask_token = "[MASK]"
-tokenizer.cls_token = "[CLS]"
-tokenizer.sep_token = "[SEP]"
-tokenizer.pad_token = "[PAD]"
-tokenizer.unk_token = "[UNK]"
-tokenizer.bos_token = "[CLS]"
-tokenizer.eos_token = "[SEP]"
-tokenizer.model_max_length = 514
-```
-
-### BibTeX entry and citation info
-```bibtex
-@
-
+---
+language:
+- tr
+tags:
+- roberta
+license: cc-by-nc-sa-4.0
+datasets:
+- oscar
+---
+
+# RoBERTa Turkish medium Morph-level 16k (uncased)
+
+Pretrained model on Turkish language using a masked language modeling (MLM) objective. The model is uncased.
+The pretrained corpus is OSCAR's Turkish split, but it is further filtered and cleaned.
+
+Model architecture is similar to bert-medium (8 layers, 8 heads, and 512 hidden size). Tokenization algorithm is Morph-level, which means that text is split according to a Turkish morphological analyzer (Zemberek). Vocabulary size is 16.7k.
+
+## Note that this model needs a preprocessing step before running, because the tokenizer file is not a morphological analyzer. That is, the test dataset cannot be split into morphemes with the tokenizer file. The user needs to process any test dataset by a Turkish morphological analyzer (Zemberek in this case) before running evaluation.
+
+The details and performance comparisons can be found at this paper:
+https://arxiv.org/abs/2204.08832
+
+The following code can be used for model loading and tokenization, example max length (514) can be changed:
+```
+model = AutoModel.from_pretrained([model_path])
+#for sequence classification:
+#model = AutoModelForSequenceClassification.from_pretrained([model_path], num_labels=[num_classes])
+
+tokenizer = PreTrainedTokenizerFast(tokenizer_file=[file_path])
+tokenizer.mask_token = "[MASK]"
+tokenizer.cls_token = "[CLS]"
+tokenizer.sep_token = "[SEP]"
+tokenizer.pad_token = "[PAD]"
+tokenizer.unk_token = "[UNK]"
+tokenizer.bos_token = "[CLS]"
+tokenizer.eos_token = "[SEP]"
+tokenizer.model_max_length = 514
+```
+
+### BibTeX entry and citation info
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2204.08832,
+doi = {10.48550/ARXIV.2204.08832},
+url = {https://arxiv.org/abs/2204.08832},
+author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
+keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
+publisher = {arXiv},
+year = {2022},
+copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
+}
+```
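
The README's note says that text must be split into morphemes by a morphological analyzer before it reaches the tokenizer file. A minimal sketch of that preprocessing step follows. It assumes the tokenizer expects lowercased morphemes separated by spaces, which the README does not state explicitly. `TOY_SEGMENTATIONS`, `toy_analyzer`, `turkish_lower`, and `segment` are hypothetical names introduced for illustration; the hand-written lookup table only stands in for Zemberek, whose actual API is not shown here.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a real morphological analyzer such as Zemberek.
# The segmentations below are hand-written toy data for illustration only.
TOY_SEGMENTATIONS: Dict[str, List[str]] = {
    "evlerde": ["ev", "ler", "de"],
    "kitaplar": ["kitap", "lar"],
}


def toy_analyzer(word: str) -> List[str]:
    """Return the morphemes of `word`, or the word itself if it is unknown."""
    return TOY_SEGMENTATIONS.get(word, [word])


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules (I -> ı, İ -> i).

    Assumption: the model is uncased, so input is lowercased first; whether
    the original pipeline used Turkish-specific casing is not stated.
    """
    return text.replace("I", "ı").replace("İ", "i").lower()


def segment(text: str, analyzer: Callable[[str], List[str]]) -> str:
    """Lowercase `text` and replace each word with its space-separated morphemes."""
    morphemes: List[str] = []
    for word in turkish_lower(text).split():
        morphemes.extend(analyzer(word))
    return " ".join(morphemes)


segmented = segment("Evlerde kitaplar var", toy_analyzer)
print(segmented)
```

Under these assumptions, the string returned by `segment` would be what gets passed to the `PreTrainedTokenizerFast` instance created in the README snippet, with a real analyzer replacing `toy_analyzer`.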