jlehecka committed · Commit 655d635 · verified · 1 Parent(s): a003a26

Create README.md

  1. README.md +87 -0
README.md ADDED
@@ -0,0 +1,87 @@
+ ---
+ language: "cs"
+ tags:
+ - Czech
+ - KKY
+ - FAV
+ - RoBERTa
+ license: "cc-by-nc-sa-4.0"
+ ---
+
+ # FERNET-C5-RoBERTa
+ FERNET-C5-RoBERTa (FERNET stands for **F**lexible **E**mbedding **R**epresentation **NET**work) is a monolingual Czech RoBERTa-base model pre-trained on the Czech Colossal Clean Crawled Corpus (C5).
+ It is a successor to the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5).
+ See our paper for details.
+
+ ## How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
+ >>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")
+
+ [{'score': 0.13343162834644318,
+ 'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
+ 'token': 33582,
+ 'token_str': ' textem'},
+ {'score': 0.12583224475383759,
+ 'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s '
+ 'počítačem.',
+ 'token': 32837,
+ 'token_str': ' počítačem'},
+ {'score': 0.0796666219830513,
+ 'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
+ 'token': 15876,
+ 'token_str': ' obrázky'},
+ {'score': 0.06347835063934326,
+ 'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
+ 'token': 5426,
+ 'token_str': ' lidmi'},
+ {'score': 0.050984010100364685,
+ 'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
+ 'token': 5468,
+ 'token_str': ' dětmi'}]
+ ```
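
The pipeline returns a plain Python list of dictionaries, so the predictions can be post-processed without any extra library. A minimal sketch, assuming the output shown above has been stored in a variable; the names `predictions` and `best_fill` are hypothetical, and the scores are rounded here for brevity:

```python
# Hypothetical variable holding the fill-mask output shown above
# (scores rounded, 'sequence' entries omitted for brevity).
predictions = [
    {"score": 0.1334, "token": 33582, "token_str": " textem"},
    {"score": 0.1258, "token": 32837, "token_str": " počítačem"},
    {"score": 0.0797, "token": 15876, "token_str": " obrázky"},
    {"score": 0.0635, "token": 5426, "token_str": " lidmi"},
    {"score": 0.0510, "token": 5468, "token_str": " dětmi"},
]


def best_fill(candidates, min_score=0.0):
    """Return the highest-scoring token string, or None if none reaches min_score."""
    kept = [c for c in candidates if c["score"] >= min_score]
    if not kept:
        return None
    top = max(kept, key=lambda c: c["score"])
    # Token strings in the output above carry a leading space; strip it for display.
    return top["token_str"].strip()


print(best_fill(predictions))
print(best_fill(predictions, min_score=0.5))
```

The `min_score` threshold is an illustrative design choice: with five candidates all scoring below 0.14, a caller may prefer to receive `None` rather than a low-confidence fill.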
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import RobertaTokenizer, RobertaModel
+ # Load the tokenizer and the encoder weights from the Hugging Face Hub
+ tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
+ model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
+ text = "Libovolný text."  # Czech for "Arbitrary text."
+ # Tokenize the input and return PyTorch tensors
+ encoded_input = tokenizer(text, return_tensors='pt')
+ # output.last_hidden_state holds one feature vector per input token
+ output = model(**encoded_input)
+ ```
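
`output.last_hidden_state` holds one vector per token. If a single vector per text is needed, one common option (an assumption here, not something this model card prescribes) is to average the token vectors while skipping padding positions. The sketch below shows that arithmetic in plain Python on toy numbers, so it runs without downloading the model; `mean_pool` and the toy values are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors whose mask entry is 1, ignoring padded positions."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in sums]


# Toy stand-in for one sequence of last_hidden_state: 3 positions, hidden size 2.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]  # the last position is padding and must not affect the mean

print(mean_pool(toy_hidden, toy_mask))  # [2.0, 3.0]
```

With real tensors, the same computation would take its inputs from `output.last_hidden_state` and `encoded_input['attention_mask']`.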
+
+ ## Training data
+
+ The model was pretrained on a mix of three text sources:
+ - Czech web pages extracted from the Common Crawl project (93GB),
+ - self-crawled Czech news dataset (20GB),
+ - Czech part of Wikipedia (1GB).
+
+
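
For a sense of scale, the three sizes listed above add up to 114GB of text. A small sketch that computes each source's share; the dictionary keys are shorthand labels chosen for this example:

```python
# Sizes in GB, as listed in the training data section above.
sources_gb = {"Common Crawl": 93, "news": 20, "Wikipedia": 1}

total_gb = sum(sources_gb.values())
shares = {name: size / total_gb for name, size in sources_gb.items()}

print(f"total: {total_gb}GB")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Web pages from Common Crawl make up roughly four fifths of the mix, with Wikipedia contributing under one percent.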
+ ## Paper
+ https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3
+
+ The preprint of our paper is available at https://arxiv.org/abs/2107.10042.
+
71
+
72
+ ## Citation
73
+ If you find this model useful, please cite our related paper:
74
+ ```
75
+ @inproceedings{FERNETC5,
76
+ title = {Comparison of Czech Transformers on Text Classification Tasks},
77
+ author = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
78
+ year = 2021,
79
+ booktitle = {Statistical Language and Speech Processing},
80
+ publisher = {Springer International Publishing},
81
+ address = {Cham},
82
+ pages = {27--37},
83
+ doi = {10.1007/978-3-030-89579-2_3},
84
+ isbn = {978-3-030-89579-2},
85
+ editor = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
86
+ }
87
+ ```