---
language: "cs"
tags:
- Czech
- KKY
- FAV
- RoBERTa
license: "cc-by-nc-sa-4.0"
---

# FERNET-C5-RoBERTa
FERNET-C5-RoBERTa (FERNET stands for **F**lexible **E**mbedding **R**epresentation **NET**work) is a monolingual Czech RoBERTa-base model pre-trained on the Czech Colossal Clean Crawled Corpus (C5).
It is the successor of the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5).
See our paper (linked below) for details.

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
>>> # "Hi, I am a language model and I am useful, e.g., for working with <mask>."
>>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")

[{'score': 0.13343162834644318,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
  'token': 33582,
  'token_str': ' textem'},
 {'score': 0.12583224475383759,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s '
              'počítačem.',
  'token': 32837,
  'token_str': ' počítačem'},
 {'score': 0.0796666219830513,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
  'token': 15876,
  'token_str': ' obrázky'},
 {'score': 0.06347835063934326,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
  'token': 5426,
  'token_str': ' lidmi'},
 {'score': 0.050984010100364685,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
  'token': 5468,
  'token_str': ' dětmi'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa')

text = "Libovolný text."  # "Arbitrary text."
encoded_input = tokenizer(text, return_tensors='pt')  # PyTorch tensors
output = model(**encoded_input)  # contextual embeddings for each token
```
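
The snippet above yields one contextual embedding per token in `output.last_hidden_state`. If you need a single fixed-size vector for the whole text, one common option (our choice here, not something prescribed by the model card) is to mean-pool the token embeddings while ignoring padding. Continuing from the snippet above:

```python
# `output.last_hidden_state` has shape (batch, seq_len, 768) for this
# RoBERTa-base model. Mean-pool over tokens, masking out padding positions.
hidden = output.last_hidden_state
mask = encoded_input['attention_mask'].unsqueeze(-1).to(hidden.dtype)
sentence_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```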

## Training data

The model was pretrained on a mix of three text sources:
- Czech web pages extracted from the Common Crawl project (93 GB),
- a self-crawled Czech news dataset (20 GB),
- the Czech part of Wikipedia (1 GB).

The model was pretrained for 500k steps (more than 15 epochs over the full dataset) with a peak learning rate of 4e-4.
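
Like other pretrained encoders, the checkpoint is intended to be fine-tuned on downstream tasks; our related paper evaluates it on Czech text classification. Below is a minimal fine-tuning sketch using the generic `transformers` Trainer API; the toy dataset, label count, and hyperparameters are illustrative placeholders, not the settings used in the paper.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
model = AutoModelForSequenceClassification.from_pretrained(
    'fav-kky/FERNET-C5-RoBERTa', num_labels=2)  # num_labels is task-specific

# Toy labeled data just to make the sketch runnable; replace with your own
# Czech dataset ("Great movie, I recommend it." = 1, "A total waste of time." = 0).
train_data = Dataset.from_dict({
    'text': ['Skvělý film, doporučuji.', 'Naprostá ztráta času.'],
    'label': [1, 0],
})
train_data = train_data.map(
    lambda batch: tokenizer(batch['text'], truncation=True,
                            padding='max_length', max_length=64),
    batched=True)

args = TrainingArguments(output_dir='fernet-c5-roberta-cls',
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=train_data).train()
```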

## Paper
https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3

The preprint of our paper is available at https://arxiv.org/abs/2107.10042.

## Citation
If you find this model useful, please cite our related paper:
```
@inproceedings{FERNETC5,
	title        = {Comparison of Czech Transformers on Text Classification Tasks},
	author       = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
	year         = 2021,
	booktitle    = {Statistical Language and Speech Processing},
	publisher    = {Springer International Publishing},
	address      = {Cham},
	pages        = {27--37},
	doi          = {10.1007/978-3-030-89579-2_3},
	isbn         = {978-3-030-89579-2},
	editor       = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
}
```