---
language: "cs"
tags:
- Czech
- KKY
- FAV
- RoBERTa
license: "cc-by-nc-sa-4.0"
---

# FERNET-C5-RoBERTa

FERNET-C5-RoBERTa (FERNET stands for **F**lexible **E**mbedding **R**epresentation **NET**work) is a monolingual Czech RoBERTa-base model pre-trained on the Czech Colossal Clean Crawled Corpus (C5).
It is the successor to the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5).
See our paper for details.

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
>>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")

[{'score': 0.13343162834644318,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
  'token': 33582,
  'token_str': ' textem'},
 {'score': 0.12583224475383759,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s '
              'počítačem.',
  'token': 32837,
  'token_str': ' počítačem'},
 {'score': 0.0796666219830513,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
  'token': 15876,
  'token_str': ' obrázky'},
 {'score': 0.06347835063934326,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
  'token': 5426,
  'token_str': ' lidmi'},
 {'score': 0.050984010100364685,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
  'token': 5468,
  'token_str': ' dětmi'}]
```
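
The pipeline returns a list of dictionaries with `score`, `sequence`, `token` and `token_str` keys, as in the output above. The following is a minimal sketch of post-processing that list. The helper name `top_candidates` and the `min_score` threshold are hypothetical, and the sample data is copied from the first three predictions shown above so that the block runs without downloading the model.

```python
def top_candidates(predictions, min_score=0.1):
    """Return (word, score) pairs whose score is at least min_score.

    `predictions` is expected to have the shape of the fill-mask pipeline
    output shown above: a list of dicts with 'score' and 'token_str' keys.
    """
    return [
        # token_str carries a leading space in the output above; strip it.
        (p["token_str"].strip(), round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]


# Sample copied from the pipeline output above (first three predictions).
sample = [
    {"score": 0.13343162834644318, "token": 33582, "token_str": " textem"},
    {"score": 0.12583224475383759, "token": 32837, "token_str": " počítačem"},
    {"score": 0.0796666219830513, "token": 15876, "token_str": " obrázky"},
]

candidates = top_candidates(sample)
print(candidates)
```

With the real pipeline, the same helper could be applied directly to the result of `unmasker(...)`.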

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel

# Load the tokenizer and the encoder without the pooling head.
tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa', add_pooling_layer=False)

text = "Libovolný text."
# Tokenize into PyTorch tensors and run the encoder.
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Per-token features are available as output.last_hidden_state.
```
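
The `output` object above exposes per-token vectors in `output.last_hidden_state`. One common way to reduce them to a single fixed-size vector per text is mask-aware mean pooling. This is an assumption about downstream use, not something this model card prescribes. The helper name `mean_pool` is hypothetical, and dummy tensors stand in for the model output so that the sketch runs without downloading the model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dim.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for a row that is all padding.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Dummy tensors standing in for output.last_hidden_state and
# encoded_input['attention_mask']; 768 is the hidden size of a RoBERTa-base model.
hidden = torch.ones(2, 4, 768)
hidden[1, 2:] = 100.0  # values at padded positions, which must be ignored
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)
```

With the real model, the call would be `mean_pool(output.last_hidden_state, encoded_input['attention_mask'])`.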

## Training data

The model was pretrained on a mix of three text sources:
- Czech web pages extracted from the Common Crawl project (93GB),
- a self-crawled Czech news dataset (20GB),
- the Czech part of Wikipedia (1GB).

The model was pretrained for 500k steps (more than 15 epochs over the full dataset) with a peak learning rate of 4e-4.
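
The figures above can be combined into a rough back-of-the-envelope estimate. The sketch below only does arithmetic on the numbers stated in this section. It assumes decimal gigabytes and treats 15 epochs as a lower bound; neither detail is specified here, so the per-step figure is illustrative rather than a reported training setting.

```python
# Corpus sizes in GB, as listed above.
sources = {"Common Crawl": 93, "news": 20, "Wikipedia": 1}

total_gb = sum(sources.values())
shares = {name: size / total_gb for name, size in sources.items()}

# Assumptions (not stated in the model card): 1 GB = 10**9 bytes, and exactly
# 15 epochs, which the text gives only as a lower bound.
steps = 500_000
epochs_lower_bound = 15
bytes_per_step = epochs_lower_bound * total_gb * 10**9 / steps

print(f"total: {total_gb} GB")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"at least ~{bytes_per_step / 10**6:.2f} MB of raw text per step")
```

The three sources sum to 114 GB, with the Common Crawl portion making up the large majority of the mix.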

## Paper
https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3

The preprint of our paper is available at https://arxiv.org/abs/2107.10042.

## Citation
If you find this model useful, please cite our related paper:
```
@inproceedings{FERNETC5,
    title        = {Comparison of Czech Transformers on Text Classification Tasks},
    author       = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
    year         = 2021,
    booktitle    = {Statistical Language and Speech Processing},
    publisher    = {Springer International Publishing},
    address      = {Cham},
    pages        = {27--37},
    doi          = {10.1007/978-3-030-89579-2_3},
    isbn         = {978-3-030-89579-2},
    editor       = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
}
```