---
language: "cs"
tags:
- Czech
- KKY
- FAV
- RoBERTa
license: "cc-by-nc-sa-4.0"
---

# FERNET-C5-RoBERTa

FERNET-C5-RoBERTa (FERNET stands for **F**lexible **E**mbedding **R**epresentation **NET**work) is a monolingual Czech RoBERTa-base model pre-trained on the Czech Colossal Clean Crawled Corpus (C5). It is a successor of the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5). See our paper for details.

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
>>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")
[{'score': 0.13343162834644318,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
  'token': 33582,
  'token_str': ' textem'},
 {'score': 0.12583224475383759,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s počítačem.',
  'token': 32837,
  'token_str': ' počítačem'},
 {'score': 0.0796666219830513,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
  'token': 15876,
  'token_str': ' obrázky'},
 {'score': 0.06347835063934326,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
  'token': 5426,
  'token_str': ' lidmi'},
 {'score': 0.050984010100364685,
  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
  'token': 5468,
  'token_str': ' dětmi'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa')

text = "Libovolný text."  # i.e. "Arbitrary text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

## Training data

The model was pretrained on a mix of three text sources:

- Czech web pages extracted from the Common Crawl project (93 GB),
- a self-crawled Czech news dataset (20 GB),
- the Czech part of Wikipedia (1 GB).

The model was pretrained for 500k steps (more than 15 epochs over the full dataset) with a peak learning rate of 4e-4. An illustrative sketch of such a masked-language-modeling setup is included at the end of this card.

## Paper

https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3

The preprint of our paper is available at https://arxiv.org/abs/2107.10042.

## Citation

If you find this model useful, please cite our related paper:

```
@inproceedings{FERNETC5,
  title = {Comparison of Czech Transformers on Text Classification Tasks},
  author = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
  year = 2021,
  booktitle = {Statistical Language and Speech Processing},
  publisher = {Springer International Publishing},
  address = {Cham},
  pages = {27--37},
  doi = {10.1007/978-3-030-89579-2_3},
  isbn = {978-3-030-89579-2},
  editor = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
}
```
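
## Pretraining sketch

The following is a minimal, illustrative sketch of a masked-language-modeling setup that matches the two hyperparameters stated in the Training data section (500k steps, peak learning rate of 4e-4). It uses the Hugging Face `Trainer`; the corpus file name, batch size, warmup, and sequence length are placeholders, and this is not the script that was actually used to pretrain FERNET-C5-RoBERTa.

```python
# Illustrative only: approximates the masked-LM pretraining setup described above.
# The corpus file, batch size, warmup, and sequence length are placeholders,
# NOT the configuration actually used to train FERNET-C5-RoBERTa.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
# Continue from the released checkpoint; a from-scratch run would instead
# initialize the model from a RobertaConfig.
model = RobertaForMaskedLM.from_pretrained('fav-kky/FERNET-C5-RoBERTa')

# Hypothetical plain-text Czech corpus, one document per line.
dataset = load_dataset('text', data_files={'train': 'czech_corpus.txt'})['train']

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=['text'])

# Dynamic masking of 15% of tokens, as in RoBERTa pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir='fernet-c5-roberta-mlm',
    max_steps=500_000,               # 500k steps, as stated on this card
    learning_rate=4e-4,              # peak learning rate, as stated on this card
    warmup_steps=10_000,             # placeholder
    per_device_train_batch_size=8,   # placeholder
    weight_decay=0.01,
    save_steps=50_000,
    logging_steps=1_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```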