fav-kky/FERNET-C5-RoBERTa


jlehecka committed 2 days ago

Commit 655d635 · verified · 1 Parent(s): a003a26

Create README.md

Files changed (1)

README.md +87 -0

README.md ADDED

	@@ -0,0 +1,87 @@

+---
+language: "cs"
+tags:
+- Czech
+- KKY
+- FAV
+- RoBERTa
+license: "cc-by-nc-sa-4.0"
+---
+# FERNET-C5-RoBERTa
+FERNET-C5-RoBERTa (FERNET stands for **F**lexible **E**mbedding **R**epresentation **NET**work) is a monolingual Czech RoBERTa-base model pre-trained on the Czech Colossal Clean Crawled Corpus (C5).
+It is the successor to the BERT model [fav-kky/FERNET-C5](https://huggingface.co/fav-kky/FERNET-C5).
+See our paper for details.
+## How to use
+You can use this model directly with a pipeline for masked language modeling:
+```python
+>>> from transformers import pipeline
+>>> unmasker = pipeline('fill-mask', model='fav-kky/FERNET-C5-RoBERTa')
+>>> unmasker("Ahoj, jsem jazykový model a hodím se třeba pro práci s <mask>.")
+[{'score': 0.13343162834644318,
+  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s textem.',
+  'token': 33582,
+  'token_str': ' textem'},
+ {'score': 0.12583224475383759,
+  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s '
+              'počítačem.',
+  'token': 32837,
+  'token_str': ' počítačem'},
+ {'score': 0.0796666219830513,
+  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s obrázky.',
+  'token': 15876,
+  'token_str': ' obrázky'},
+ {'score': 0.06347835063934326,
+  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s lidmi.',
+  'token': 5426,
+  'token_str': ' lidmi'},
+ {'score': 0.050984010100364685,
+  'sequence': 'Ahoj, jsem jazykový model a hodím se třeba pro práci s dětmi.',
+  'token': 5468,
+  'token_str': ' dětmi'}]
+```
+Here is how to use this model to get the features of a given text in PyTorch:
+```python
+from transformers import RobertaTokenizer, RobertaModel
+tokenizer = RobertaTokenizer.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
+model = RobertaModel.from_pretrained('fav-kky/FERNET-C5-RoBERTa')
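+text = "Libovolný text."  # "Arbitrary text."
+encoded_input = tokenizer(text, return_tensors='pt')  # PyTorch tensors: input_ids, attention_mask
+output = model(**encoded_input)  # output.last_hidden_state: (batch, seq_len, hidden_size)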
+```
+## Training data
+The model was pre-trained on a mix of three text sources:
+- Czech web pages extracted from the Common Crawl project (93 GB),
+- a self-crawled Czech news dataset (20 GB),
+- the Czech part of Wikipedia (1 GB).
+## Paper
+The published version of our paper is available at https://link.springer.com/chapter/10.1007/978-3-030-89579-2_3.
+The preprint of our paper is available at https://arxiv.org/abs/2107.10042.
+## Citation
+If you find this model useful, please cite our related paper:
+```bibtex
+@inproceedings{FERNETC5,
+	title        = {Comparison of Czech Transformers on Text Classification Tasks},
+	author       = {Lehe{\v{c}}ka, Jan and {\v{S}}vec, Jan},
+	year         = 2021,
+	booktitle    = {Statistical Language and Speech Processing},
+	publisher    = {Springer International Publishing},
+	address      = {Cham},
+	pages        = {27--37},
+	doi          = {10.1007/978-3-030-89579-2_3},
+	isbn         = {978-3-030-89579-2},
+	editor       = {Espinosa-Anke, Luis and Mart{\'i}n-Vide, Carlos and Spasi{\'{c}}, Irena}
+}
+```
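The feature-extraction snippet in the README stops at the raw model output, which holds one vector per token. A common next step is to reduce those token vectors to a single vector per text. The sketch below does this with attention-mask-aware mean pooling. Note that mean pooling is an assumption made here for illustration; neither the README nor the cited paper prescribes it. The helper names `mean_pool` and `embed` are hypothetical, and only the `RobertaTokenizer` and `RobertaModel` classes and the model ID are taken from the README.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp so a sequence with no real tokens cannot cause a division by zero.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(texts, model_name="fav-kky/FERNET-C5-RoBERTa"):
    """Hypothetical helper: return one mean-pooled vector per input text."""
    # Imported lazily so mean_pool can be used without transformers installed.
    from transformers import RobertaModel, RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()  # make sure dropout is disabled

    # Pad to the longest text in the batch and truncate overly long inputs.
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])
```

Calling `embed(["Libovolný text."])` downloads the model on first use and returns a tensor of shape `(len(texts), hidden_size)`. Passing the attention mask to the pooling step matters once several texts are batched together, because padded positions would otherwise dilute the average of the shorter texts.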