---
language: es
license: cc-by-4.0
tags:
- spanish
- roberta
- bertin
pipeline_tag: text-classification
widget:
- text: La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.
---

# Readability ES Sentences for two classes

A model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.
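The model can be loaded with the standard `transformers` text-classification pipeline. A minimal sketch, assuming the model id `hackathon-pln-es/readability-es-sentences` (inferred from the variant links below) and that the checkpoint is available locally or from the Hub:

```python
from transformers import pipeline

# Model id assumed from the sibling variants listed in this card.
classifier = pipeline(
    "text-classification",
    model="hackathon-pln-es/readability-es-sentences",
)

# Classify a Spanish sentence as simple or complex.
result = classifier(
    "La ciencia nos enseña a someter nuestra razón a la verdad."
)
print(result)  # a list with one dict containing 'label' and 'score'
```

The exact label strings depend on the checkpoint's configuration; inspect `classifier.model.config.id2label` if you need to map them programmatically.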

## Description and performance

This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model performs binary classification between the following two classes:
- Simple.
- Complex.

It achieves an F1 macro-average score of 0.8923, measured on the validation set.
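Macro-averaged F1 is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many examples each has. A minimal illustration with toy labels (the helper below is for illustration only, not part of the training code):

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never matched.
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

# Toy example: 0 = simple, 1 = complex
print(macro_f1([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # 0.5833...
```

Here class 0 scores F1 = 0.5 and class 1 scores F1 ≈ 0.667, giving a macro average of ≈ 0.583.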

## Model variants

- `readability-es-sentences` (this model). Two classes, sentence-based dataset.
- [`readability-es-paragraphs`](https://huggingface.co/hackathon-pln-es/readability-es-paragraphs). Two classes, paragraph-based dataset.
- [`readability-es-3class-sentences`](https://huggingface.co/hackathon-pln-es/readability-es-3class-sentences). Three classes, sentence-based dataset.
- [`readability-es-3class-paragraphs`](https://huggingface.co/hackathon-pln-es/readability-es-3class-paragraphs). Three classes, paragraph-based dataset.

## Datasets

- [`readability-es-hackathon-pln-public`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-hackathon-pln-public), composed of:
  * coh-metrix-esp corpus.
  * Various text resources scraped from websites.
- Other non-public datasets: newsela-es, simplext.

## Training details

Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/3rgvwps0/overview) for full details on hyperparameters and the training regime.

## Biases and Limitations

- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
- One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating previous datasets and may therefore contain somewhat unnatural phrasing.
- Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
- The model's language may be biased towards the Spanish dialect spoken in Spain; other regional variants might be underrepresented.
- No effort has been made to alleviate the shortcomings and biases described in the [original implementation of BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish#bias-examples-spanish).

## Authors

- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)