---
thumbnail: "An open multilingual readability scoring model TRank"
base_model: "Peltarion/xlm-roberta-longformer-base-4096"
tags:
- arxiv:2406.01835
- Readability
- Multilingual
- Wikipedia
license: mit
language:
- yi
- xh
- fy
- cy
- vi
- uz
- ug
- ur
- uk
- tr
- th
- te
- ta
- sv
- sw
- su
- es
- so
- sl
- sk
- si
- sd
- sr
- gd
- sa
- ru
- ro
- pa
- pt
- pl
- fa
- ps
- om
- or
- 'no'
- ne
- mn
- mr
- ml
- ms
- mg
- mk
- lt
- lv
- la
- lo
- ky
- ku
- ko
- km
- kk
- kn
- jv
- ja
- it
- ga
- id
- is
- hu
- hi
- he
- ha
- gu
- el
- de
- ka
- gl
- fr
- fi
- tl
- et
- eo
- en
- nl
- da
- cs
- hr
- zh
- ca
- my
- bg
- br
- bs
- bn
- be
- eu
- az
- as
- hy
- ar
- am
- af
- sq
pipeline_tag: text-classification
---

# Open Multilingual Text Readability Scoring Model (TRank)

[![DOI:10.48550/arXiv.2406.01835](https://zenodo.org/badge/DOI/10.48550/arXiv.2406.01835.svg)](https://doi.org/10.48550/arXiv.2406.01835)
[![Readability Experiments repo](https://img.shields.io/badge/GitLab-repo-orange)](https://gitlab.wikimedia.org/repos/research/readability-experiments)

## Overview

This repository contains TRank, an open multilingual readability scoring model presented in the ACL'24 paper **An Open Multilingual System for Scoring Readability of Wikipedia**.
The model scores the readability of text in multiple languages.

## Features

- **Multilingual Support**: Evaluates readability in multiple languages.
- **Pairwise Ranking**: Trained with a Siamese architecture and Margin Ranking Loss to rank texts from hardest to simplest.
- **Long Context Window**: Uses the Longformer architecture of the base model, which supports inputs of up to 4096 tokens.
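
To make the pairwise ranking objective concrete, here is a minimal pure-Python sketch of the hinge-style formula that PyTorch documents for `torch.nn.MarginRankingLoss`. The function name, the margin of 0.5, the example scores, and the convention that the harder text should receive the higher score are illustrative assumptions, not values taken from the TRank training code.

```python
def margin_ranking_loss(score_a, score_b, target, margin=0.5):
    """Pairwise hinge loss: max(0, -target * (score_a - score_b) + margin).

    target = +1 means score_a should be ranked above score_b,
    target = -1 means the opposite. The margin value is an assumption.
    """
    return max(0.0, -target * (score_a - score_b) + margin)


# Hypothetical scores for a (harder, simpler) pair of texts on the same topic.
hard_score, simple_score = 1.8, 0.4

# Correct order with a gap larger than the margin: no penalty.
well_separated = margin_ranking_loss(hard_score, simple_score, target=1)

# Same pair scored in the wrong order: penalised by the gap plus the margin.
wrong_order = margin_ranking_loss(simple_score, hard_score, target=1)

print(well_separated, wrong_order)
```

In a Siamese setup, both texts of a pair would be scored by the same network with shared weights, so minimising this loss pushes the two scores apart in the desired order.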

## Model Training

The model training implementation can be found in the [Readability Experiments repo](https://gitlab.wikimedia.org/repos/research/readability-experiments).

## Usage example
```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from transformers import AutoModel, AutoTokenizer

# Define the model:
BASE_MODEL = "Peltarion/xlm-roberta-longformer-base-4096"


class ReadabilityModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model_name=BASE_MODEL):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.drop = nn.Dropout(p=0.2)
        # Map the 768-dimensional pooled representation to a single score
        self.fc = nn.Linear(768, 1)

    def forward(self, ids, mask):
        out = self.model(input_ids=ids, attention_mask=mask,
                         output_hidden_states=False)
        # out[1] is the pooled output of the encoder
        out = self.drop(out[1])
        outputs = self.fc(out)
        return outputs


# Load the model:
model = ReadabilityModel.from_pretrained("trokhymovych/TRank_readability")

# Load the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("trokhymovych/TRank_readability")

# Set the model to evaluation mode
model.eval()

# Example input text
input_text = "This is an example sentence to evaluate readability."

# Tokenize the input text (inputs longer than max_length are truncated)
inputs = tokenizer.encode_plus(
    input_text,
    add_special_tokens=True,
    max_length=512,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)
ids = inputs['input_ids']
mask = inputs['attention_mask']

# Make prediction
with torch.no_grad():
    outputs = model(ids, mask)
    readability_score = outputs.item()

# Print the input text and the readability score
print(f"Input Text: {input_text}")
print(f"Readability Score: {readability_score}")
```
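
Because the model is trained to rank texts, one natural use is to order several texts by their scores. The sketch below assumes a hypothetical `score_fn` callable that wraps the tokenization and prediction steps above and returns a float. `rank_by_score` and `dummy_score` are illustrative names, and `dummy_score` is only a stand-in so the snippet runs without downloading the model. Which end of the scale corresponds to harder text is not stated in this README, so check the paper before interpreting the order.

```python
from typing import Callable, List, Tuple


def rank_by_score(texts: List[str],
                  score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Return (text, score) pairs sorted from highest to lowest score."""
    scored = [(text, score_fn(text)) for text in texts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def dummy_score(text: str) -> float:
    """Stand-in scorer (average word length); NOT the TRank model."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


texts = [
    "The cat sat on the mat.",
    "Photosynthesis converts electromagnetic radiation into chemical energy.",
]
for text, score in rank_by_score(texts, dummy_score):
    print(f"{score:.2f}  {text}")
```

To use the real model, `dummy_score` could be replaced by a function that tokenizes one text, runs `model(ids, mask)` under `torch.no_grad()`, and returns `outputs.item()`.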

## Citation
Preprint:
```bibtex
@misc{trokhymovych2024openmultilingualscoringreadability,
      title={An Open Multilingual System for Scoring Readability of Wikipedia},
      author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
      year={2024},
      eprint={2406.01835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.01835},
}
```