---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
---
|
# Overview |
|
![Scoris logo](https://scoris.lt/logo_smaller.png) |
|
This is an English-Lithuanian translation model based on [Helsinki-NLP/opus-mt-tc-big-en-lt](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-lt).
|
|
|
For Lithuanian-English translation, see the companion model [scoris/opus-mt-tc-big-lt-en-scoris-finetuned](https://huggingface.co/scoris/opus-mt-tc-big-lt-en-scoris-finetuned).
|
|
|
|
|
Fine-tuned on a large merged dataset, [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs).
|
|
|
Trained for 3 epochs.
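
If you want to reproduce a similar fine-tune, a minimal sketch using `Seq2SeqTrainer` is shown below. The hyperparameters, split names, and dataset column names (`en`, `lt`) are illustrative assumptions, not the actual training configuration.

```python
# Illustrative fine-tuning sketch only: the hyperparameters below are assumptions,
# not the configuration actually used to train this model.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-tc-big-en-lt"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

# Split and column names ("train"/"validation", "en"/"lt") are assumed.
dataset = load_dataset("scoris/en-lt-merged-data")

def preprocess(batch):
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["lt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-tc-big-en-lt-scoris-finetuned",
    num_train_epochs=3,              # matches the 3 epochs stated above
    per_device_train_batch_size=16,  # assumed
    learning_rate=2e-5,              # assumed
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```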
|
|
|
Made by the [Scoris](https://scoris.lt) team.
|
|
|
# Evaluation
|
Tested on the scoris/en-lt-merged-data validation set. Metric: sacrebleu.
|
|
|
| Model | Test set | BLEU | Gen Len |
|-------|----------|------|---------|
| scoris/opus-mt-tc-big-en-lt-scoris-finetuned | scoris/en-lt-merged-data (validation) | 41.0262 | 17.4491 |
| Helsinki-NLP/opus-mt-tc-big-en-lt | scoris/en-lt-merged-data (validation) | 34.2768 | 17.6664 |
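
For reference, a score like this can be computed with the `sacrebleu` metric from the `evaluate` library. The sketch below is illustrative only; the column names (`en`, `lt`) are assumptions about the dataset layout, and it scores a small sample rather than the full validation set.

```python
# Illustrative evaluation sketch: scores a small sample, not the full validation split,
# and assumes the dataset exposes "en" and "lt" text columns.
import evaluate
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer

model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

dataset = load_dataset("scoris/en-lt-merged-data", split="validation")
sample = dataset.select(range(100))  # small sample for a quick check

sources = [row["en"] for row in sample]
references = [[row["lt"]] for row in sample]  # sacrebleu expects a list of reference lists

inputs = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs)
predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

sacrebleu = evaluate.load("sacrebleu")
print(sacrebleu.compute(predictions=predictions, references=references)["score"])
```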
|
|
|
According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:
|
|
|
| BLEU Score | Interpretation |
|------------|----------------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
|
|
|
# Usage |
|
You can use the model in the following way: |
|
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on the Hugging Face Model Hub
model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there were three bears, who lived together in a house of their own in a wood.",
    "One of them was a little, small wee bear; one was a middle-sized bear, and the other was a great, huge bear.",
    "One day, after they had made porridge for their breakfast, they walked out into the wood while the porridge was cooling.",
    "And while they were walking, a little girl came into the house.",
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Result:
# Kažkada buvo trys lokiai, kurie gyveno kartu savame name miške.
# Vienas iš jų buvo mažas, mažas lokys; vienas buvo vidutinio dydžio lokys, o kitas buvo didelis, didžiulis lokys.
# Vieną dieną, pagaminę košės pusryčiams, jie išėjo į mišką, kol košė vėso.
# Jiems einant, į namus atėjo maža mergaitė.
```
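
The model should also work through the high-level `pipeline` helper; a minimal sketch:

```python
from transformers import pipeline

# Minimal sketch using the generic translation pipeline with this checkpoint.
translator = pipeline("translation", model="scoris/opus-mt-tc-big-en-lt-scoris-finetuned")
print(translator("Once upon a time there were three bears.")[0]["translation_text"])
```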