scoris-mt-en-lt / README.md
metadata
license: cc-by-2.5
language:
  - lt
  - en
datasets:
  - scoris/en-lt-merged-data
metrics:
  - sacrebleu

Overview

This is an English-Lithuanian translation model (Seq2Seq). For Lithuanian-English translation, see the companion model scoris-mt-lt-en.

Original model: Helsinki-NLP/opus-mt-tc-big-en-lt

Fine-tuned on a large merged dataset: scoris/en-lt-merged-data (5.4 million sentence pairs)

Trained for 6 epochs.

Made by Scoris team

Evaluation:

| EN-LT | BLEU |
|---|---|
| scoris/scoris-mt-en-lt | 41.9 |
| Helsinki-NLP/opus-mt-tc-big-en-lt | 34.3 |
| Google Translate | 30.8 |
| DeepL | 32.3 |

Evaluated on the scoris/en-lt-merged-data validation set. Google Translate and DeepL were evaluated on a random sample of 1,000 sentence pairs.
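
The evaluation described above can be approximated with a short script. This is only a sketch: the model card does not state how the 1,000-pair sample was drawn, so the seed, the `sample_pairs` helper, and the `corpus_bleu_score` wrapper are illustrative assumptions. The BLEU call relies on the third-party `sacrebleu` package, which is imported lazily so the sampling helper works without it.

```python
import random


def sample_pairs(pairs, k=1000, seed=42):
    """Draw up to k (source, reference) pairs without replacement.

    The fixed seed is an assumption made for reproducibility; the model
    card does not say how its random sample was selected.
    """
    pairs = list(pairs)
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def corpus_bleu_score(hypotheses, references):
    """Corpus-level BLEU for one reference per hypothesis.

    Requires `pip install sacrebleu`; imported here so that the rest of
    the sketch runs without the package installed.
    """
    import sacrebleu

    # sacrebleu expects a list of reference streams, hence the extra list.
    return sacrebleu.corpus_bleu(list(hypotheses), [list(references)]).score
```

To score a system, translate the English side of the sampled pairs and pass the translations together with the Lithuanian references to `corpus_bleu_score`.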

According to Google, BLEU scores can be interpreted as follows:

| BLEU Score | Interpretation |
|---|---|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| 40 - 50 | High quality translations |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
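
As a convenience, the interpretation bands can be turned into a small lookup. This is a minimal sketch, not part of the model or of any library; `interpret_bleu` is a hypothetical helper. The bands overlap at 40 and 60, so the sketch assumes a score that falls exactly on a boundary belongs to the higher band.

```python
from bisect import bisect_right

# Lower bounds of every band except the first, taken from the table above.
_BOUNDS = [10, 20, 30, 40, 50, 60]
_LABELS = [
    "Almost useless",
    "Hard to get the gist",
    "The gist is clear, but has significant grammatical errors",
    "Understandable to good translations",
    "High quality translations",
    "Very high quality, adequate, and fluent translations",
    "Quality often better than human",
]


def interpret_bleu(score: float) -> str:
    """Map a BLEU score (0-100 scale) to its interpretation band.

    Assumption: boundary values (10, 20, ..., 60) go to the higher band.
    """
    if not 0 <= score <= 100:
        raise ValueError("BLEU score must be between 0 and 100")
    return _LABELS[bisect_right(_BOUNDS, score)]


print(interpret_bleu(41.9))  # score reported for scoris/scoris-mt-en-lt
```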

Usage

You can use the model in the following way:

from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on Hugging Face Model Hub
model_name = "scoris/scoris-mt-en-lt"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there were three bears, who lived together in a house of their own in a wood.",
    "One of them was a little, small wee bear; one was a middle-sized bear, and the other was a great, huge bear.",
    "One day, after they had made porridge for their breakfast, they walked out into the wood while the porridge was cooling.",
    "And while they were walking, a little girl came into the house. "
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Result:
# Kažkada buvo trys lokiai, kurie gyveno kartu savame name miške.
# Vienas iš jų buvo mažas, mažas lokys; vienas buvo vidutinio dydžio lokys, o kitas buvo didelis, didžiulis lokys.
# Vieną dieną, pagaminę košės pusryčiams, jie išėjo į mišką, kol košė vėso.
# Jiems einant, į namus atėjo maža mergaitė.
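
For longer inputs it can help to translate in fixed-size batches rather than passing every sentence to `generate` at once. The following is a minimal sketch under that assumption; `translate_in_batches` and its `translate_batch` callback are hypothetical names, and the batch size of 16 is an arbitrary default rather than a recommendation from the model authors.

```python
from typing import Callable, List, Sequence


def translate_in_batches(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Apply translate_batch to consecutive slices and keep input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[str] = []
    for start in range(0, len(sentences), batch_size):
        batch = list(sentences[start:start + batch_size])
        results.extend(translate_batch(batch))
    return results


# One possible callback, reusing the tokenizer and model loaded above:
#
# def translate_batch(batch):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
#     outputs = model.generate(**inputs)
#     return tokenizer.batch_decode(outputs, skip_special_tokens=True)
#
# translations = translate_in_batches(src_text, translate_batch, batch_size=2)
```

Keeping the model call behind a callback means the batching logic can be checked with a stand-in function before the model is loaded.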