---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
---
|
# Overview |
|
![Scoris logo](https://scoris.lt/logo_smaller.png) |
|
This is an English-Lithuanian translation model based on [Helsinki-NLP/opus-mt-tc-big-en-lt](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-lt).
|
|
|
For Lithuanian-English translation, see the companion model [scoris/opus-mt-tc-big-lt-en-scoris-finetuned](https://huggingface.co/scoris/opus-mt-tc-big-lt-en-scoris-finetuned).
|
|
|
|
|
Fine-tuned on a large merged dataset, [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs).
|
|
|
Trained for 3 epochs.
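
If you want to reproduce a similar fine-tune, a minimal sketch using `Seq2SeqTrainer` is shown below. The hyperparameters, split names, and dataset column names (`en`, `lt`) are illustrative assumptions, not the actual training configuration.

```python
# Illustrative fine-tuning sketch only: the hyperparameters below are assumptions,
# not the configuration actually used to train this model.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-tc-big-en-lt"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

# Split and column names ("train"/"validation", "en"/"lt") are assumed.
dataset = load_dataset("scoris/en-lt-merged-data")

def preprocess(batch):
    model_inputs = tokenizer(batch["en"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["lt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-tc-big-en-lt-scoris-finetuned",
    num_train_epochs=3,              # matches the 3 epochs stated above
    per_device_train_batch_size=16,  # assumed
    learning_rate=2e-5,              # assumed
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized.get("validation"),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```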
|
|
|
Made by the [Scoris](https://scoris.lt) team.
|
|
|
# Evaluation
|
Tested on the scoris/en-lt-merged-data validation set. Metric: sacrebleu.
|
|
|
| Model | Test set | BLEU | Gen Len |
|-------|----------|------|---------|
| scoris/opus-mt-tc-big-en-lt-scoris-finetuned | scoris/en-lt-merged-data (validation) | 41.0262 | 17.4491 |
| Helsinki-NLP/opus-mt-tc-big-en-lt | scoris/en-lt-merged-data (validation) | 34.2768 | 17.6664 |
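
For reference, a score like this can be computed with the `sacrebleu` metric from the `evaluate` library. The sketch below is illustrative only; the column names (`en`, `lt`) are assumptions about the dataset layout, and it scores a small sample rather than the full validation set.

```python
# Illustrative evaluation sketch: scores a small sample, not the full validation split,
# and assumes the dataset exposes "en" and "lt" text columns.
import evaluate
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer

model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

dataset = load_dataset("scoris/en-lt-merged-data", split="validation")
sample = dataset.select(range(100))  # small sample for a quick check

sources = [row["en"] for row in sample]
references = [[row["lt"]] for row in sample]  # sacrebleu expects a list of reference lists

inputs = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs)
predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)

sacrebleu = evaluate.load("sacrebleu")
print(sacrebleu.compute(predictions=predictions, references=references)["score"])
```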
|
|
|
According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:
|
|
|
| BLEU Score | Interpretation |
|------------|----------------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
|
|
|
# Usage |
|
You can use the model in the following way: |
|
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on the Hugging Face Model Hub
model_name = "scoris/opus-mt-tc-big-en-lt-scoris-finetuned"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Once upon a time there were three bears, who lived together in a house of their own in a wood.",
    "One of them was a little, small wee bear; one was a middle-sized bear, and the other was a great, huge bear.",
    "One day, after they had made porridge for their breakfast, they walked out into the wood while the porridge was cooling.",
    "And while they were walking, a little girl came into the house.",
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Result:
# Kažkada buvo trys lokiai, kurie gyveno kartu savame name miške.
# Vienas iš jų buvo mažas, mažas lokys; vienas buvo vidutinio dydžio lokys, o kitas buvo didelis, didžiulis lokys.
# Vieną dieną, pagaminę košės pusryčiams, jie išėjo į mišką, kol košė vėso.
# Jiems einant, į namus atėjo maža mergaitė.
```
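
The model should also work through the high-level `pipeline` helper; a minimal sketch:

```python
from transformers import pipeline

# Minimal sketch using the generic translation pipeline with this checkpoint.
translator = pipeline("translation", model="scoris/opus-mt-tc-big-en-lt-scoris-finetuned")
print(translator("Once upon a time there were three bears.")[0]["translation_text"])
```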