opus-mt-tc-bible-big-bnt-deu_eng_fra_por_spa

Model Details

Neural machine translation model for translating from Bantu languages (bnt) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained using the Marian NMT framework, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines use the procedures of OPUS-MT-train.

Model Description:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (where id is a valid target language ID), e.g. >>deu<<.
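The prefix can be added programmatically. The following is a minimal sketch, not part of the OPUS-MT or transformers API: the helper name `with_target_token`, the `TARGET_LANGS` set, and the sample sentence are illustrative assumptions, and the set of IDs is simply read off the model name.

```python
# Target language IDs read off the model name; assumed to be the valid IDs.
TARGET_LANGS = {"deu", "eng", "fra", "por", "spa"}


def with_target_token(text: str, lang: str) -> str:
    """Prefix `text` with the sentence-initial target language token."""
    if lang not in TARGET_LANGS:
        raise ValueError(f"unsupported target language ID: {lang!r}")
    return f">>{lang}<< {text.strip()}"


# Prepare one source sentence for every target language.
source = "Habari ya asubuhi"  # hypothetical Swahili input, for illustration only
batch = [with_target_token(source, lang) for lang in sorted(TARGET_LANGS)]
for line in batch:
    print(line)
```

The resulting list of strings can be passed to the tokenizer in place of hand-written inputs. Rejecting unknown IDs early is a design choice of this sketch, so that a typo in the ID is caught before the text reaches the model.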

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with the target language token (>>id<<).
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-bnt-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-bnt-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))

Training

Evaluation

langpair testset chr-F BLEU #sent #words
run-deu tatoeba-test-v2021-08-07 0.43836 26.1 1752 10562
run-eng tatoeba-test-v2021-08-07 0.54089 39.4 1703 10041
run-fra tatoeba-test-v2021-08-07 0.46240 26.1 1274 7479
run-spa tatoeba-test-v2021-08-07 0.46496 25.8 963 5167
swa-eng tatoeba-test-v2021-08-07 0.59947 45.9 387 2508
lin-eng flores101-devtest 0.40858 16.9 1012 24721
nso-eng flores101-devtest 0.49866 26.5 1012 24721
sna-fra flores101-devtest 0.40134 14.3 1012 28343
swh-deu flores101-devtest 0.43073 14.2 1012 25094
zul-fra flores101-devtest 0.43723 17.4 1012 28343
zul-por flores101-devtest 0.41886 15.9 1012 26519
bem-eng flores200-devtest 0.42350 18.1 1012 24721
kin-eng flores200-devtest 0.46183 21.9 1012 24721
kin-fra flores200-devtest 0.40139 14.7 1012 28343
lin-eng flores200-devtest 0.42073 18.1 1012 24721
nso-eng flores200-devtest 0.51453 28.4 1012 24721
nso-fra flores200-devtest 0.41065 16.1 1012 28343
nya-eng flores200-devtest 0.44398 20.2 1012 24721
run-eng flores200-devtest 0.42987 18.9 1012 24721
sna-eng flores200-devtest 0.45917 21.1 1012 24721
sna-fra flores200-devtest 0.41153 15.2 1012 28343
sot-eng flores200-devtest 0.51854 26.9 1012 24721
sot-fra flores200-devtest 0.41340 15.8 1012 28343
ssw-eng flores200-devtest 0.44925 20.7 1012 24721
swh-deu flores200-devtest 0.44937 15.6 1012 25094
swh-eng flores200-devtest 0.60107 37.0 1012 24721
swh-fra flores200-devtest 0.50257 23.5 1012 28343
swh-por flores200-devtest 0.49475 22.8 1012 26519
swh-spa flores200-devtest 0.42866 15.3 1012 29199
tsn-eng flores200-devtest 0.45365 19.9 1012 24721
tso-eng flores200-devtest 0.46882 22.8 1012 24721
xho-eng flores200-devtest 0.52500 28.8 1012 24721
xho-fra flores200-devtest 0.44642 18.7 1012 28343
xho-por flores200-devtest 0.42517 16.8 1012 26519
zul-eng flores200-devtest 0.53428 29.5 1012 24721
zul-fra flores200-devtest 0.45383 19.0 1012 28343
zul-por flores200-devtest 0.43537 17.4 1012 26519
bem-eng ntrex128 0.43168 19.1 1997 47673
kin-eng ntrex128 0.46996 20.8 1997 47673
kin-fra ntrex128 0.40765 14.7 1997 53481
kin-spa ntrex128 0.41552 15.9 1997 54107
nde-eng ntrex128 0.42744 17.1 1997 47673
nso-eng ntrex128 0.47231 21.5 1997 47673
nso-spa ntrex128 0.40135 15.2 1997 54107
nya-eng ntrex128 0.47072 23.3 1997 47673
nya-spa ntrex128 0.41006 16.2 1997 54107
ssw-eng ntrex128 0.48682 23.5 1997 47673
ssw-spa ntrex128 0.40839 15.9 1997 54107
swa-deu ntrex128 0.43880 14.1 1997 48761
swa-eng ntrex128 0.58527 35.4 1997 47673
swa-fra ntrex128 0.47344 19.7 1997 53481
swa-por ntrex128 0.46292 19.1 1997 51631
swa-spa ntrex128 0.48780 22.9 1997 54107
tsn-eng ntrex128 0.50413 25.3 1997 47673
tsn-fra ntrex128 0.41912 15.8 1997 53481
tsn-por ntrex128 0.41090 15.3 1997 51631
tsn-spa ntrex128 0.42979 17.7 1997 54107
ven-eng ntrex128 0.43364 18.4 1997 47673
xho-eng ntrex128 0.50778 26.5 1997 47673
xho-fra ntrex128 0.41066 15.1 1997 53481
xho-spa ntrex128 0.42129 16.7 1997 54107
zul-eng ntrex128 0.50361 26.9 1997 47673
zul-fra ntrex128 0.40779 15.0 1997 53481
zul-spa ntrex128 0.41836 16.9 1997 54107
kin-eng tico19-test 0.42280 18.8 2100 56323
lin-eng tico19-test 0.41495 18.4 2100 56323
lug-eng tico19-test 0.43948 22.2 2100 56323
swa-eng tico19-test 0.58126 34.5 2100 56315
swa-fra tico19-test 0.46470 20.5 2100 64661
swa-por tico19-test 0.49374 22.8 2100 62729
swa-spa tico19-test 0.50214 24.5 2100 66563
zul-eng tico19-test 0.55678 32.2 2100 56804
zul-fra tico19-test 0.43797 18.6 2100 64661
zul-por tico19-test 0.45560 19.9 2100 62729
zul-spa tico19-test 0.46505 21.4 2100 66563
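For readers who want to work with these scores programmatically, here is a minimal sketch that parses rows in the whitespace-separated layout of the table above and groups BLEU by target language. The function names `parse_rows` and `mean_bleu_by_target` are illustrative, and the five rows are copied from the table. The unweighted mean across test sets is only a rough summary, not a score reported by the authors.

```python
from collections import defaultdict

# A few rows copied from the table above.
# Columns: langpair testset chr-F BLEU #sent #words
ROWS = """\
swa-eng tatoeba-test-v2021-08-07 0.59947 45.9 387 2508
swh-eng flores200-devtest 0.60107 37.0 1012 24721
swh-fra flores200-devtest 0.50257 23.5 1012 28343
swa-eng ntrex128 0.58527 35.4 1997 47673
swa-fra ntrex128 0.47344 19.7 1997 53481
"""


def parse_rows(text):
    """Yield one dict per non-empty table row."""
    for line in text.splitlines():
        if not line.strip():
            continue
        pair, testset, chrf, bleu, n_sent, n_words = line.split()
        src, tgt = pair.split("-")
        yield {
            "src": src,
            "tgt": tgt,
            "testset": testset,
            "chrf": float(chrf),
            "bleu": float(bleu),
            "sentences": int(n_sent),
            "words": int(n_words),
        }


def mean_bleu_by_target(rows):
    """Return the unweighted mean BLEU per target language."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["tgt"]].append(row["bleu"])
    return {tgt: sum(vals) / len(vals) for tgt, vals in scores.items()}


records = list(parse_rows(ROWS))
for tgt, bleu in sorted(mean_bleu_by_target(records).items()):
    count = sum(r["tgt"] == tgt for r in records)
    print(f"{tgt}: mean BLEU {bleu:.1f} over {count} test sets")
```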

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  volume={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: a0ea3b3
  • port time: Mon Oct 7 21:44:56 EEST 2024
  • port machine: LM0-400-22516.local