NMT for Finno-Ugric Languages

This is a neural machine translation (NMT) system for translating between Võro, Livonian, North Sami and South Sami, as well as Estonian, Finnish, Latvian and English. It was created by fine-tuning Facebook's m2m100-418M on the liv4ever and smugri datasets.

Tokenizer

Four language codes were added to the tokenizer: liv, vro, sma and sme. The m2m100 tokenizer currently builds its language mappings from a hard-coded list, so the mappings have to be updated after the tokenizer is loaded, as shown in the usage example.
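The idea behind that update can be illustrated with plain dictionaries, without loading the tokenizer. This is only a sketch: the function name merge_language_tokens and the token ids are made up for illustration, and it assumes language tokens have the form __xx__, as the replace("_", "") calls in the usage example suggest.

```python
def merge_language_tokens(builtin_token_to_id, added_token_to_id):
    """Merge built-in and added language tokens and derive the lookup tables.

    Hypothetical helper: it mimics the dictionary updates applied to the
    tokenizer, but operates on plain dicts.
    """
    # Added tokens extend the built-in ones.
    token_to_id = {**builtin_token_to_id, **added_token_to_id}
    # Reverse lookup: id -> token.
    id_to_token = {idx: token for token, idx in token_to_id.items()}
    # "__liv__" -> "liv": stripping underscores yields the bare language code.
    code_to_token = {token.replace("_", ""): token for token in token_to_id}
    code_to_id = {token.replace("_", ""): idx for token, idx in token_to_id.items()}
    return token_to_id, id_to_token, code_to_token, code_to_id


# Made-up ids for illustration; the real ids come from the tokenizer.
builtin = {"__en__": 10, "__et__": 11}
added = {"__liv__": 20, "__vro__": 21, "__sma__": 22, "__sme__": 23}

token_to_id, id_to_token, code_to_token, code_to_id = merge_language_tokens(builtin, added)
print(code_to_token["liv"], code_to_id["sme"])
```

After the merge, both the original and the newly added codes resolve through the same tables, which is what tokenizer.src_lang and tokenizer.get_lang_id rely on.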

Usage example

Install the transformers and sentencepiece libraries: pip install sentencepiece transformers


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Fix the language codes in the tokenizer: merge the added tokens
# (liv, vro, sma, sme) into the mappings built from the hard-coded list
tokenizer.id_to_lang_token = dict(list(tokenizer.id_to_lang_token.items()) + list(tokenizer.added_tokens_decoder.items()))
tokenizer.lang_token_to_id = dict(list(tokenizer.lang_token_to_id.items()) + list(tokenizer.added_tokens_encoder.items()))
tokenizer.lang_code_to_token = {k.replace("_", ""): k for k in tokenizer.additional_special_tokens}
tokenizer.lang_code_to_id = {k.replace("_", ""): v for k, v in tokenizer.lang_token_to_id.items()}

model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Set the source language (Livonian) before encoding the input
tokenizer.src_lang = "liv"
encoded_src = tokenizer("Līvõ kēļ jelāb!", return_tensors="pt")

# Force the first generated token to be the target language code (North Sami)
encoded_out = model.generate(**encoded_src, forced_bos_token_id=tokenizer.get_lang_id("sme"))
print(tokenizer.batch_decode(encoded_out, skip_special_tokens=True))

The output is Livčča giella eallá.
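For repeated use, the steps above can be wrapped in a small helper. This is a minimal sketch: the function name translate is hypothetical, and it assumes the tokenizer has already had its language codes fixed as in the example above.

```python
def translate(text, src_lang, tgt_lang, tokenizer, model):
    """Translate one sentence; hypothetical wrapper around the steps above.

    Assumes `tokenizer` and `model` were loaded as in the usage example
    and that the tokenizer's language mappings have been fixed.
    """
    # The source language must be set before the text is encoded.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Forcing the first generated token selects the target language.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
    )
    # batch_decode returns a list with one string per input sentence.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

With the objects from the example, the Livonian sentence would be translated into North Sami with translate("Līvõ kēļ jelāb!", "liv", "sme", tokenizer, model).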
