language:
  - multilingual
  - en
  - de
  - nl
  - sv
  - da
  - af
  - fr
  - es
  - it
  - pt
  - ro
  - ru
  - cs
  - pl
  - bg
  - uk
  - id
  - jv
  - ms
  - tl
  - ja
  - zh
  - ko
  - vi
license: mit
pipeline_tag: translation

MITRE 466M

Description

MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation tasks.
The underlying technique, registering, is introduced in our paper.
This repository allows you to use our pre-trained model for inference. If you want to reproduce the data mining and training, please refer to this repository.

The model supports direct translation across 552 directions for 24 languages spanning 5 language families.
You can use our models directly via the transformers library.
An alternative version of MITRE with 913M parameters is also available in this repository.
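
The figure of 552 directions follows from counting every ordered (source, target) pair of distinct languages among the 24 supported ones, i.e. 24 × 23. The sketch below only illustrates that arithmetic, using the language codes from this card's metadata; the variable names are illustrative and not part of the MITRE code.

```python
from itertools import permutations

# The 24 language codes listed in this model card's metadata.
LANGS = [
    "en", "de", "nl", "sv", "da", "af",
    "fr", "es", "it", "pt", "ro",
    "ru", "cs", "pl", "bg", "uk",
    "id", "jv", "ms", "tl",
    "ja", "zh", "ko", "vi",
]

# Each ordered (source, target) pair with source != target is one direction.
directions = list(permutations(LANGS, 2))

print(len(LANGS), len(directions))  # 24 552
```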

Usage

Before loading the tokenizer, first run pip install sentencepiece.
You can then load the tokenizer and the model with:

from transformers import AutoModel, AutoTokenizer

# you can switch the name to "naist-nlp/mitre_913m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_466m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_466m", trust_remote_code=True)

To use this model locally and inspect the code, clone this repository and then run:

from mitre_466m.tokenization_mitre import MitreTokenizer
from mitre_466m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_466m")
model = MitreForConditionalGeneration.from_pretrained("mitre_466m")

Once the model and the tokenizer are loaded, you can translate:

english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.half() # recommended
model.cuda() # the inputs below are moved to the GPU, so the model must be on the GPU too
model.eval()

# Translate one or several sentences into a single 'target_language'
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translate several sentences, each into the corresponding language in 'target_languages_list'
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results (the 'de' line is produced only by the multi-target call commented out above)
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training
# 1. The difference between tgt_tokens and labels is that the eos_tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' has pads.
# 3. You can refer to our code for detailed implementation.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)

Notes

We largely follow the style of M2M, but we make some necessary improvements to reduce the cost of generation.
See the code of 'generate()' in modeling_mitre.py for more details.
We also plan to implement FlashAttention V2 to further boost our models; this will be added as soon as possible.

Languages covered

Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
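
Because the tokenizer calls shown under usage take language codes as plain strings, it can help to check a code against the list above before encoding. The following is a minimal sketch under that assumption: the groups and codes are copied from this card, while LANGUAGE_GROUPS, group_of, and validate_targets are hypothetical names that do not exist in the MITRE code.

```python
# Groups and codes copied from the "Languages covered" list above.
# All names in this block are hypothetical helpers, not part of MITRE.
LANGUAGE_GROUPS = {
    "Germanic": ["en", "de", "nl", "sv", "da", "af"],
    "Romance": ["fr", "es", "it", "pt", "ro"],
    "Slavic": ["ru", "cs", "pl", "bg", "uk"],
    "Malayo-Polynesian": ["id", "ms", "jv", "tl"],
    "Asian*": ["zh", "ja", "ko", "vi"],
}

# Reverse lookup: language code -> group name.
CODE_TO_GROUP = {
    code: group for group, codes in LANGUAGE_GROUPS.items() for code in codes
}


def group_of(code):
    """Return the group a language code is listed under, or raise ValueError."""
    try:
        return CODE_TO_GROUP[code]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None


def validate_targets(targets):
    """Return the target codes unchanged if every one of them is listed."""
    for code in targets:
        group_of(code)  # raises ValueError for codes missing from the list
    return list(targets)


print(group_of("zh"))
print(validate_targets(["de", "zh"]))
```

The validated list could then be passed as target_languages_list, so a mistyped code is reported before any tokens are encoded.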

BibTeX entry and citation info

@misc{qu2025registeringsourcetokenstarget,
      title={Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation}, 
      author={Zhi Qu and Yiran Wang and Jiannan Mao and Chenchen Ding and Hideki Tanaka and Masao Utiyama and Taro Watanabe},
      year={2025},
      eprint={2501.02979},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.02979}, 
}