---
language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi
license: mit
pipeline_tag: translation
---

# MITRE 466M

## Description

MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation tasks.
The underlying technique, registering, is introduced in our [paper](https://arxiv.org/abs/2501.02979).
This repository lets you use our pre-trained model for inference. To reproduce the data mining and training, please refer to this [repository](https://github.com/zhiqu22/mitre).

The model supports direct translation across 552 directions for 24 languages spanning 5 language families.
You can use our models directly via the `transformers` library.
An alternative version of MITRE with 913M parameters is also available in this [repository](https://huggingface.co/naist-nlp/mitre_913m).

## Usages

Before loading the tokenizer, install `sentencepiece` by running `pip install sentencepiece`.

You can then load the tokenizer and the model with:

```python
from transformers import AutoModel, AutoTokenizer

# you can switch the name to "naist-nlp/mitre_913m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_466m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_466m", trust_remote_code=True)
```

To use the model locally and inspect the code, clone this repository and load the classes directly:

```python
from mitre_466m.tokenization_mitre import MitreTokenizer
from mitre_466m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_466m")
model = MitreForConditionalGeneration.from_pretrained("mitre_466m")
```

Once the model and the tokenizer are loaded, you can translate:

```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"

model.half()  # recommended
model.eval()
model.cuda()  # the inputs are moved to the GPU below, so the model must be there too

# Translating from one or several sentences to a 'target_language'
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translating from one or several sentences to given languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training
# 1. The difference between tgt_tokens and labels is that the eos_tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' has pads.
# 3. You can refer to our code for detailed implementation.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```

## Notes

We largely follow the style of [M2M](https://huggingface.co/facebook/m2m100_418M), but we make some necessary improvements to reduce the cost of generation.
For details, see the code of `generate()` in [modeling_mitre.py](https://huggingface.co/naist-nlp/mitre_466m/blob/main/modeling_mitre.py).
We also plan to implement FlashAttention V2 to further speed up our models, and will update this repository as soon as possible.
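The translation steps from the Usages section can be wrapped in a small helper. This is only a minimal sketch and not part of this repository: the function name `translate` and its `device` parameter are our own illustration. It assumes that `tokenizer` and `model` are the objects loaded above and that the encoded input is a `torch.Tensor` (as the `.cuda()` call in the example suggests), so that `.to(device)` is available.

```python
from typing import List


def translate(texts: List[str], target_language: str, tokenizer, model, device: str = "cuda") -> List[str]:
    """Translate a batch of sentences into a single target language.

    Hypothetical helper: `tokenizer` and `model` are the MITRE objects
    loaded in the Usages section; `device` should match the model's device.
    """
    # Encode the sentences together with the target language code.
    src_tokens = tokenizer.encode_source_tokens_to_input_ids(texts, target_language=target_language)
    # The input ids must live on the same device as the model.
    generated_tokens = model.generate(src_tokens.to(device))
    # Decode back to plain text, dropping special tokens.
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
```

For example, `translate(["I have a red apple."], "zh", tokenizer, model)` runs the same calls as the snippet in the Usages section. Passing `device="cpu"` only makes sense if the model itself is kept on the CPU.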

## Languages covered

- Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
- Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
- Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
- Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
- Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)

## BibTeX entry and citation info

```
@misc{qu2025registeringsourcetokenstarget,
      title={Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation},
      author={Zhi Qu and Yiran Wang and Jiannan Mao and Chenchen Ding and Hideki Tanaka and Masao Utiyama and Taro Watanabe},
      year={2025},
      eprint={2501.02979},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.02979},
}
```