---
language:
- multilingual
- en
- de
- nl
- sv
- da
- af
- fr
- es
- it
- pt
- ro
- ru
- cs
- pl
- bg
- uk
- id
- jv
- ms
- tl
- ja
- zh
- ko
- vi
license: mit
pipeline_tag: translation
---

# MITRE 466M
## Description

MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation tasks.
The underlying technique, registering, is introduced in our [paper](https://arxiv.org/abs/2501.02979).
This repository lets you run inference with our pre-trained model. To reproduce the data mining and training, please refer to this [repository](https://github.com/zhiqu22/mitre).

The model supports direct translation across 552 directions (all 24 × 23 ordered pairs) among 24 languages spanning five language families.
You can use our models directly through the `transformers` library.
An alternative version of MITRE with 913M parameters is also available in this [repository](https://huggingface.co/naist-nlp/mitre_913m).
## Usage

Before loading the tokenizer, run `pip install sentencepiece`.
You can load the tokenizer and the model as follows:
```python
from transformers import AutoModel, AutoTokenizer

# you can switch the name to "naist-nlp/mitre_913m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_466m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_466m", trust_remote_code=True)
```
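The translation example further down calls `src_tokens.cuda()`, so it assumes a GPU. If you want device selection to be explicit, you can do the following first (plain PyTorch, nothing MITRE-specific):

```python
import torch

# Pick a device; the generation example below assumes CUDA is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```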
To use the model locally and inspect the code, you can clone this repository, then:
```python
from mitre_466m.tokenization_mitre import MitreTokenizer
from mitre_466m.modeling_mitre import MitreForConditionalGeneration

tokenizer = MitreTokenizer.from_pretrained("mitre_466m")
model = MitreForConditionalGeneration.from_pretrained("mitre_466m")
```
Once you have the tokenizer and model objects, you can translate:
```python
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"

model.half()  # recommended
model.eval()

# Translating one or several sentences into a single target language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translating one or several sentences into given target languages
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])

generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。

# For training:
# 1. The difference between tgt_tokens and labels is that the eos tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
#    because 'tokenizer.encode_target_tokens_to_input_ids' includes padding.
# 3. You can refer to our code for the detailed implementation.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
```
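To make the training comments above concrete, here is a minimal sketch of preparing one source/target pair and computing a loss. Only the tokenizer calls are documented on this card; the forward call is an assumption that `MitreForConditionalGeneration` follows the usual `transformers` seq2seq convention (`decoder_input_ids`, `labels`), so check `modeling_mitre.py` before relying on it.

```python
model.float()  # the half() call above is recommended for inference; train in full precision
model.train()

src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)  # padded decoder inputs
labels = tokenizer.encode_target_tokens_to_labels(chinese_text)         # eos moved to the right side

# Assumed forward signature (standard transformers seq2seq style); verify in modeling_mitre.py.
outputs = model(input_ids=src_tokens, decoder_input_ids=tgt_tokens, labels=labels)
outputs.loss.backward()
```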
## Notes

We largely follow the style of [M2M](https://huggingface.co/facebook/m2m100_418M); however, we make some necessary improvements to reduce the cost of generation.
See the implementation of `generate()` in [modeling_mitre.py](https://huggingface.co/naist-nlp/mitre_466m/blob/main/modeling_mitre.py) for details.
Moreover, we plan to implement FlashAttention 2 to further speed up our models; this will be added as soon as possible.
## Languages covered

Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)

Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)

Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)

Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)

Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
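If you need these codes programmatically (for example, to validate a `target_language` value before encoding), the list above can be restated as a plain mapping. This is a convenience sketch for this card, not a constant exported by the package:

```python
# Language codes supported by MITRE, grouped by family as listed above.
MITRE_LANGUAGES = {
    "Germanic": ["en", "de", "nl", "sv", "da", "af"],
    "Romance": ["fr", "es", "it", "pt", "ro"],
    "Slavic": ["ru", "cs", "pl", "bg", "uk"],
    "Malayo-Polynesian": ["id", "ms", "jv", "tl"],
    "Asian": ["zh", "ja", "ko", "vi"],
}
ALL_CODES = {code for codes in MITRE_LANGUAGES.values() for code in codes}
assert len(ALL_CODES) == 24  # 24 * 23 = 552 translation directions
```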
## BibTeX entry and citation info

```
@misc{qu2025registeringsourcetokenstarget,
      title={Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation},
      author={Zhi Qu and Yiran Wang and Jiannan Mao and Chenchen Ding and Hideki Tanaka and Masao Utiyama and Taro Watanabe},
      year={2025},
      eprint={2501.02979},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.02979},
}
```