---
license: gpl-3.0
language:
- de
- en
metrics:
- bleu
pipeline_tag: translation
base_model:
- google-t5/t5-base
widget:
- text: §7Du hast den Nachtsicht Modus §aaktiviert
  output: 
    text: §7You §aenabled §7the night-vision mode
library_name: transformers
tags:
- minecraft
- translation
- minimessage
---
## Model Card: Foorcee/t5-minecraft-de-en-base

### Model Overview
The `t5-minecraft-de-en-base` model is a fine-tuned version of the `google-t5/t5-base` model, specifically designed for translating styled Minecraft messages between German and English. It supports Minecraft's legacy color codes and MiniMessage format, ensuring the preservation of text styling, placeholders, and formatting during translation.

### Key Features
- **Bidirectional Translation:** Supports translations between German and English.
- **Color Code Preservation:** Recognizes and maintains Minecraft legacy color codes (`§0` to `§f`, `§k`, `§l`, etc.) during translation.
- **MiniMessage Support:** Handles MiniMessage tags such as `<red>`, the counterpart of the legacy code `§c`.
- **Placeholder Recognition:** Handles placeholders such as `{{count}}` or `{0}`.
- **Optimized for Styled Text:** Retains the semantic and stylistic relationships between text and associated colors or effects during language translation.
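
The correspondence between named MiniMessage color tags and legacy codes can be sketched as a small lookup table. The helper `minimessage_to_legacy` below is hypothetical and purely illustrative: the model card does not say that inputs must be converted, and the sketch only covers the sixteen named colors, leaving every other tag (including `<newline>`) untouched.

```python
import re

# Named MiniMessage color tags and their legacy color-code equivalents.
MINIMESSAGE_TO_LEGACY = {
    "black": "§0", "dark_blue": "§1", "dark_green": "§2", "dark_aqua": "§3",
    "dark_red": "§4", "dark_purple": "§5", "gold": "§6", "gray": "§7",
    "dark_gray": "§8", "blue": "§9", "green": "§a", "aqua": "§b",
    "red": "§c", "light_purple": "§d", "yellow": "§e", "white": "§f",
}

# Matches simple opening tags such as <red> or <dark_gray>.
_TAG = re.compile(r"<([a-z_]+)>")


def minimessage_to_legacy(text: str) -> str:
    """Replace known color tags with legacy codes; keep unknown tags as-is."""
    def replace(match: re.Match) -> str:
        return MINIMESSAGE_TO_LEGACY.get(match.group(1), match.group(0))

    return _TAG.sub(replace, text)


print(minimessage_to_legacy("<gray>Du hast den Nachtsicht Modus <green>aktiviert"))
```

Decorations, closing tags, and hex colors are deliberately out of scope here; a real converter would need the MiniMessage library itself.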

### Technical Details
- **Base Model:** [google-t5/t5-base](https://huggingface.co/google-t5/t5-base)
- **Model type:** Language model
- **Language(s) (NLP):** English, German
- **Training:** Fine-tuned over 3 epochs with the following configuration:
  - Learning rate: `3e-4`
  - Batch size: `4`
  - Maximum generation length: `256`
  - BF16 precision: `True`
- **Special Tokens Added:**
  - Legacy color codes: `§0` to `§f`, `§k`, `§l`, `§m`, `§n`, `§o`, `§r`, `§x`, and `§#`
  - MiniMessage and placeholder symbols: `<`, `{`, `}`, `<newline>`
  - German-specific tokens: `Ä`, `Ö`

### Background
Minecraft uses a JSON structure to define styled text with attributes like colors, bold effects, or underlining. This structure, while functional, is not human-readable. In practice, legacy color codes (`§` followed by a hex digit or a formatting letter) are commonly used for text styling. The model ensures that these codes or MiniMessage tags are carried over correctly alongside the translated text, preserving their semantic and visual meaning.
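
To make the contrast concrete, the following minimal sketch turns a legacy-coded string into a list of JSON text components. The function `legacy_to_components` is a hypothetical helper, not part of the model or of Minecraft; it handles only the sixteen color codes and ignores formatting codes, `§r`, and component inheritance.

```python
import json
import re

# Legacy hex digit -> color name used by the raw JSON text format.
_NAMES = (
    "black dark_blue dark_green dark_aqua dark_red dark_purple gold gray "
    "dark_gray blue green aqua red light_purple yellow white"
).split()
LEGACY_TO_NAME = dict(zip("0123456789abcdef", _NAMES))


def legacy_to_components(text: str) -> list:
    """Split on color codes and emit one component per colored run of text."""
    # With a capturing group, re.split alternates: text, code, text, code, ...
    parts = re.split(r"§([0-9a-f])", text)
    components = []
    if parts[0]:
        components.append({"text": parts[0]})  # uncolored leading text
    for code, chunk in zip(parts[1::2], parts[2::2]):
        if chunk:
            components.append({"text": chunk, "color": LEGACY_TO_NAME[code]})
    return components


components = legacy_to_components("§7Du hast den Nachtsicht Modus §aaktiviert")
print(json.dumps(components, ensure_ascii=False, indent=2))
```

The legacy string fits on one line, while the equivalent JSON needs one object per colored run, which is why the compact codes are the more practical input for translation.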

### Problem Description
Translating styled Minecraft messages poses unique challenges:
- Color codes are tied to specific words, and translations often change sentence structures.
- The model must correctly reassociate colors or effects to words at their new positions after translation.
- Example:
  - **German Input:** `<gray>Du hast den Nachtsicht Modus <green>aktiviert`
  - **English Output:** `<gray>You <green>enabled <gray>the night-vision mode`
  - The color association must shift as words change positions.
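
The shift in color association can be made visible by splitting each string into `(code, text)` segments. The helper `split_segments` is a hypothetical illustration that tracks only the most recent code, so stacked codes such as a color followed by bold are not modeled.

```python
import re

# A legacy code is "§" followed by exactly one character.
_CODE = re.compile(r"(§.)")


def split_segments(text: str) -> list:
    """Return (active_code, text) pairs in reading order."""
    segments = []
    active = ""  # no code seen yet
    for part in _CODE.split(text):
        if not part:
            continue
        if _CODE.fullmatch(part):
            active = part  # remember the code for the text that follows
        else:
            segments.append((active, part))
    return segments


german = split_segments("§7Du hast den Nachtsicht Modus §aaktiviert")
english = split_segments("§7You §aenabled §7the night-vision mode")
print(german)
print(english)
```

In German the green segment is the final word, so two segments suffice. In English the verb moves forward, and the gray color has to be restored after it, which yields three segments.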

### Usage

<details open>
  <summary>Generate a translation</summary>

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('Foorcee/t5-minecraft-de-en-base')
tokenizer = AutoTokenizer.from_pretrained('Foorcee/t5-minecraft-de-en-base')

# Each input text must start with the task prefix
input_texts = ['translate German to English: §7Du hast den Nachtsicht Modus §aaktiviert']

# Tokenize the input texts
input_tokenized = tokenizer(input_texts, max_length=256, padding=True, truncation=True, return_tensors='pt')

# Generate the translation
outputs = model.generate(input_ids=input_tokenized["input_ids"], attention_mask=input_tokenized["attention_mask"], max_length=256)

# Decode the first (and only) sequence in the batch
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # §7You §aenabled §7the night-vision mode
```
</details>

<details>
  <summary>Complete a sentence</summary>

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('Foorcee/t5-minecraft-de-en-base')
tokenizer = AutoTokenizer.from_pretrained('Foorcee/t5-minecraft-de-en-base')

# Each input text must start with the task prefix
input_texts = ['translate German to English: §7Du hast den Nachtsicht Modus §aaktiviert']

# Beginning of the translation that the model should continue
output_context = ['§7You have']

# Tokenize the input texts and the translation prefix (no EOS token on the prefix)
input_tokenized = tokenizer(input_texts, max_length=256, padding=True, truncation=True, return_tensors='pt')
output = tokenizer(output_context, return_tensors="pt", add_special_tokens=False)

# Pass the prefix as decoder input so generation continues from it
outputs = model.generate(input_ids=input_tokenized["input_ids"],
                         attention_mask=input_tokenized["attention_mask"],
                         decoder_input_ids=output["input_ids"],
                         max_length=256)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # §7You have §aenabled §7the night-vision mode
```
</details>

<details>
  <summary>Compute loss</summary>

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('Foorcee/t5-minecraft-de-en-base')
tokenizer = AutoTokenizer.from_pretrained('Foorcee/t5-minecraft-de-en-base')

# Each input text must start with the task prefix
input_texts = ['translate German to English: §7Du hast den Nachtsicht Modus §aaktiviert']

# Reference translation used as the label
output_context = ['§7You have §aenabled §7the night-vision mode']

# Tokenize the input texts and the reference translation
# (the name `input_tokenized` avoids shadowing the built-in `input`)
input_tokenized = tokenizer(input_texts, max_length=256, padding=True, truncation=True, return_tensors='pt')
output = tokenizer(output_context, return_tensors="pt", add_special_tokens=False)

# Passing labels makes the model return the cross-entropy loss
loss = model(input_ids=input_tokenized["input_ids"],
             attention_mask=input_tokenized["attention_mask"],
             labels=output["input_ids"]).loss
print(loss)
```
</details>

### Recommendations for Use
- Preprocess input by converting all color codes to lowercase (e.g., `§C` → `§c`).
- Replace newline characters with the `<newline>` special token for consistency.
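
A minimal sketch of both preprocessing steps follows. The function name `preprocess` is hypothetical, and normalizing Windows line endings before the replacement is an added assumption rather than something the model card prescribes.

```python
import re

# "§" followed by a hex digit or a formatting letter, in either case.
_LEGACY_CODE = re.compile(r"§[0-9A-Fa-fK-Ok-oRrXx#]")


def preprocess(text: str) -> str:
    """Lowercase legacy codes and replace newlines with the <newline> token."""
    # Only the two-character code is lowercased; the message text is untouched.
    text = _LEGACY_CODE.sub(lambda m: m.group(0).lower(), text)
    # Assumption: treat "\r\n" like "\n" so each line break maps to one token.
    return text.replace("\r\n", "\n").replace("\n", "<newline>")


print(preprocess("§CAchtung!\n§LServer startet neu"))
```

The result can then be prefixed with the task description and passed to the tokenizer as in the usage examples.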

### Supported Tasks
1. **German to English Translation:** Translates styled Minecraft text from German to English.
2. **English to German Translation:** Translates styled Minecraft text from English to German.
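
Selecting the direction comes down to choosing the task prefix. In the sketch below, the German-to-English prefix is the one used in the usage examples; the English-to-German prefix is an assumption based on the usual T5 convention and is not confirmed by this model card. The helper `build_input` is hypothetical.

```python
# Task prefixes keyed by (source, target) language codes.
PREFIXES = {
    ("de", "en"): "translate German to English: ",
    # Assumption: mirrors the standard T5 prefix; verify against the model.
    ("en", "de"): "translate English to German: ",
}


def build_input(text: str, source: str, target: str) -> str:
    """Prepend the task prefix for the requested translation direction."""
    try:
        prefix = PREFIXES[(source, target)]
    except KeyError:
        raise ValueError(f"Unsupported direction: {source} -> {target}") from None
    return prefix + text


print(build_input("§7Du hast den Nachtsicht Modus §aaktiviert", "de", "en"))
```

Raising on unknown pairs keeps unsupported languages from silently reaching the model.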

### Evaluation
- **Metrics:** BLEU score was used for evaluation.
- **Training Loss:** `0.7215`
- **Evaluation Loss:** `0.5136`
- **Evaluation BLEU Score:** `0.7229`

### Limitations
- The model is fine-tuned for Minecraft-specific messages and may not generalize well to non-Minecraft-related translations.
- The model currently supports translation only between German and English; other languages are not handled.

### Additional Information
- **Minecraft Raw JSON Text Format**: [Documentation](https://minecraft.wiki/w/Raw_JSON_text_format)  
- **Minecraft Formatting Codes**: [Formatting Codes Documentation](https://minecraft.fandom.com/wiki/Formatting_codes)  
- **MiniMessage Format**: [MiniMessage Documentation](https://docs.advntr.dev/minimessage/index.html), [MiniMessage Web-UI](https://webui.advntr.dev/)