Foorcee committed on
Commit 0a4048b · verified · 1 Parent(s): 36fd54b

Update README.md

  1. README.md +158 -3
README.md CHANGED
@@ -1,3 +1,158 @@
- ---
- license: gpl-3.0
- ---
+ ---
+ license: gpl-3.0
+ language:
+ - de
+ - en
+ metrics:
+ - bleu
+ pipeline_tag: translation
+ base_model:
+ - google-t5/t5-base
+ widget:
+ - text: §7Du hast den Nachtsicht Modus §aaktiviert
+   output:
+     text: §7You §aenabled §7the night-vision mode
+ library_name: transformers
+ tags:
+ - minecraft
+ - translation
+ - minimessage
+ ---
+ ## Model Card: Foorcee/t5-minecraft-de-en-base
+
+ ### Model Overview
+ `t5-minecraft-de-en-base` is a fine-tuned version of `google-t5/t5-base` for translating styled Minecraft messages between German and English. It supports Minecraft's legacy color codes and the MiniMessage format, and it is trained to preserve text styling, placeholders, and formatting during translation.
+
+ ### Key Features
+ - **Bidirectional Translation:** Translates from German to English and from English to German.
+ - **Color Code Preservation:** Recognizes and maintains Minecraft legacy color codes (`§0` to `§f`, `§k`, `§l`, etc.) during translation.
+ - **MiniMessage Support:** Processes MiniMessage tags like `<red>` and maps them to the corresponding color codes (e.g., `<red>` → `§c`).
+ - **Placeholder Recognition:** Handles placeholders such as `{{count}}` or `{0}`.
+ - **Optimized for Styled Text:** Retains the association between words and their colors or effects when translation changes the word order.
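The `<red>` → `§c` correspondence mentioned above follows the standard Minecraft color palette. The sketch below shows one way to normalize MiniMessage color tags to legacy codes before calling the model. The helper name `minimessage_to_legacy` is hypothetical, and whether such pre-conversion is needed is an assumption: the model card does not say that input must be converted by hand.

```python
import re

# Named MiniMessage colors and their legacy codes (standard Minecraft palette).
NAMED_COLORS = {
    "black": "§0", "dark_blue": "§1", "dark_green": "§2", "dark_aqua": "§3",
    "dark_red": "§4", "dark_purple": "§5", "gold": "§6", "gray": "§7",
    "dark_gray": "§8", "blue": "§9", "green": "§a", "aqua": "§b",
    "red": "§c", "light_purple": "§d", "yellow": "§e", "white": "§f",
}

# Matches simple tags such as <red> or <newline>.
_TAG = re.compile(r"<([a-z_]+)>")


def minimessage_to_legacy(text: str) -> str:
    """Replace known color tags with legacy codes; leave all other tags untouched."""
    return _TAG.sub(lambda m: NAMED_COLORS.get(m.group(1), m.group(0)), text)


print(minimessage_to_legacy("<gray>Du hast den Nachtsicht Modus <green>aktiviert"))
# §7Du hast den Nachtsicht Modus §aaktiviert
```

Tags that are not in the table, such as `<newline>`, pass through unchanged, so the helper can be combined with other preprocessing steps.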
+
+ ### Technical Details
+ - **Base Model:** [google-t5/t5-base](https://huggingface.co/google-t5/t5-base)
+ - **Model type:** Encoder-decoder (sequence-to-sequence) language model
+ - **Language(s) (NLP):** English, German
+ - **Training:** Fine-tuned for 3 epochs with the following configuration:
+   - Learning rate: `3e-4`
+   - Batch size: `4`
+   - Maximum generation length: `256`
+   - BF16 precision: `True`
+ - **Special Tokens Added:**
+   - Legacy color codes: `§0` to `§f`, `§k`, `§l`, `§m`, `§n`, `§o`, `§r`, `§x`, and `§#`
+   - MiniMessage and placeholder symbols: `<`, `{`, `}`, `<newline>`
+   - German-specific tokens: `Ä`, `Ö`
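For readers who want to reproduce a similar setup on another base checkpoint, the token list above could be registered roughly as follows. This is a minimal sketch, not the author's training script: `build_added_tokens` and `register_tokens` are hypothetical helpers, and the only library calls assumed are `add_tokens` on the tokenizer and `resize_token_embeddings` on the model. The published tokenizer already contains these tokens, so this step is not needed for inference.

```python
def build_added_tokens() -> list[str]:
    """Return the tokens listed in the model card, in a fixed order."""
    colors = [f"§{c}" for c in "0123456789abcdef"]   # §0 to §f
    formats = [f"§{c}" for c in "klmnorx#"]          # §k, §l, §m, §n, §o, §r, §x, §#
    symbols = ["<", "{", "}", "<newline>"]           # MiniMessage and placeholder symbols
    german = ["Ä", "Ö"]                              # German-specific tokens
    return colors + formats + symbols + german


def register_tokens(tokenizer, model) -> int:
    """Add the tokens to the tokenizer and resize the model's embedding matrix.

    Returns the number of tokens that were actually added.
    """
    added = tokenizer.add_tokens(build_added_tokens())
    model.resize_token_embeddings(len(tokenizer))
    return added


print(len(build_added_tokens()))  # 30
```

Passing the tokenizer and model in as arguments keeps the sketch independent of any particular checkpoint.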
+
+ ### Background
+ Minecraft uses a JSON structure to define styled text with attributes like colors, bold effects, or underlining. This structure is functional but cumbersome to read and write by hand. In practice, legacy formatting codes (`§` followed by a hex digit or another code character) are commonly used for text styling. The model is trained to carry these codes or MiniMessage tags through the translation so that their semantic and visual meaning is preserved.
+
+ ### Problem Description
+ Translating styled Minecraft messages poses unique challenges:
+ - Color codes are tied to specific words, and translations often change sentence structures.
+ - The model must correctly reassociate colors or effects with words at their new positions after translation.
+ - Example:
+   - **German Input:** `<gray>Du hast den Nachtsicht Modus <green>aktiviert`
+   - **English Output:** `<gray>You <green>enabled the <gray>night-vision mode`
+   - The color association must shift as words change positions.
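To make the reassociation problem concrete, the sketch below splits a legacy-formatted string into runs, each labeled with the codes that directly precede it. The helper `split_runs` is hypothetical and deliberately simplified: it does not model how Minecraft carries formatting across runs. Comparing the runs of a source sentence with those of its translation shows how the styled word moves.

```python
import re

# A legacy formatting code: "§" followed by one code character.
_CODE = re.compile(r"§[0-9a-fk-orx#]")


def split_runs(text: str) -> list[tuple[str, str]]:
    """Split a string into (leading_codes, text) runs."""
    runs = []
    codes, pos = "", 0
    for match in _CODE.finditer(text):
        if match.start() > pos:
            # Text found since the last code: close the current run.
            runs.append((codes, text[pos:match.start()]))
            codes = ""
        codes += match.group(0)  # consecutive codes (e.g. "§a§l") accumulate
        pos = match.end()
    if pos < len(text):
        runs.append((codes, text[pos:]))
    return runs


print(split_runs("§7Du hast den Nachtsicht Modus §aaktiviert"))
# [('§7', 'Du hast den Nachtsicht Modus '), ('§a', 'aktiviert')]
print(split_runs("§7You §aenabled §7the night-vision mode"))
# [('§7', 'You '), ('§a', 'enabled '), ('§7', 'the night-vision mode')]
```

The German sentence has two runs with the green word at the end, while the English sentence needs three runs because the green word sits in the middle.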
+
+ ### Usage
+
+ <details open>
+ <summary>Generate a translation</summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ model = AutoModelForSeq2SeqLM.from_pretrained('Foorcee/t5-minecraft-de-en-base')
+ tokenizer = AutoTokenizer.from_pretrained('Foorcee/t5-minecraft-de-en-base')
+
+ # Each input text must start with the task description
+ input_texts = ['translate German to English: §7Du hast den Nachtsicht Modus §aaktiviert']
+
+ # Tokenize the input texts
+ input_tokenized = tokenizer(input_texts, max_length=256, padding=True, truncation=True, return_tensors='pt')
+
+ # Generate the translation
+ outputs = model.generate(input_ids=input_tokenized["input_ids"], attention_mask=input_tokenized["attention_mask"], max_length=256)
+
+ decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(decoded)  # §7You §aenabled §7the night-vision mode
+ ```
+ </details>
+
+ <details>
+ <summary>Complete a sentence</summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ model = AutoModelForSeq2SeqLM.from_pretrained('Foorcee/t5-minecraft-de-en-base')
+ tokenizer = AutoTokenizer.from_pretrained('Foorcee/t5-minecraft-de-en-base')
+
+ # Each input text must start with the task description
+ input_texts = ['translate German to English: §7Du hast den Nachtsicht Modus §aaktiviert']
+
+ # Beginning of the desired output; the model continues from here
+ output_context = ['§7You have']
+
+ # Tokenize the input texts and the output prefix (no end-of-sequence token on the prefix)
+ input_tokenized = tokenizer(input_texts, max_length=256, padding=True, truncation=True, return_tensors='pt')
+ output = tokenizer(output_context, return_tensors="pt", add_special_tokens=False)
+
+ outputs = model.generate(input_ids=input_tokenized["input_ids"],
+                          attention_mask=input_tokenized["attention_mask"],
+                          decoder_input_ids=output["input_ids"],
+                          max_length=256)
+
+ decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(decoded)  # §7You have §aenabled §7the night-vision mode
+ ```
+ </details>
+
+ <details>
+ <summary>Compute loss</summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ model = AutoModelForSeq2SeqLM.from_pretrained('Foorcee/t5-minecraft-de-en-base')
+ tokenizer = AutoTokenizer.from_pretrained('Foorcee/t5-minecraft-de-en-base')
+
+ # Each input text must start with the task description
+ input_texts = ['translate German to English: §7Du hast den Nachtsicht Modus §aaktiviert']
+
+ # Reference translation used as the label
+ output_context = ['§7You have §aenabled §7the night-vision mode']
+
+ # Tokenize the input texts and the reference translation
+ inputs = tokenizer(input_texts, max_length=256, padding=True, truncation=True, return_tensors='pt')
+ output = tokenizer(output_context, return_tensors="pt", add_special_tokens=False)
+
+ # Passing labels makes the model return the cross-entropy loss
+ loss = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], labels=output["input_ids"]).loss
+ print(loss)
+ ```
+ </details>
+
+ ### Recommendations for Use
+ - Preprocess input by converting all color codes to lowercase (e.g., `§C` → `§c`).
+ - Replace newline characters with the `<newline>` special token for consistency.
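The two recommendations above can be applied with a small helper before tokenization. This is a minimal sketch under stated assumptions: the names `preprocess` and `postprocess` are hypothetical, and restoring real newlines after generation is an assumed convenience that the model card does not describe.

```python
import re

# Uppercase variants of the legacy codes: §A to §F, §K to §O, §R, §X.
_UPPER_CODE = re.compile(r"§[A-FK-ORX]")


def preprocess(text: str) -> str:
    """Lowercase legacy codes and replace newlines with the <newline> token."""
    text = _UPPER_CODE.sub(lambda m: m.group(0).lower(), text)
    return text.replace("\r\n", "\n").replace("\n", "<newline>")


def postprocess(text: str) -> str:
    """Turn <newline> tokens in a translation back into real newlines (assumption)."""
    return text.replace("<newline>", "\n")


print(preprocess("§CHello\n§AWorld"))  # §cHello<newline>§aWorld
```

Only the character directly after `§` is lowercased, so the message text itself keeps its capitalization.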
+
+ ### Supported Tasks
+ 1. **German to English Translation:** Translates styled Minecraft text from German to English.
+ 2. **English to German Translation:** Translates styled Minecraft text from English to German.
+
+ ### Evaluation
+ - **Metrics:** BLEU score was used for evaluation.
+ - **Training Loss:** `0.7215`
+ - **Evaluation Loss:** `0.5136`
+ - **Evaluation BLEU Score:** `0.7229`
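The reported BLEU value is on a 0 to 1 scale. The model card does not say which BLEU implementation or tokenization produced it, so the sketch below is only an illustration of the metric: a plain corpus-level BLEU over whitespace tokens with one reference per hypothesis and no smoothing. The function names are hypothetical, and scores from this sketch are not expected to match the number above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 1) on whitespace tokens, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "§7You §aenabled §7the night-vision mode"
print(corpus_bleu([reference], [reference]))  # 1.0
```

Because the tokens are split on whitespace, a color code attached to a word is part of that token, so a misplaced code lowers the score just like a wrong word.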
+
+ ### Limitations
+ - The model is fine-tuned for Minecraft-specific messages and may not generalize well to non-Minecraft-related translations.
+ - The model currently supports translation only between German and English; other languages are not supported.
+
+ ### Additional Information
+ - **Minecraft Raw JSON Text Format**: [Documentation](https://minecraft.wiki/w/Raw_JSON_text_format)
+ - **Minecraft Formatting Codes**: [Formatting Codes Documentation](https://minecraft.fandom.com/wiki/Formatting_codes)
+ - **MiniMessage Format**: [MiniMessage Documentation](https://docs.advntr.dev/minimessage/index.html), [MiniMessage Web-UI](https://webui.advntr.dev/)