---
license: cc-by-nc-sa-4.0
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- asset
language:
- en
- de
- es
- ar
- ja
- ko
- zh
metrics:
- bleu
- rouge
- sari
- accuracy
library_name: transformers
---

# Model Card for mEdIT-xxl

The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.

**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning

**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

## Model Details

### Model Description

- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-13b-lora`

### Model Sources

- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** TBA

## How to use

### Instruction format

Adhering to the following instruction format is essential; deviating from it may cause the model to produce degraded output.


```python
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",                 # <-- Paraphrasing
    ...
]
```

The full list of possible instruction, input, and output tokens and task descriptions can be found in the Appendix of our paper.

```python
prompt_template = """### <instruction_token>:\n<task description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```

Note that the tokens and the task description need not be in the language of the input.
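To make the template concrete, here is a minimal sketch of assembling a prompt from these pieces; the `build_prompt` helper is our illustration, not part of the released code:

```python
def build_prompt(instruction_token: str, task_description: str,
                 input_token: str, text: str, output_token: str) -> str:
    # Fill prompt_template: the task description goes under the instruction
    # header, the source text under the input header, and the output section
    # is left empty for the model to complete.
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{text}\n"
        f"### {output_token}:\n\n"
    )

# English GEC prompt using English tokens and an English task description
prompt = build_prompt(
    "Instruction", "Fix grammatical errors in this sentence",
    "Input", "I has small cat ,", "Output",
)
```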

### Run the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC (Japanese tokens: 命什 = Instruction, ε…₯εŠ› = Input, ε‡ΊεŠ› = Output;
# task description: "Make the sentence grammatical")
prompt = '### 命什:\nζ–‡η« γ‚’ζ–‡ζ³•ηš„γ«γ™γ‚‹\n### ε…₯εŠ›:\nI has small cat ,\n### ε‡ΊεŠ›:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')

outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# --> I have a small cat ,

# German GEC (same Japanese tokens and task description; German input)

prompt = '### 命什:\nζ–‡η« γ‚’ζ–‡ζ³•ηš„γ«γ™γ‚‹\n### ε…₯εŠ›:\nIch haben eines kleines Katze ,\n### ε‡ΊεŠ›:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> Ich habe eine kleine Katze ,
```
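Since `medit-xxl` is fine-tuned from a 13B-parameter model, full-precision CPU inference is slow and memory-hungry. Below is a minimal sketch of loading the model in half precision across available GPUs; it assumes a CUDA device and the `accelerate` package, neither of which the example above requires:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in float16 and let accelerate place them on available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = '### Instruction:\nFix grammatical errors in this sentence\n### Input:\nI has small cat ,\n### Output:\n\n'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```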