---
license: cc-by-nc-sa-4.0
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- asset
language:
- en
- de
- es
- ar
- ja
- ko
- zh
metrics:
- bleu
- rouge
- sari
- accuracy
library_name: transformers
---
# Model Card for mEdIT-xxl
The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.
**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning
**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar
## Model Details
### Model Description
- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-13b-lora`
### Model Sources
- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** TBA
## How to use
### Instruction format
Adhering to the following instruction format is essential; prompts that deviate from it are likely to produce less-than-ideal results.
```python
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",  # <-- Paraphrasing
    ...
]

prompt_template = """### <instruction_token>:\n<task description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```

The entire list of possible instruction, input, and output tokens, as well as task descriptions, can be found in the Appendix of our paper.

Note that the tokens and the task description need not be in the language of the input.
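For concreteness, here is a minimal sketch of how a prompt can be assembled from the template above. The `build_prompt` helper is illustrative and not part of the released code; the token set and task description are taken from the lists above, and the input sentence from the example in the next section.

```python
def build_prompt(instruction_token, task_description, input_token, text, output_token):
    # Fill the placeholders of prompt_template with a concrete token set and input text.
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{text}\n"
        f"### {output_token}:\n\n"
    )

# Example: an English GEC prompt using the English token set.
prompt = build_prompt(
    instruction_token="Instruction",
    task_description="Fix grammatical errors in this sentence",
    input_token="Input",
    text="I has small cat ,",
    output_token="Output",
)
```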
### Run the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC (instruction and tokens given in Japanese)
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,

# German GEC (same Japanese instruction and tokens)
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'

# ...
# --> Ich habe eine kleine Katze ,
```
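Because `generate` on a causal language model returns the prompt tokens followed by the generated tokens, the decoded string contains the prompt itself. The following is a minimal sketch for recovering only the edited text; `extract_edit` is a hypothetical helper, and it assumes the output marker used in the prompt (here the Japanese 出力 token) appears last in the sequence.

```python
def extract_edit(decoded: str, output_token: str = "出力") -> str:
    # Everything after the final "### <output_token>:" marker is the model's edit.
    marker = f"### {output_token}:"
    return decoded.split(marker)[-1].strip()

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(extract_edit(decoded))  # --> I have a small cat ,
```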