---
license: cc-by-nc-sa-4.0
datasets:
  - wi_locness
  - matejklemen/falko_merlin
  - paws
  - paws-x
  - asset
language:
  - en
  - de
  - es
  - ar
  - ja
  - ko
  - zh
metrics:
  - bleu
  - rouge
  - sari
  - accuracy
library_name: transformers
---

# Model Card for mEdIT-xxl

The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.

Paper: mEdIT: Multilingual Text Editing via Instruction Tuning

Authors: Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

## Model Details

### Model Description

- **Language(s) (NLP):** Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-13b-lora`

### Model Sources

## How to use

### Instruction format

Adhering to the following instruction format is essential; prompts that deviate from it are likely to produce degraded results.

```python
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",                 # <-- Paraphrasing
    ...
]
```

The full list of possible instruction, input, and output tokens, as well as all task descriptions, can be found in the Appendix of our paper.


```python
prompt_template = """### <instruction_token>:\n<task description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```

Note that the tokens and the task description need not be in the language of the input.
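
Prompts can also be assembled programmatically from these pieces; `build_prompt` below is a hypothetical helper written for illustration, not part of the released code:

```python
def build_prompt(instruction_token: str, task_description: str,
                 input_token: str, text: str, output_token: str) -> str:
    # Fill the mEdIT prompt template with the chosen tokens and text.
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{text}\n"
        f"### {output_token}:\n\n"
    )

# English GEC request using English tokens and task description
prompt = build_prompt("Instruction", "Fix grammatical errors in this sentence",
                      "Input", "I has small cat ,", "Output")
```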

### Run the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC: the instruction, input, and output tokens and the task
# description are Japanese, while the sentence to correct is English
prompt = '### 命什:\nζ–‡η« γ‚’ζ–‡ζ³•ηš„γ«γ™γ‚‹\n### ε…₯εŠ›:\nI has small cat ,\n### ε‡ΊεŠ›:\n\n'

inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,
```
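
Note that `generate` returns the prompt tokens followed by the continuation, so the decoded string above contains the prompt as well. A minimal sketch for keeping only the newly generated text (standard `transformers` slicing, not specific to mEdIT):

```python
# Drop the prompt tokens; what remains is the model's edit.
new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
correction = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(correction)  # --> I have a small cat ,
```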

```python
# German GEC (same Japanese tokens and task description)
prompt = '### 命什:\nζ–‡η« γ‚’ζ–‡ζ³•ηš„γ«γ™γ‚‹\n### ε…₯εŠ›:\nIch haben eines kleines Katze ,\n### ε‡ΊεŠ›:\n\n'

# ...
# --> Ich habe eine kleine Katze ,
```
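
Since the underlying model has 13B parameters, loading it in full precision may not fit on a single consumer GPU. Below is a minimal loading sketch, assuming a GPU machine with the `accelerate` package installed; it is not part of the card's official instructions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Half precision roughly halves memory use; device_map="auto" places the
# weights across the available devices (requires accelerate).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
```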