Model Card for Google-T5-base-Grammatical-Error-Correction-Finetuned-C4-200M-550k

This model is fine-tuned for grammatical error correction (GEC). It generates grammatically corrected text from input sentences containing diverse types of errors, which makes it useful for writing enhancement and grammar correction across various domains.

Model Details

Model Description

This model is a fine-tuned version of Google-T5-base aimed at grammatical error correction. It takes an English sentence that may contain grammatical errors and generates a corrected version of that sentence.

Developed by: Abhinav Sarkar
Shared by: abhinavsarkar
Model type: Sequence-to-sequence (encoder-decoder) language model
Languages: English
Finetuned from model: Google-T5-base

Uses

Direct Use

This model is suitable for grammar and language correction tools that improve writing quality in emails, blogs, social media posts, and more. It is particularly helpful for users who want to improve the grammar and accuracy of their English across communication formats.

Downstream Use

The model can be integrated into systems that require high-quality text generation and correction, such as:

Grammar and spell-checking software
Educational platforms for language learning
Writing assistance tools for professionals
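
For tools like those listed above, it often helps to show users what changed rather than only the corrected sentence. The following is a minimal sketch, assuming a word-level diff from Python's standard difflib module is sufficient. The helper name describe_edits is hypothetical and is not part of the model or the Transformers library, and the corrected sentence in the example is hand-written rather than actual model output.

```python
import difflib
from typing import List, Tuple


def describe_edits(original: str, corrected: str) -> List[Tuple[str, str, str]]:
    """Return (operation, original_words, corrected_words) for each changed span.

    Hypothetical helper: compares the two sentences word by word and keeps
    only the spans that differ ('replace', 'insert' or 'delete').
    """
    src_words = original.split()
    tgt_words = corrected.split()
    matcher = difflib.SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged words are not reported
        edits.append((op, " ".join(src_words[i1:i2]), " ".join(tgt_words[j1:j2])))
    return edits


# Hand-written correction used for illustration (not actual model output).
print(describe_edits("He are moving here.", "He is moving here."))
# [('replace', 'are', 'is')]
```

A spell-checking or educational front end could render each tuple as a highlighted suggestion.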

How to Get Started with the Model

Use the following code snippets to get started with the model:

Prerequisites

!pip install -U sentencepiece transformers torch

Loading the model and its tokenizer

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'abhinavsarkar/Google-T5-base-Grammatical_Error_Correction-Finetuned-C4-200M-550k'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)

Running inference

def correct_grammar(input_text, num_return_sequences):
    """Return `num_return_sequences` corrected candidates for one input sentence."""
    # Inputs longer than 64 tokens are truncated.
    batch = tokenizer(
        [input_text], truncation=True, padding='max_length', max_length=64, return_tensors='pt'
    ).to(torch_device)
    # Beam search: num_return_sequences must not exceed num_beams.
    # temperature is omitted because it only takes effect when do_sample=True.
    outputs = model.generate(
        **batch, max_length=64, num_beams=4, num_return_sequences=num_return_sequences
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

text = 'He are moving here.'
print(correct_grammar(text, num_return_sequences=2))
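
Because correct_grammar truncates inputs at max_length=64 tokens, longer passages may be cut off. One possible workaround is to correct a passage sentence by sentence. The sketch below assumes that a naive regex split on sentence-ending punctuation is good enough. The names split_sentences, correct_document and fake_corrector are hypothetical, and the stand-in corrector replaces the real model so that the example runs without downloading anything.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def correct_document(text: str, correct_fn: Callable[[str], str]) -> str:
    """Correct each sentence independently and rejoin the results with spaces."""
    return " ".join(correct_fn(sentence) for sentence in split_sentences(text))


# ASSUMPTION: with the model loaded as shown earlier, correct_fn could be
#   lambda s: correct_grammar(s, num_return_sequences=1)[0]
# A stand-in corrector keeps this example runnable without the model.
def fake_corrector(sentence: str) -> str:
    return sentence.replace("He are", "He is")


print(correct_document("He are moving here. She left yesterday!", fake_corrector))
# He is moving here. She left yesterday!
```

Splitting on punctuation will mishandle abbreviations such as "e.g."; a dedicated sentence tokenizer may be preferable for production use.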

Training Details

Training Data

The model was fine-tuned on [abhinavsarkar/C4-200m-550k-Determiner], a subset of the C4-200M dataset [https://www.kaggle.com/datasets/felixstahlberg/the-c4-200m-dataset-for-gec]. C4-200M is a grammatical error correction (GEC) corpus of roughly 200 million examples containing diverse syntactic and semantic structures.

Training Procedure

The model was fine-tuned with the Hugging Face Transformers library in a hosted notebook environment (Google Colab), with Weights & Biases (Wandb) for experiment tracking.

Training Hyperparameters

Training regime: fp16 mixed precision
Epochs: 2
Batch size: 16
Learning rate: 2e-4
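
To give a rough sense of the training length these settings imply, the sketch below collects them in a small config object and estimates the number of optimizer steps. It rests on assumptions not stated in this card: that the training subset holds about 550,000 examples (inferred only from "550k" in the dataset name), that the batch size of 16 is the effective batch size, and that no gradient accumulation was used. FinetuneConfig and optimizer_steps are hypothetical names, not part of any library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as listed in this card."""
    epochs: int = 2
    batch_size: int = 16
    learning_rate: float = 2e-4
    fp16: bool = True  # fp16 mixed precision


def optimizer_steps(num_examples: int, cfg: FinetuneConfig, grad_accum: int = 1) -> int:
    """Estimate total optimizer steps, counting a final partial batch as a step."""
    batches_per_epoch = math.ceil(num_examples / cfg.batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * cfg.epochs


cfg = FinetuneConfig()
# ASSUMPTION: 550_000 examples, inferred from "550k" in the dataset name.
print(optimizer_steps(550_000, cfg))  # 68750
```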

Technical Specifications

Compute Infrastructure

Hardware

The fine-tuning was conducted on a single T4 GPU.

Software

Framework: PyTorch
Libraries: Hugging Face Transformers

More Information

For further details or inquiries, please reach out via LinkedIn or email at [email protected].

Model Card Authors

Abhinav Sarkar

Model Card Contact

[email protected]
