Regional Bengali text to IPA transcription - umt5-base

This is a fine-tuned version of google/umt5-base for generating IPA transcriptions from regional Bengali text. It was fine-tuned on the dataset of the competition “ভাষামূল: মুখের ভাষার খোঁজে” by Bengali.AI.

Scores achieved so far (on the test set):

  • Word error rate (wer): 0.02390405721962450
  • Char error rate (cer): 0.01011514943093060
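
For readers unfamiliar with these metrics, the sketch below shows the standard definitions: edit distance between reference and prediction, divided by the reference length, counted in words for WER and in characters for CER. The competition's exact scoring code is not given here, so the function names `edit_distance`, `wer`, and `cer` are illustrative assumptions, not the official implementation.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under these definitions, a WER of about 0.024 corresponds to roughly one word-level edit per 42 reference words.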

Supported district tokens:

  • Kishoreganj
  • Narail
  • Narsingdi
  • Chittagong
  • Rangpur
  • Tangail
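
Since the model expects the district token in front of the text, a small helper can validate the district and build the input string. This is a minimal sketch: `build_input` and `SUPPORTED_DISTRICTS` are hypothetical names, and whether the token is written with literal angle brackets (e.g. `<Narail>`) or as a bare name is an assumption to check against the training data, which is why it is exposed as a parameter.

```python
# District names copied from the list above.
SUPPORTED_DISTRICTS = (
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
)


def build_input(district: str, text: str, bracketed: bool = True) -> str:
    """Prefix `text` with a district token in the `<district> <text>` layout.

    ASSUMPTION: `bracketed=True` writes the token as "<Narail>"; set it to
    False if the model was trained with bare district names instead.
    """
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; expected one of {SUPPORTED_DISTRICTS}"
        )
    token = f"<{district}>" if bracketed else district
    return f"{token} {text.strip()}"
```

Rejecting unknown districts early avoids silently passing the model a prefix it never saw during fine-tuning.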

Loading & using the model

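# Load the model and tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-umt5base")

# The input text MUST be in the format: <district> <bengali_text>
text = "<district> bengali_text_here"
inputs = tokenizer(text, return_tensors="pt")

# A bare model(input_ids) forward pass fails for seq2seq models because no
# decoder inputs are given; generate() runs decoding and returns token ids.
output_ids = model.generate(**inputs, max_length=512)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])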

Using the pipeline

# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-umt5base", device=device)

# Each entry of `texts` must be in the format: <district> <contents>
texts = ["<district> bengali_text_here"]  # placeholder input
batch_size = 8  # example value; tune to the available memory
outputs = pipe(texts, max_length=512, batch_size=batch_size)
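
For a list of input strings, the text2text-generation pipeline normally returns one dict per input with a `generated_text` key. The sketch below pairs each input with its transcription; `pair_transcriptions` is a hypothetical helper name, and the output layout is assumed to be the default one (a single returned sequence per input).

```python
from typing import Dict, List, Tuple


def pair_transcriptions(
    texts: List[str], outputs: List[Dict[str, str]]
) -> List[Tuple[str, str]]:
    """Pair each input text with the pipeline's generated transcription.

    ASSUMPTION: `outputs` is a list of {"generated_text": ...} dicts,
    one per input, in the same order as `texts`.
    """
    if len(texts) != len(outputs):
        raise ValueError("texts and outputs must have the same length")
    return [(text, out["generated_text"]) for text, out in zip(texts, outputs)]
```

With the variables from the snippet above, `pair_transcriptions(texts, outputs)` gives a list of `(input, ipa)` tuples that can be written to a file or compared against reference transcriptions.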

Credits

Done by S M Jishanul Islam, Sadia Ahmmed, Sahid Hossain Mustakim

Model size: 592M params (Safetensors, tensor type F32)