XLSR-TIMIT-B0: Fine-tuned on TIMIT for Phonemic Transcription

This model starts from the pretrained checkpoint ginic/data_seed_4_wav2vec2-large-xlsr-buckeye-ipa and is fine-tuned on the DARPA TIMIT English corpus to transcribe English audio into phonemic (IPA) representations.

All code is available on GitHub.

At the time of writing, this model outperforms other XLSR-based IPA transcription models for English.

Performance

  • Training Loss: 1.254
  • Validation Loss: 0.267
  • Test Results (TIMIT test set):
    • Average Weighted Distance: 13.309375
    • Standard Deviation (Weighted Distance): 9.87
    • Average Character Error Rate (CER): 0.113
    • Standard Deviation (CER): 0.06

Model Information

  • Number of Epochs: 40
  • Learning Rate: 8e-5
  • Optimizer: Adam
  • Dataset Used: DARPA TIMIT English corpus

Example Outputs

  1. Prediction: lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi
    Ground Truth: lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi
    Weighted Feature Edit Distance: 7.875
    CER: 0.0556

  2. Prediction: ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts
    Ground Truth: ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts
    Weighted Feature Edit Distance: 2.375
    CER: 0.0588
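
The CER values above are consistent with the usual definition of character error rate: the Levenshtein (edit) distance between prediction and ground truth, divided by the length of the ground truth. The following is a minimal sketch under that assumption. It is not the authors' evaluation code, and the helper names `levenshtein` and `cer` are illustrative. The weighted feature edit distance is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(prediction: str, ground_truth: str) -> float:
    """Character error rate, assumed here to be edit distance / reference length."""
    return levenshtein(prediction, ground_truth) / len(ground_truth)


# The two prediction / ground-truth pairs listed above
examples = [
    ("lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi", "lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi"),
    ("ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts", "ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts"),
]
for prediction, ground_truth in examples:
    print(round(cer(prediction, ground_truth), 4))  # 0.0556, then 0.0588
```

Each example contains two character-level errors (36 and 34 reference characters respectively), which gives the listed CER values.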

Limitations

This phonemic transcription model is fine-tuned on an English speech corpus that does not cover all dialects, and it was not trained on other languages. It may significantly underperform on unseen dialects and languages. We aim to release models and datasets that better serve all populations and languages in the future.


Usage

To transcribe audio files, this model can be used as follows:

from transformers import AutoModelForCTC, AutoProcessor
import librosa  # one option for loading and resampling audio
import torch

# Load model and processor
model = AutoModelForCTC.from_pretrained("KoelLabs/xlsr-timit-b0")
processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-timit-b0")

# Load the audio as a 16 kHz mono waveform.
# The processor expects raw samples, not a file path.
audio_path = "path_to_your_audio_file.wav"  # Replace with your file
waveform, _ = librosa.load(audio_path, sr=16000)

# Prepare input
input_values = processor(waveform, return_tensors="pt", sampling_rate=16000).input_values

# Retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions: take the most likely token per frame, then let the
# processor collapse repeats and drop CTC blank tokens
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
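
For readers curious about what the decoding step does, greedy CTC decoding takes the most likely token ID per audio frame, collapses consecutive repeats, and drops the blank token. The sketch below illustrates this in plain Python. The toy vocabulary and the blank ID of 0 are assumptions for illustration only and do not reflect this model's real vocabulary; in practice `processor.batch_decode` handles this step.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive repeated IDs, then remove blanks and join the tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; ID 0 plays the role of the CTC blank
toy_vocab = {0: "<pad>", 1: "l", 2: "i", 3: "z"}

# Per-frame argmax IDs, as torch.argmax(logits, dim=-1) would produce for one utterance
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(greedy_ctc_decode(frames, toy_vocab))  # liz
```

A blank between two identical IDs keeps them distinct, so `[1, 0, 1]` decodes to `ll` rather than `l`.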