wav2vec2-large-xls-r-300m-hi
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Hindi automatic speech recognition; the training dataset is not recorded in this card. It achieves the following results on the evaluation set:
- Loss: 0.3611
- Wer: 0.2992
- Cer: 0.0786
View the results on Kaggle Notebook: https://www.kaggle.com/code/kingabzpro/wav2vec-2-eval
Evaluation
import torch
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import unicodedata
import re
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "hi", split="test")
wer = load_metric("wer")
cer = load_metric("cer")
processor = Wav2Vec2Processor.from_pretrained("kingabzpro/wav2vec2-large-xls-r-300m-hi")
model = Wav2Vec2ForCTC.from_pretrained("kingabzpro/wav2vec2-large-xls-r-300m-hi")
model.to("cuda")
# Preprocessing the datasets.
# We need to read the audio files as arrays and normalize the reference text.
def speech_file_to_array_fn(batch):
    # Raw string so the regex escapes are not treated as Python string escapes.
    chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\’\'\|\&\–]'
    remove_en = '[A-Za-z]'
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"].lower())
    batch["sentence"] = re.sub(remove_en, "", batch["sentence"])
    batch["sentence"] = unicodedata.normalize("NFKC", batch["sentence"])
    # Load the audio and resample it to the 16 kHz rate the model expects.
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on a batch of audio arrays and decode the predictions to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits
    # Greedy CTC decoding: take the most likely token at each frame.
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {}".format(wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {}".format(cer.compute(predictions=result["pred_strings"], references=result["sentence"])))
WER: 0.5209850206372026
CER: 0.17902923538230883
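The WER and CER values above are edit-distance ratios: the number of word-level (or character-level) substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the reference length. The following is a minimal pure-Python sketch of that idea, not the implementation behind `load_metric`. It assumes corpus-level aggregation (total errors over total reference length) and counts spaces as characters for CER; both choices can differ between libraries. The helper names `edit_distance` and `error_rate` and the English toy sentences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference unit
                curr[j - 1] + 1,     # insertion of a hypothesis unit
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, predictions, unit="word"):
    """Total edit distance divided by total reference length."""
    # Words are split on whitespace; characters include spaces (assumption).
    split = str.split if unit == "word" else list
    errors = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_units, hyp_units = split(ref), split(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return errors / total


# Toy data, not taken from Common Voice.
refs = ["the cat sat on the mat"]
hyps = ["the cat sit on mat"]
wer_value = error_rate(refs, hyps, unit="word")  # 2 errors over 6 words
cer_value = error_rate(refs, hyps, unit="char")  # 5 errors over 22 characters
print("WER: {:.4f}".format(wer_value))
print("CER: {:.4f}".format(cer_value))
```

Because the denominator is the reference length, any change to the reference normalization (punctuation stripping, removal of Latin letters, NFKC) shifts both scores, which is why the preprocessing step matters when comparing numbers across evaluations.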
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 300
- num_epochs: 100
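To make the interaction between these values concrete, here is a small sketch of how the effective batch size and a linear schedule with warmup can be computed from them. It is an illustration under stated assumptions, not the training code: it assumes a single device (so the total batch size is the per-step batch size times the accumulation steps) and a total of 1575 optimizer steps, which is only an estimate extrapolated from the results table (1500 steps at epoch 95.24). The function name `linear_schedule_lr` is hypothetical.

```python
LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 4
WARMUP_STEPS = 300
TOTAL_STEPS = 1575  # assumption: estimated from the results table, not stated in the card

# Gradients from 4 batches of 32 are accumulated before each optimizer step.
effective_batch_size = TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def linear_schedule_lr(step, base_lr=LEARNING_RATE,
                       warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print("effective batch size:", effective_batch_size)
for step in (0, 150, 300, 900, 1500):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 0.0001 at step 300 and has decayed to a small fraction of that by step 1500, which is consistent with the validation loss flattening out in the last rows of the results table.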
Training results
Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
---|---|---|---|---|---|
7.0431 | 19.05 | 300 | 3.4423 | 1.0 | 1.0 |
2.3233 | 38.1 | 600 | 0.5965 | 0.4757 | 0.1329 |
0.5676 | 57.14 | 900 | 0.3962 | 0.3584 | 0.0954 |
0.3611 | 76.19 | 1200 | 0.3651 | 0.3190 | 0.0820 |
0.2996 | 95.24 | 1500 | 0.3611 | 0.2992 | 0.0786 |
Framework versions
- Transformers 4.33.0
- Pytorch 2.0.0
- Datasets 2.1.0
- Tokenizers 0.13.3
Evaluation results
- Test WER on Common Voice 15 (self-reported): 0.293
- Test CER on Common Voice 15 (self-reported): 0.079
- Test WER on Common Voice 8 (self-reported): 0.521
- Test CER on Common Voice 8 (self-reported): 0.179