---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
model-index:
- name: Wav2Vec2_XLS-R-300m_Nepali_ASR
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: iamTangsang/OpenSLR54-Nepali-ASR
      name: OpenSLR54 Nepali ASR
    metrics:
    - type: wer
      value: 16.82
      name: Test WER
    - type: cer
      value: 2.72
      name: Test CER
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---
# Wav2Vec2_XLS-R-300m_Nepali_ASR
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
## Model description
This model fine-tunes the 300-million-parameter Wav2Vec2 XLS-R checkpoint for Nepali Automatic Speech Recognition. The reported results are on the OpenSLR test split.
- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%
## Intended uses & limitations
- Research on Nepali ASR
- Transcriptions on Nepali audio
- Further Fine-tuning
### Limitations
- The model is trained on the OpenSLR Nepali ASR dataset, which upon inspection was found to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals were filtered out as well.
- The vocabulary does not contain every character of the Nepali alphabet.
- The model might perform poorly on audio segments longer than 5 seconds; alternatively, audio can be segmented into 5-second chunks, which may increase processing time.
- It may struggle with background noise and overlapping speech.
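Since the model was trained on clips of at most 5 seconds, longer audio is best split into 5-second chunks before transcription. Below is a minimal sketch of such chunking; the repository id in the commented usage is an assumption, and the waveform is expected to be 16 kHz mono.

```python
import numpy as np

def chunk_audio(samples: np.ndarray, sr: int = 16_000, max_s: float = 5.0):
    """Split a 1-D waveform into consecutive chunks of at most max_s seconds."""
    step = int(sr * max_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# Hypothetical usage with the transformers ASR pipeline
# (model id assumed; transcripts of the chunks are concatenated):
# from transformers import pipeline
# asr = pipeline("automatic-speech-recognition",
#                model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR")
# text = " ".join(asr(chunk)["text"] for chunk in chunk_audio(waveform))
```

Note that naive fixed-length chunking can cut through words at chunk boundaries; silence-based segmentation would avoid this at some extra cost.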
## Training and evaluation data
### Common Voice v17.0
- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and CommonVoice Corpus v17.0
- Initially, the model was trained on [CommonVoice v17.0 ne-NP](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/ne-NP), which consists of about 2 hours of voice data, of which 1 hour has been manually validated.
- Because the dataset is very small, we first combined the `validated` and `other` splits, giving a total of 1,337 utterances.
- We preprocessed the data by removing all punctuation and symbols.
- We then used 80% of the utterances for training and 10% for evaluation.
- We used the `test` split, consisting of 217 utterances, for testing. (Some of it might have been present in the `train` split as well.)
- The model was trained for 30 epochs, by which point the WER fluctuated between 37% and 39%.
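The punctuation and symbol cleanup described above could look like the following sketch (the exact character set the authors removed is not specified, so this version simply keeps Devanagari characters and spaces and additionally drops the Devanagari danda):

```python
import re

def clean_transcript(text: str) -> str:
    """Remove punctuation and symbols, keeping Devanagari text and spaces."""
    text = re.sub(r"[\u0964\u0965]", "", text)       # Devanagari danda/double danda
    text = re.sub(r"[^\u0900-\u097F\s]", "", text)   # everything non-Devanagari
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace
```

For example, `clean_transcript("नमस्ते, संसार।")` yields `"नमस्ते संसार"`.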
### OpenSLR Nepali ASR training data
- Then, it was further trained on the larger OpenSLR Nepali ASR training dataset which has 157,000 utterances.
- First, numerals were removed, as their transcriptions were inconsistent with the utterances.
- Segments longer than 5 seconds were also removed because of resource limitations.
- Less frequently used characters were removed to reduce the vocabulary size.
- We ended up with 136,083 utterances for the whole dataset, which has been uploaded [here](https://huggingface.co/datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation and 10% for testing.
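The filtering rules above (drop clips over 5 seconds and transcripts containing numerals) can be sketched as a simple predicate; the function name and the decision to catch both Devanagari and Latin digits are our own choices for illustration:

```python
import re

MAX_SECONDS = 5.0
SAMPLE_RATE = 16_000
DIGITS = re.compile(r"[०-९0-9]")  # Devanagari and Latin numerals

def keep_example(audio_len_samples: int, transcript: str) -> bool:
    """Return True if an example survives the filtering described above."""
    if audio_len_samples > MAX_SECONDS * SAMPLE_RATE:
        return False                      # clip longer than 5 s
    return not DIGITS.search(transcript)  # drop transcripts with numerals
```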
## Training procedure
### Training on CommonVoice 17.0
The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
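Expressed as a `transformers` `TrainingArguments` config, the hyperparameters above would look roughly like this (a sketch, not the authors' exact script; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="wav2vec2-xlsr-nepali",   # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,       # effective train batch size 32
    warmup_steps=400,
    lr_scheduler_type="linear",
    num_train_epochs=30,
    seed=42,
    fp16=True,                           # Native AMP mixed precision
)
```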
### Initial Training on OpenSLR-54 for 16 epochs
The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP
### Further Training on OpenSLR-54 for 3 more epochs
We used the following:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP
### Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1