Model Overview
Description:
STT ES FastConformer Hybrid Transducer-CTC Large transcribes speech into upper- and lower-case Spanish text, including spaces, periods, commas, question marks, and inverted question marks. This collection contains the Spanish FastConformer Hybrid (Transducer and CTC) Large model (around 115M parameters) with punctuation and capitalization, trained on around 3,400 hours of Spanish speech. See the Model Architecture section and the NeMo documentation for complete architecture details.
It utilizes a Google SentencePiece [2] tokenizer with a vocabulary size of 1024.
This model is ready for non-commercial use.
NVIDIA NeMo: Training
To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
pip install nemo_toolkit['all']
How to Use this Model
The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_es_fastconformer_hybrid_large_pc_nc")
Transcribing using Python
Having instantiated the model, simply do:
asr_model.transcribe([path_to_audio_file])
Transcribing many audio files
Using Transducer mode inference:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_es_fastconformer_hybrid_large_pc_nc" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
Using CTC mode inference:
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
  pretrained_name="nvidia/stt_es_fastconformer_hybrid_large_pc_nc" \
  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
  decoder_type="ctc"
Input
This model accepts 16,000 Hz mono-channel audio (WAV files) as input.
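Before running inference, it can be useful to confirm that your audio matches this format. The helper below is a small illustrative sketch (not part of NeMo) that checks a WAV file for 16 kHz, mono, 16-bit PCM using only the Python standard library; the function name `check_asr_input` is hypothetical.

```python
import wave

def check_asr_input(path):
    """Return True if a WAV file matches the model's expected input:
    16 kHz sample rate, single (mono) channel, 16-bit PCM samples."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getframerate() == 16000   # 16,000 Hz sample rate
            and wf.getnchannels() == 1   # mono
            and wf.getsampwidth() == 2   # 16-bit samples
        )
```

Files in other formats can be converted beforehand with standard tools such as ffmpeg or sox.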
Output
This model provides transcribed speech as a string for a given audio sample.
Model Architecture
FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You may find more information on the details of FastConformer here: Fast-Conformer Model and about Hybrid Transducer-CTC training here: Hybrid Transducer-CTC.
Training
The NeMo toolkit [3] was used to train the models for several hundred epochs. The model is trained with this example script and this base config. The tokenizers for these models were built using the text transcripts of the train set with this script.
This model was initialized with the weights of the Spanish FastConformer Hybrid (Transducer and CTC) Large P&C model and fine-tuned using labeled and unlabeled data (with pseudo-labels).
Training Dataset:
The model was trained on around 3400 hours of Spanish speech data.
Mozilla Common Voice 12.0 Spanish [395h]
- Data Collection Method: by Human
- Labeling Method: by Human

Multilingual LibriSpeech [780h]
- Data Collection Method: by Human
- Labeling Method: by Human

Voxpopuli [108h]
- Data Collection Method: by Human
- Labeling Method: by Human

Fisher [141h]
- Data Collection Method: by Human
- Labeling Method: by Human

Proprietary corpus [2000h]
- Data Collection Method: by Human
- Labeling Method: Pseudo-labels
Testing Dataset:
The model was evaluated on the Mozilla Common Voice (MCV), Multilingual LibriSpeech (MLS), Voxpopuli, and Fisher test sets reported in the Performance section below.
Performance
Test Hardware: A5000 GPU
The performance of Automatic Speech Recognition models is measured using Character Error Rate (CER) and Word Error Rate (WER). Table 1 summarizes the performance of the model with the Transducer and CTC decoders across different datasets.
| Model | MCV %WER/CER test | MLS %WER/CER test | Voxpopuli %WER/CER test | Fisher %WER/CER test |
|---|---|---|---|---|
| RNNT head | 7.58 / 1.96 | 12.43 / 2.99 | 9.59 / 3.67 | 30.76 / 11.49 |
| CTC head | 8.23 / 2.20 | 12.63 / 3.11 | 9.93 / 3.79 | 31.20 / 11.44 |
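WER is the word-level edit distance between the hypothesis and reference transcript, divided by the number of reference words; CER is the same ratio at the character level. The sketch below illustrates the metric with a plain Levenshtein distance; it is a minimal educational example, not the scoring code NeMo uses for these tables.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```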
Table 2 provides the performance of the model when punctuation marks are separated during evaluation, using both the Transducer and CTC decoders.
| Model | MCV %WER/CER test | MLS %WER/CER test | Voxpopuli %WER/CER test | Fisher %WER/CER test |
|---|---|---|---|---|
| RNNT head | 6.79 / 2.16 | 11.63 / 3.96 | 8.84 / 4.06 | 27.88 / 13.40 |
| CTC head | 7.39 / 2.39 | 11.81 / 4.01 | 9.17 / 4.17 | 27.81 / 13.14 |
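"Separating punctuation" for evaluation means treating each predicted mark as its own token, so a misplaced comma counts as a single token error rather than corrupting an adjacent word. One plausible way to do this is sketched below; the exact preprocessing used for Table 2 may differ, and `separate_punctuation` is a hypothetical helper name.

```python
import re

# The punctuation this model predicts: period, comma, question mark,
# and the Spanish inverted question mark.
PUNCT = ".,?¿"

def separate_punctuation(text):
    """Insert spaces around punctuation so each mark scores as its own token."""
    spaced = re.sub(r"([{}])".format(re.escape(PUNCT)), r" \1 ", text)
    return spaced.split()
```

The token lists produced this way can then be scored with an ordinary word-level edit distance.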
License/Terms of Use:
The model weights are distributed under a research-friendly, non-commercial CC BY-NC 4.0 license.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.
References:
[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] Google SentencePiece Tokenizer
[3] NVIDIA NeMo Toolkit