---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_15_0
language:
- fr
metrics:
- wer
base_model:
- LeBenchmark/wav2vec2-FR-7K-large
pipeline_tag: automatic-speech-recognition
library_name: speechbrain
tags:
- Transformer
- wav2vec2
- CTC
- inference
---

# asr-wav2vec2-commonvoice-15-fr: LeBenchmark/wav2vec2-FR-7K-large fine-tuned on CommonVoice 15.0 French

*asr-wav2vec2-commonvoice-15-fr* is an Automatic Speech Recognition model fine-tuned on the CommonVoice 15.0 French set, with *LeBenchmark/wav2vec2-FR-7K-large* as the pretrained wav2vec2 model.

The fine-tuned model achieves the following performance:

| Release | Valid WER (%) | Test WER (%) | GPUs | Epochs |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|
| 2023-09-08 | 9.14 | 11.21 | 4xV100 32GB | 30 |

## 📝 Model Details

The ASR system is composed of:
- the **Tokenizer** (char), which transforms the input text into a sequence of characters ("cat" into ["c", "a", "t"]) and is trained on the training transcriptions (train.tsv).
- the **Acoustic model** (wav2vec2.0 + DNN + CTC greedy decoding). The pretrained wav2vec 2.0 model [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large) is combined with two DNN layers and fine-tuned on CommonVoice FR. The final acoustic representation is passed to the CTC greedy decoder.

We used recordings sampled at 16 kHz (single channel).
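To make the two components above more concrete, here is a minimal pure-Python sketch of character tokenization and CTC greedy decoding. It is an illustration only, not the SpeechBrain implementation: the `<blank>` symbol, the four-entry label set, and the per-frame scores are assumptions invented for the example.

```python
from itertools import groupby

# Hypothetical blank symbol; the real label set comes from the trained tokenizer.
BLANK = "<blank>"


def char_tokenize(text):
    """Split a transcription into characters, as a char-level tokenizer does."""
    return list(text)


def ctc_greedy_decode(frame_scores, labels, blank=BLANK):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring label for every frame.
    best = [
        labels[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # 2. Merge runs of consecutive identical labels into one.
    merged = [label for label, _ in groupby(best)]
    # 3. Remove the blank symbol.
    return "".join(label for label in merged if label != blank)


# Toy example: 6 frames scored over an assumed 4-symbol label set.
labels = [BLANK, "c", "a", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # c
    [0.10, 0.60, 0.20, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.10, 0.10, 0.10, 0.70],  # t (repeat, merged)
]

tokens = char_tokenize("cat")
decoded = ctc_greedy_decode(frame_scores, labels)
print(tokens)   # ['c', 'a', 't']
print(decoded)  # cat
```

The blank symbol is what lets CTC emit doubled letters: two identical characters survive the merge step only when a blank frame separates them.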

## 💻 How to transcribe a file with the model

### Install and import speechbrain

```bash
pip install speechbrain
```

```python
from speechbrain.inference.ASR import EncoderASR
```

### Pipeline

```python
def transcribe(audio, model):
    # Run the ASR model on one audio file and lowercase the transcript.
    return model.transcribe_file(audio).lower()


def save_transcript(transcript, audio, output_file):
    # Write one tab-separated line: audio path, then transcript.
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(f"{audio}\t{transcript}\n")


def main(audio):
    # audio: path to the audio file to transcribe.
    model = EncoderASR.from_hparams("Propicto/asr-wav2vec2-commonvoice-15-fr", savedir="tmp/")
    transcript = transcribe(audio, model)
    save_transcript(transcript, audio, "out.txt")
```

## ⚙️ Training Details

### Training Data

We use the train / valid / test splits provided by CommonVoice, which correspond to:

| | Train | Valid | Test |
|:-------------:|:-------------:|:--------------:|:--------------:|
| # utterances | 527,554 | 16,132 | 16,132 |
| # hours | 756.19 | 25.84 | 26.11 |

### Training Procedure

We follow the training procedure provided in the [ASR-CTC speechbrain recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/ASR/CTC).
The `common_voice_prepare.py` script handles the preprocessing of the dataset.

#### Training Hyperparameters

Refer to the `hyperparams.yaml` file for the hyperparameter values.

#### Training time

With 4xV100 32GB, training took approximately 81 hours.
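
As a quick sanity check on the figures reported in this section, the short sketch below derives the mean utterance duration per split and the approximate compute budget. The numbers are copied from the tables above; the helper name `mean_duration_seconds` is made up for the example, and the GPU-hour total is only as precise as the "approximately 81 hours" figure.

```python
# Figures copied from the Training Data table of this card.
splits = {
    "train": {"utterances": 527_554, "hours": 756.19},
    "valid": {"utterances": 16_132, "hours": 25.84},
    "test": {"utterances": 16_132, "hours": 26.11},
}


def mean_duration_seconds(utterances, hours):
    """Average length of one utterance, in seconds."""
    return hours * 3600 / utterances


for name, stats in splits.items():
    print(f"{name}: {mean_duration_seconds(**stats):.2f} s per utterance")

# 4 GPUs running for roughly 81 hours of wall-clock time.
gpu_hours = 4 * 81
print(f"approx. {gpu_hours} GPU-hours")
```

Utterances average between 5 and 6 seconds in every split, and the run amounts to roughly 324 V100 GPU-hours.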

#### Libraries

[SpeechBrain](https://speechbrain.github.io/):

```bibtex
@misc{SB2021,
    author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua },
    title = {SpeechBrain},
    year = {2021},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/speechbrain/speechbrain}},
}
```

## 💡 Information

- **Developed by:** Cécile Macaire
- **Funded by:** GENCI-IDRIS (Grant 2023-AD011013625R1), PROPICTO ANR-20-CE93-0005
- **Language(s) (NLP):** French
- **License:** Apache-2.0
- **Finetuned from model:** LeBenchmark/wav2vec2-FR-7K-large

## 📌 Citation

```bibtex
@inproceedings{macaire24_interspeech,
    title = {Towards Speech-to-Pictograms Translation},
    author = {Cécile Macaire and Chloé Dion and Didier Schwab and Benjamin Lecouteux and Emmanuelle Esperança-Rodier},
    year = {2024},
    booktitle = {Interspeech 2024},
    pages = {857--861},
    doi = {10.21437/Interspeech.2024-490},
    issn = {2958-1796},
}
```