asr-wav2vec2-commonvoice-15-fr : LeBenchmark/wav2vec2-FR-7K-large fine-tuned on CommonVoice 15.0 French

asr-wav2vec2-commonvoice-15-fr is an Automatic Speech Recognition model obtained by fine-tuning the pretrained wav2vec 2.0 model LeBenchmark/wav2vec2-FR-7K-large on the French subset of CommonVoice 15.0.

The fine-tuned model achieves the following performance:

Release	Valid WER	Test WER	GPUs	Epochs
2023-09-08	9.14	11.21	4xV100 32GB	30
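
For readers unfamiliar with the metric, the sketch below shows how a word error rate can be computed, assuming the usual definition (word-level edit distance divided by the number of reference words, expressed in percent). The function name `word_error_rate` is hypothetical and not part of SpeechBrain; the scoring used in the recipe may apply additional text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent (assumed standard definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

With this definition, one substituted word in a three-word reference gives a WER of about 33.3.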

📝 Model Details

The ASR system is composed of:

the Tokenizer (char), which transforms the input text into a sequence of characters ("cat" into ["c", "a", "t"]) and is trained on the train transcriptions (train.tsv).
the Acoustic model (wav2vec 2.0 + DNN + CTC greedy decoding). The pretrained wav2vec 2.0 model LeBenchmark/wav2vec2-FR-7K-large is combined with two DNN layers and fine-tuned on CommonVoice FR. The final acoustic representation is passed to the CTC greedy decoder.
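
The two components can be illustrated with a small, self-contained sketch. It is not the SpeechBrain implementation: the helper names (`build_vocab`, `encode`, `ctc_greedy_decode`), the blank index 0, and the alphabetical id ordering are all assumptions made for illustration. The input to the decoder is taken to be the most likely symbol id for each acoustic frame.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def build_vocab(transcripts):
    """Map each character seen in the transcripts to an integer id."""
    chars = sorted({ch for text in transcripts for ch in text})
    # Ids start at 1 because 0 is reserved for the blank in this sketch.
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode(text, vocab):
    """Character tokenization: one id per character."""
    return [vocab[ch] for ch in text]


def ctc_greedy_decode(frame_ids, id_to_char):
    """Collapse consecutive repeated ids, then drop blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != BLANK:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)


vocab = build_vocab(["cat"])                    # {'a': 1, 'c': 2, 't': 3}
id_to_char = {i: ch for ch, i in vocab.items()}
frames = [2, 2, 0, 1, 1, 0, 0, 3]               # per-frame best ids
decoded = ctc_greedy_decode(frames, id_to_char)  # "cat"
```

A blank between two identical ids keeps both characters, which is how doubled letters survive the collapse step.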

We used single-channel recordings sampled at 16 kHz.
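
To check that a recording matches this format before transcription, a minimal sketch using only the Python standard library could look like the following. The function name `is_16khz_mono` is hypothetical, and the check only works for PCM WAV files; other formats would need a different reader.

```python
import wave


def is_16khz_mono(path: str) -> bool:
    """Return True if the WAV file at `path` is 16 kHz, single channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```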

💻 How to transcribe a file with the model

Install and import speechbrain

pip install speechbrain

from speechbrain.inference.ASR import EncoderASR

Pipeline

def transcribe(audio, model):
    return model.transcribe_file(audio).lower()


def save_transcript(transcript, audio, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(f"{audio}\t{transcript}\n")


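def main(audio):
    # audio: path to the recording to transcribe
    # Download the model from the Hugging Face Hub and cache it in tmp/
    model = EncoderASR.from_hparams("Propicto/asr-wav2vec2-commonvoice-15-fr", savedir="tmp/")
    transcript = transcribe(audio, model)
    save_transcript(transcript, audio, "out.txt")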

⚙️ Training Details

Training Data

We use the train / valid / test splits provided by CommonVoice, which correspond to:

	Train	Valid	Test
# utterances	527,554	16,132	16,132
# hours	756.19	25.84	26.11

Training Procedure

We follow the training procedure provided in the ASR-CTC SpeechBrain recipe. The common_voice_prepare.py script handles the preprocessing of the dataset.

Training Hyperparameters

Refer to the hyperparams.yaml file for the hyperparameter values.

Training time

With 4xV100 32GB, training took approximately 81 hours.

Libraries

SpeechBrain:

@misc{SB2021,
    author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua },
    title = {SpeechBrain},
    year = {2021},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/speechbrain/speechbrain}},
  }

💡 Information

Developed by: Cécile Macaire
Funded by: GENCI-IDRIS (Grant 2023-AD011013625R1), PROPICTO (ANR-20-CE93-0005)
Language(s) (NLP): French
License: Apache-2.0
Finetuned from model: LeBenchmark/wav2vec2-FR-7K-large

📌 Citation

@inproceedings{macaire24_interspeech,
  title     = {Towards Speech-to-Pictograms Translation},
  author    = {Cécile Macaire and Chloé Dion and Didier Schwab and Benjamin Lecouteux and Emmanuelle Esperança-Rodier},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {857--861},
  doi       = {10.21437/Interspeech.2024-490},
  issn      = {2958-1796},
}
