metadata

language:
  - th
tags:
  - automatic-speech-recognition
license: apache-2.0
datasets:
  - common_voice
metrics:
  - wer
  - cer

Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer)

This model was trained on the CommonVoice V8 dataset, which extends the CommonVoice V7 data used to train airesearch/wav2vec2-large-xlsr-53-th. It was fine-tuned from wav2vec2-large-xlsr-53.

GitHub: https://github.com/wannaphong/thai-wav2vec2-cv-v8

Datasets

The dataset adds the new data in Common Voice V8 to the Common Voice V7 dataset. To do this, all data that already appears in Common Voice V7 is removed from Common Voice V8 before splitting, and the Common Voice V7 dataset is then added back to the result.

The ekapolc/Thai_commonvoice_split script is used to split the Common Voice dataset.

You can read more at wannaphong/thai_commonvoice_dataset
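The construction described above can be sketched roughly as follows. This is only an illustration, not the ekapolc/Thai_commonvoice_split script: the function name `build_splits`, the clip-ID representation, the random shuffle, the 80/10/10 ratios, and the choice to merge each V7 split into the split of the same name are all assumptions made for the example.

```python
import random


def build_splits(v8_ids, v7_splits, ratios=(0.8, 0.1, 0.1), seed=0):
    """Hypothetical sketch: split only the clips that are new in V8,
    then add the existing V7 splits back.

    v8_ids    -- iterable of all clip IDs in Common Voice V8
    v7_splits -- dict with "train", "dev", "test" lists of V7 clip IDs
    ratios    -- assumed train/dev/test ratios for the new clips
    """
    # Every clip that already belongs to some V7 split.
    v7_ids = set().union(*v7_splits.values())

    # Remove all V7 data from V8 before splitting.
    new_ids = sorted(set(v8_ids) - v7_ids)
    random.Random(seed).shuffle(new_ids)

    n_train = int(len(new_ids) * ratios[0])
    n_dev = int(len(new_ids) * ratios[1])
    new_splits = {
        "train": new_ids[:n_train],
        "dev": new_ids[n_train:n_train + n_dev],
        "test": new_ids[n_train + n_dev:],
    }

    # Add the V7 data back, keeping each V7 clip in its original split
    # (an assumption of this sketch).
    return {
        name: list(v7_splits[name]) + new_splits[name]
        for name in ("train", "dev", "test")
    }
```

Under these assumptions no V7 test clip can end up in the V8 training split, which is what would make a comparison on the V7 test set meaningful.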

Models

This model was fine-tuned from the wav2vec2-large-xlsr-53 model on the Thai Common Voice V8 dataset, with the text pre-tokenized using pythainlp.tokenize.word_tokenize.
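Thai is written without spaces between words, so the transcripts need word boundaries before training. A minimal sketch of that pre-tokenization step is shown below. The helper name `pretokenize` and the choice to join tokens with a single space are assumptions for illustration and are not taken from the training code; the tokenizer is passed in as a function so the helper itself has no dependencies.

```python
def pretokenize(sentence, tokenize):
    """Segment a sentence into words and join them with single spaces.

    tokenize -- any function that maps a string to a list of tokens.
    """
    # Drop tokens that are empty or whitespace-only so that spaces
    # already present in the text do not produce double spaces.
    tokens = [tok for tok in tokenize(sentence) if tok.strip()]
    return " ".join(tokens)


# With PyThaiNLP installed, the newmm tokenizer could be plugged in like this:
#   from pythainlp.tokenize import word_tokenize
#   pretokenize(text, lambda s: word_tokenize(s, engine="newmm"))
```

Passing `engine="deepcut"` instead would select the deepcut tokenizer, which requires the separate deepcut package.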

Training

I reused much of the code from vistec-AI/wav2vec2-large-xlsr-53-th and fixed a bug in its training code in vistec-AI/wav2vec2-large-xlsr-53-th#2.

Evaluation

Tested on the CommonVoice V8 test set

Model	WER by newmm (%)	WER by deepcut (%)	CER	URL
wav2vec2 with deepcut	16.354521	11.424476	3.684060	https://github.com/wannaphong/th-cv-v8-wav2vev2-deepcut
wav2vec2 with newmm	16.698299	11.436941	3.737407	https://github.com/wannaphong/thai-wav2vec2-cv-v8
CV v7	17.414503	11.923089	3.854153	https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th

Tested on the CommonVoice V7 test set (the same test set used for the CV V7 model)

Model	WER by newmm (%)	WER by deepcut (%)	CER	URL
wav2vec2 with deepcut	12.776381	8.773006	2.628882	https://github.com/wannaphong/th-cv-v8-wav2vev2-deepcut
wav2vec2 with newmm	12.750596	8.672616	2.623341	https://github.com/wannaphong/thai-wav2vec2-cv-v8
CV v7	13.936698	9.347462	2.804787	https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th

This uses the same test set as https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th.

Benchmark source code: https://github.com/wannaphong/thai-asr-benchmark/tree/main/commonvoice
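The tables report WER under two word tokenizers because, for Thai, the word error rate depends on how the reference and hypothesis are segmented into words. The sketch below shows one common way to compute WER and CER as percentages from a Levenshtein edit distance. It is not the code from the benchmark repository: the function names are hypothetical, and removing spaces before computing CER is an assumption of this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance divided by total reference length, in percent."""
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("references must not be empty")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / total


def wer(refs, hyps, tokenize):
    """Word error rate; tokenize maps a string to a list of words."""
    return error_rate([tokenize(r) for r in refs],
                      [tokenize(h) for h in hyps])


def cer(refs, hyps):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    return error_rate([r.replace(" ", "") for r in refs],
                      [h.replace(" ", "") for h in hyps])
```

For Thai text, `tokenize` would be a word segmenter such as `lambda s: word_tokenize(s, engine="newmm")` from PyThaiNLP. Running `wer` once per tokenizer yields one WER column per tokenizer, while CER does not depend on the tokenizer.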

Files

0-download-unzip.ipynb - notebook for downloading and unzipping CommonVoice V8
1-convert-mp3-wav.ipynb - notebook for converting mp3 files to wav files
1-preprocessing-thai-cv-v8-wav2vev2.ipynb - notebook for preprocessing CommonVoice V8 (old file)
2-gen-val-json.py - Python script for generating the manifest for NVIDIA NeMo ASR
2-preprocessing-thai-cv-v8-wav2vev2.ipynb - notebook for preprocessing CommonVoice V8
4-gen-manifest.ipynb - notebook for generating the manifest for NVIDIA NeMo ASR
build-lm.ipynb - notebook for building the ASR language model (LM)
test-ai4thai.ipynb - notebook for testing AI For Thai.
test-google.ipynb - notebook for testing Google ASR.
test-v7.ipynb - notebook for testing the vistec-AI/wav2vec2-large-xlsr-53-th model.
test-wav2vec2-lm.ipynb - notebook for testing our model with the LM.
test-wav2vec2.ipynb - notebook for testing our model without the LM.
train-wav2vec2.py - Python script for training the model.

Links:

GitHub Dataset: https://github.com/wannaphong/thai_commonvoice_dataset