Update README.md
README.md
CHANGED
@@ -53,7 +53,7 @@ print("Reference:", test_dataset["sentence"][:2])

## Datasets

-[Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. The resulting dataset is as follows:
+[Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. You can use this script together with `train_cleand.tsv`, `validation_cleaned.tsv`, and `test_cleaned.tsv` to get the same splits as we do. The resulting dataset is as follows:

```
DatasetDict({
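As a quick illustration of the loading and pre-tokenization steps described in the hunk above, here is a minimal sketch that loads the cleaned splits through `scripts/th_common_voice_70.py` and tokenizes one transcription with `pythainlp.tokenize.word_tokenize`. This is not the repo's own code: the `data_dir` value, the `newmm` engine choice, and reading from the `train` split are assumptions.

```python
# Hedged sketch: load the cleaned Common Voice 7.0 Thai splits via the repo's
# dataset loading script, then pre-tokenize one transcription with PyThaiNLP.
from datasets import load_dataset
from pythainlp.tokenize import word_tokenize

# load_dataset accepts a path to a local loading script; how the cleaned
# *.tsv files are passed to it (here a hypothetical data_dir) depends on
# the script and on your version of the datasets library.
cv_th = load_dataset("scripts/th_common_voice_70.py", data_dir="data")

print(cv_th)  # expected: a DatasetDict with train/validation/test splits

# Pre-tokenize a transcription as the README describes; "newmm" is the
# PyThaiNLP default engine and an assumption here.
sample = cv_th["train"][0]
print(" ".join(word_tokenize(sample["sentence"], engine="newmm")))
```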
@@ -122,5 +122,10 @@ We benchmark on the test set using WER with words tokenized by [PyThaiNLP](https
| without spell correction | 0.20754109 | 0.03727126 |
| with spell correction | TBD | TBD |

-
+## Acknowledgements
+* model training and validation notebooks/scripts by [@cstorm125](https://github.com/cstorm125/)
+* dataset cleaning scripts by [@tann9949](https://github.com/tann9949)
+* dataset splits by [@ekapolc](https://github.com/ekapolc/) and his students
+* running the training by [@mrpeerat](https://github.com/mrpeerat)
+* spell correction by [@wannaphong](https://github.com/wannaphong)
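The benchmark hunk above reports WER computed over words tokenized by PyThaiNLP. The sketch below only illustrates that tokenize-then-score idea and is not the repo's evaluation code: the `newmm` engine and the toy strings are assumptions, and the edit distance is a plain Levenshtein implementation.

```python
# Hedged sketch: WER over PyThaiNLP word boundaries via token-level Levenshtein.
from pythainlp.tokenize import word_tokenize


def edit_distance(ref_tokens, hyp_tokens):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate with Thai word boundaries from PyThaiNLP (engine assumed)."""
    ref_tokens = word_tokenize(reference, engine="newmm")
    hyp_tokens = word_tokenize(hypothesis, engine="newmm")
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


# Toy usage with made-up strings:
print(wer("สวัสดีครับ", "สวัสดีครับผม"))
```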