cstorm125 committed on
Commit
f82bd96
·
1 Parent(s): cbc9d72

Update README.md

Files changed (1)
  1. README.md +7 -2
README.md CHANGED
@@ -53,7 +53,7 @@ print("Reference:", test_dataset["sentence"][:2])
53
 
54
  ## Datasets
55
 
56
- [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. The resulting dataset is as follows:
57
 
58
  ```
59
  DatasetDict({
@@ -122,5 +122,10 @@ We benchmark on the test set using WER with words tokenized by [PyThaiNLP](https
122
  | without spell correction | 0.20754109 | 0.03727126 |
123
  | with spell correction | TBD | TBD |
124
 
125
-
 
 
 
 
 
126
 
 
53
 
54
  ## Datasets
55
 
56
+ [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. You can use this script together with `train_cleaned.tsv`, `validation_cleaned.tsv` and `test_cleaned.tsv` to reproduce our splits. The resulting dataset is as follows:
57
 
58
  ```
59
  DatasetDict({
 
122
  | without spell correction | 0.20754109 | 0.03727126 |
123
  | with spell correction | TBD | TBD |
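
The WER figures in the table are word-level error rates over pre-tokenized text. As a rough illustration of how such a number can be computed, here is a minimal sketch in plain Python. It assumes the references and predictions have already been split into word lists (for example with `pythainlp.tokenize.word_tokenize`, which is not called here), and it aggregates errors over the whole corpus. The function names `edit_distance` and `wer` are hypothetical, and this is not necessarily the exact evaluation code used for the benchmark.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Placeholder tokens stand in for tokenized Thai words.
refs = [["a", "b", "c"], ["d", "e"]]
hyps = [["a", "x", "c"], ["d"]]
print(wer(refs, hyps))  # 0.4 (one substitution + one deletion over five reference words)
```

Running the same functions on lists of characters instead of words gives a character error rate.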
124
 
125
+ ## Acknowledgements
126
+ * model training and validation notebooks/scripts [@cstorm125](https://github.com/cstorm125/)
127
+ * dataset cleaning scripts [@tann9949](https://github.com/tann9949)
128
+ * dataset splits [@ekapolc](https://github.com/ekapolc/) and his students
129
+ * running the training [@mrpeerat](https://github.com/mrpeerat)
130
+ * spell correction [@wannaphong](https://github.com/wannaphong)
131
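
The Datasets paragraph says the clips are split so that cleaning does not leak sentences between splits and so that most data stays in the training set. The sketch below shows one way such a split could be built: clips are grouped by their cleaned sentence, and whole groups are assigned to a split, so the same sentence never appears in two splits. This is an assumption made for illustration and not the procedure from `ekapolc/Thai_commonvoice_split`. The function name `split_by_sentence` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def split_by_sentence(rows, n_valid, n_test, seed=0):
    """Split (clip_path, cleaned_sentence) rows without sharing sentences across splits.

    n_valid and n_test count unique sentences, not clips; everything else goes to train.
    """
    groups = defaultdict(list)
    for path, sentence in rows:
        groups[sentence].append(path)

    # Sort before shuffling so the result depends only on the seed.
    sentences = sorted(groups)
    random.Random(seed).shuffle(sentences)

    test_s = sentences[:n_test]
    valid_s = sentences[n_test:n_test + n_valid]
    train_s = sentences[n_test + n_valid:]

    def expand(selected):
        # Turn a list of sentences back into (clip_path, sentence) rows.
        return [(p, s) for s in selected for p in groups[s]]

    return {
        "train": expand(train_s),
        "validation": expand(valid_s),
        "test": expand(test_s),
    }


# Toy data: 40 clips covering 10 distinct sentences.
rows = [(f"clip_{i}.mp3", f"sent_{i % 10}") for i in range(40)]
splits = split_by_sentence(rows, n_valid=2, n_test=2, seed=1)
print({name: len(items) for name, items in splits.items()})
```

Keeping the validation and test sets to a small, fixed number of sentences is what leaves the majority of the clips for training.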