Update README.md
README.md
CHANGED
@@ -53,7 +53,7 @@ print("Reference:", test_dataset["sentence"][:2])

## Datasets

-[Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. The resulting dataset is as follows:
+[Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. You can use this script together with `train_cleand.tsv`, `validation_cleaned.tsv`, and `test_cleaned.tsv` to get the same splits as we do. The resulting dataset is as follows:

```
DatasetDict({
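As a quick illustration of the loading and pre-tokenization steps described in the hunk above, here is a minimal sketch that loads the cleaned splits through `scripts/th_common_voice_70.py` and tokenizes one transcription with `pythainlp.tokenize.word_tokenize`. This is not the repo's own code: the `data_dir` value, the `newmm` engine choice, and reading from the `train` split are assumptions.

```python
# Hedged sketch: load the cleaned Common Voice 7.0 Thai splits via the repo's
# dataset loading script, then pre-tokenize one transcription with PyThaiNLP.
from datasets import load_dataset
from pythainlp.tokenize import word_tokenize

# load_dataset accepts a path to a local loading script; how the cleaned
# *.tsv files are passed to it (here a hypothetical data_dir) depends on
# the script and on your version of the datasets library.
cv_th = load_dataset("scripts/th_common_voice_70.py", data_dir="data")

print(cv_th)  # expected: a DatasetDict with train/validation/test splits

# Pre-tokenize a transcription as the README describes; "newmm" is the
# PyThaiNLP default engine and an assumption here.
sample = cv_th["train"][0]
print(" ".join(word_tokenize(sample["sentence"], engine="newmm")))
```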
@@ -122,5 +122,10 @@ We benchmark on the test set using WER with words tokenized by [PyThaiNLP](https
| without spell correction | 0.20754109 | 0.03727126 |
| with spell correction | TBD | TBD |

-
+## Acknowledgements
+* model training and validation notebooks/scripts by [@cstorm125](https://github.com/cstorm125/)
+* dataset cleaning scripts by [@tann9949](https://github.com/tann9949)
+* dataset splits by [@ekapolc](https://github.com/ekapolc/) and his students
+* running the training by [@mrpeerat](https://github.com/mrpeerat)
+* spell correction by [@wannaphong](https://github.com/wannaphong)
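The benchmark hunk above reports WER computed over words tokenized by PyThaiNLP. The sketch below only illustrates that tokenize-then-score idea and is not the repo's evaluation code: the `newmm` engine and the toy strings are assumptions, and the edit distance is a plain Levenshtein implementation.

```python
# Hedged sketch: WER over PyThaiNLP word boundaries via token-level Levenshtein.
from pythainlp.tokenize import word_tokenize


def edit_distance(ref_tokens, hyp_tokens):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate with Thai word boundaries from PyThaiNLP (engine assumed)."""
    ref_tokens = word_tokenize(reference, engine="newmm")
    hyp_tokens = word_tokenize(hypothesis, engine="newmm")
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)


# Toy usage with made-up strings:
print(wer("สวัสดีครับ", "สวัสดีครับผม"))
```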