w11wo's picture
Update README.md
a2ccdd3
|
raw
history blame
9.3 kB
metadata
language: ko
license: apache-2.0
tags:
  - automatic-speech-recognition
  - generated_from_trainer
  - robust-speech-event
datasets:
  - kresnik/zeroth_korean
model-index:
  - name: Wav2Vec2 XLS-R 300M Korean LM
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Zeroth Korean
          type: kresnik/zeroth_korean
          args: clean
        metrics:
          - name: Test WER
            type: wer
            value: 30.94
          - name: Test CER
            type: cer
            value: 7.97
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Robust Speech Event - Dev Data
          type: speech-recognition-community-v2/dev_data
          args: ko
        metrics:
          - name: Test WER
            type: wer
            value: 68.34
          - name: Test CER
            type: cer
            value: 37.08

Wav2Vec2 XLS-R 300M Korean LM

Wav2Vec2 XLS-R 300M Korean LM is an automatic speech recognition model based on the XLS-R architecture. This model is a fine-tuned version of Wav2Vec2-XLS-R-300M on the Zeroth Korean dataset. A 5-gram Language model, trained on the Korean subset of Open Subtitles, was then subsequently added to this model.

This model was trained using HuggingFace's PyTorch framework and is part of the Robust Speech Challenge Event organized by HuggingFace. All training was done on a Tesla V100, sponsored by OVH.

All necessary scripts used for training could be found in the Files and versions tab, as well as the Training metrics logged via Tensorboard.

As for the N-gram language model training, we followed the blog post tutorial provided by HuggingFace.

Model

Model #params Arch. Training/Validation data (text)
wav2vec2-xls-r-300m-korean-lm 300M XLS-R Zeroth Korean Dataset

Evaluation Results

The model achieves the following results on evaluation without a language model:

Dataset WER CER
Zeroth Korean 29.54% 9.53%
Robust Speech Event - Dev Data 76.26% 38.67%

With the addition of the language model, it achieves the following results:

Dataset WER CER
Zeroth Korean 30.94% 7.97%
Robust Speech Event - Dev Data 68.34% 37.08%

Training procedure

The training process did not involve the addition of a language model. The following results were simply lifted from the original automatic speech recognition model training.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 7.5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 2000
  • num_epochs: 50.0
  • mixed_precision_training: Native AMP

Training results

Training Loss Epoch Step Validation Loss Wer Cer
19.7138 0.72 500 19.6427 1.0 1.0
4.8039 1.44 1000 4.7842 1.0 1.0
4.5619 2.16 1500 4.5608 0.9992 0.9598
4.254 2.88 2000 4.2729 0.9955 0.9063
4.1905 3.6 2500 4.2257 0.9903 0.8758
4.0683 4.32 3000 3.9294 0.9937 0.7911
3.486 5.04 3500 2.7045 1.0012 0.5934
2.946 5.75 4000 1.9691 0.9425 0.4634
2.634 6.47 4500 1.5212 0.8807 0.3850
2.4066 7.19 5000 1.2551 0.8177 0.3601
2.2651 7.91 5500 1.0423 0.7650 0.3039
2.1828 8.63 6000 0.9599 0.7273 0.3106
2.1023 9.35 6500 0.9482 0.7161 0.3063
2.0536 10.07 7000 0.8242 0.6767 0.2860
1.9803 10.79 7500 0.7643 0.6563 0.2637
1.9468 11.51 8000 0.7319 0.6441 0.2505
1.9178 12.23 8500 0.6937 0.6320 0.2489
1.8515 12.95 9000 0.6443 0.6053 0.2196
1.8083 13.67 9500 0.6286 0.6122 0.2148
1.819 14.39 10000 0.6015 0.5986 0.2074
1.7684 15.11 10500 0.5682 0.5741 0.1982
1.7195 15.83 11000 0.5385 0.5592 0.2007
1.7044 16.55 11500 0.5362 0.5524 0.2097
1.6879 17.27 12000 0.5119 0.5489 0.2083
1.656 17.98 12500 0.4990 0.5362 0.1968
1.6122 18.7 13000 0.4561 0.5092 0.1900
1.5919 19.42 13500 0.4778 0.5225 0.1975
1.5896 20.14 14000 0.4563 0.5098 0.1859
1.5589 20.86 14500 0.4362 0.4940 0.1725
1.5353 21.58 15000 0.4140 0.4826 0.1580
1.5441 22.3 15500 0.4031 0.4742 0.1550
1.5116 23.02 16000 0.3916 0.4748 0.1545
1.4731 23.74 16500 0.3841 0.4810 0.1542
1.4647 24.46 17000 0.3752 0.4524 0.1475
1.4328 25.18 17500 0.3587 0.4476 0.1461
1.4129 25.9 18000 0.3429 0.4242 0.1366
1.4062 26.62 18500 0.3450 0.4251 0.1355
1.3928 27.34 19000 0.3297 0.4145 0.1322
1.3906 28.06 19500 0.3210 0.4185 0.1336
1.358 28.78 20000 0.3131 0.3970 0.1275
1.3445 29.5 20500 0.3069 0.3920 0.1276
1.3159 30.22 21000 0.3035 0.3961 0.1255
1.3044 30.93 21500 0.2952 0.3854 0.1242
1.3034 31.65 22000 0.2966 0.3772 0.1227
1.2963 32.37 22500 0.2844 0.3706 0.1208
1.2765 33.09 23000 0.2841 0.3567 0.1173
1.2438 33.81 23500 0.2734 0.3552 0.1137
1.2487 34.53 24000 0.2703 0.3502 0.1118
1.2249 35.25 24500 0.2650 0.3484 0.1142
1.2229 35.97 25000 0.2584 0.3374 0.1097
1.2374 36.69 25500 0.2568 0.3337 0.1095
1.2153 37.41 26000 0.2494 0.3327 0.1071
1.1925 38.13 26500 0.2518 0.3366 0.1077
1.1908 38.85 27000 0.2437 0.3272 0.1057
1.1858 39.57 27500 0.2396 0.3265 0.1044
1.1808 40.29 28000 0.2373 0.3156 0.1028
1.1842 41.01 28500 0.2356 0.3152 0.1026
1.1668 41.73 29000 0.2319 0.3188 0.1025
1.1448 42.45 29500 0.2293 0.3099 0.0995
1.1327 43.17 30000 0.2265 0.3047 0.0979
1.1307 43.88 30500 0.2222 0.3078 0.0989
1.1419 44.6 31000 0.2215 0.3038 0.0981
1.1231 45.32 31500 0.2193 0.3013 0.0972
1.139 46.04 32000 0.2162 0.3007 0.0968
1.1114 46.76 32500 0.2122 0.2982 0.0960
1.111 47.48 33000 0.2125 0.2946 0.0948
1.0982 48.2 33500 0.2099 0.2957 0.0953
1.109 48.92 34000 0.2092 0.2955 0.0955
1.0905 49.64 34500 0.2088 0.2954 0.0953

Disclaimer

Do consider the biases which came from pre-training datasets that may be carried over into the results of this model.

Authors

Wav2Vec2 XLS-R 300M Korean LM was trained and evaluated by Wilson Wongso. All computation and development are done on OVH Cloud.

Framework versions

  • Transformers 4.17.0.dev0
  • Pytorch 1.10.2+cu102
  • Datasets 1.18.2.dev0
  • Tokenizers 0.10.3