poonehmousavi committed on
Commit 5025284 · 1 Parent(s): 74357e5

Upload 3 files

Files changed (3)
  1. README.md +36 -18
  2. config.json +67 -67
  3. hyperparams.yaml +52 -99
README.md CHANGED
@@ -1,42 +1,60 @@
1
  ---
2
- language: "en"
3
- thumbnail:
 
4
  pipeline_tag: automatic-speech-recognition
5
  tags:
6
  - CTC
7
  - pytorch
8
  - speechbrain
9
  - Transformer
10
- license: "apache-2.0"
11
  datasets:
12
  - commonvoice
13
  metrics:
14
  - wer
15
  - cer
16
  ---
17
 
18
  <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
19
  <br/><br/>
20
 
21
- # wav2vec 2.0 with CTC trained on CommonVoice English (No LM)
22
 
23
  This repository provides all the necessary tools to perform automatic speech
24
- recognition from an end-to-end system pretrained on CommonVoice (English Language) within
25
  SpeechBrain. For a better experience, we encourage you to learn more about
26
- [SpeechBrain](https://speechbrain.github.io).
27
 
28
  The performance of the model is the following:
29
 
30
  | Release | Test CER | Test WER | GPUs |
31
- |:--------------:|:--------------:|:--------------:| :--------:|
32
- | 15-08-23 | 7.92 | 16.68 | 1xV100 32GB |
33
 
34
  ## Pipeline description
35
 
36
  This ASR system is composed of 2 different but linked blocks:
37
- - Tokenizer (unigram) that transforms words into subword units and trained with
38
- the train transcriptions (train.tsv) of CommonVoice (EN).
39
- - Acoustic model (wav2vec2.0 + CTC). A pretrained wav2vec 2.0 model ([wav2vec2-lv60-large](https://huggingface.co/facebook/wav2vec2-large-lv60)) is combined with two DNN layers and finetuned on CommonVoice En.
40
  The obtained final acoustic representation is given to the CTC decoder.
41
 
42
  The system is trained with recordings sampled at 16kHz (single channel).
@@ -53,13 +71,13 @@ pip install speechbrain transformers
53
  Please notice that we encourage you to read our tutorials and learn more about
54
  [SpeechBrain](https://speechbrain.github.io).
55
 
56
- ### Transcribing your own audio files (in English)
57
 
58
  ```python
59
- from speechbrain.pretrained import EncoderDecoderASR
60
 
61
- asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-14-en", savedir="pretrained_models/asr-wav2vec2-commonvoice-14-en")
62
- asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-14-en/example.wav")
63
 
64
  ```
65
  ### Inference on GPU
@@ -85,10 +103,10 @@ pip install -e .
85
  3. Run Training:
86
  ```bash
87
  cd recipes/CommonVoice/ASR/seq2seq
88
- python train.py hparams/train_en_with_wav2vec.yaml --data_folder=your_data_folder
89
  ```
90
 
91
- You can find our training results (models, logs, etc) [here](https://www.dropbox.com/sh/ch10cnbhf1faz3w/AACdHFG65LC6582H0Tet_glTa?dl=0).
92
 
93
  ### Limitations
94
  The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
@@ -113,4 +131,4 @@ Please, cite SpeechBrain if you use it for your research or business.
113
  primaryClass={eess.AS},
114
  note={arXiv:2106.04624}
115
  }
116
- ```
 
1
  ---
2
+ language:
3
+ - de
4
+ thumbnail: null
5
  pipeline_tag: automatic-speech-recognition
6
  tags:
7
  - CTC
8
  - pytorch
9
  - speechbrain
10
  - Transformer
11
+ license: apache-2.0
12
  datasets:
13
  - commonvoice
14
  metrics:
15
  - wer
16
  - cer
17
+ model-index:
18
+ - name: asr-wav2vec2-commonvoice-de
19
+ results:
20
+ - task:
21
+ name: Automatic Speech Recognition
22
+ type: automatic-speech-recognition
23
+ dataset:
24
+ name: CommonVoice Corpus 10.0/ (German)
25
+ type: mozilla-foundation/common_voice_10_1
26
+ config: de
27
+ split: test
28
+ args:
29
+ language: de
30
+ metrics:
31
+ - name: Test WER
32
+ type: wer
33
+ value: '9.54'
34
  ---
35
 
36
  <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
37
  <br/><br/>
38
 
39
+ # wav2vec 2.0 with CTC trained on CommonVoice German (No LM)
40
 
41
  This repository provides all the necessary tools to perform automatic speech
42
+ recognition from an end-to-end system pretrained on CommonVoice (German Language) within
43
  SpeechBrain. For a better experience, we encourage you to learn more about
44
+ [SpeechBrain](https://speechbrain.github.io).
45
 
46
  The performance of the model is the following:
47
 
48
  | Release | Test CER | Test WER | GPUs |
49
+ |:-------------:|:--------------:|:--------------:| :--------:|
50
+ | 16-08-22 | 2.40 | 9.54 | 1xRTXA6000 48GB |
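The CER and WER columns in the table above are edit-distance error rates. As a rough illustration of how such numbers are defined, here is a minimal pure-Python sketch. It is not the SpeechBrain scoring code: the function names are hypothetical, it scores a single utterance (a reported test score aggregates errors over the whole test set), and treating spaces as characters in the CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit errors as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are single characters (spaces included here)."""
    return error_rate(list(reference), list(hypothesis))
```

For example, `wer("das ist ein test", "das ist test")` counts one deleted word out of four reference words.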
51
 
52
  ## Pipeline description
53
 
54
  This ASR system is composed of 2 different but linked blocks:
55
+ - Tokenizer (char) that transforms words into chars and trained with
56
+ the train transcriptions (train.tsv) of CommonVoice (DE).
57
+ - Acoustic model (wav2vec2.0 + CTC). A pretrained wav2vec 2.0 model ([wav2vec2-large-xlsr-53-german](https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german)) is combined with two DNN layers and finetuned on CommonVoice DE.
58
  The obtained final acoustic representation is given to the CTC decoder.
59
 
60
  The system is trained with recordings sampled at 16kHz (single channel).
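Since the model expects 16 kHz single-channel input, it can help to inspect a recording before transcribing it. The following sketch uses only the Python standard library; the helper names are hypothetical, and it only handles uncompressed WAV files that the `wave` module can read.

```python
import wave

EXPECTED_RATE = 16000    # Hz, as stated in the README
EXPECTED_CHANNELS = 1    # single channel


def wav_format(path):
    """Return (sample_rate, num_channels) read from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def matches_training_format(path):
    """True if the file is already 16 kHz mono, i.e. needs no resampling or downmix."""
    rate, channels = wav_format(path)
    return rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
```

A file that fails this check would need to be resampled or downmixed before it matches the training conditions.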
 
71
  Please notice that we encourage you to read our tutorials and learn more about
72
  [SpeechBrain](https://speechbrain.github.io).
73
 
74
+ ### Transcribing your own audio files (in German)
75
 
76
  ```python
77
+ from speechbrain.pretrained import EncoderASR
78
 
79
+ asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-de", savedir="pretrained_models/asr-wav2vec2-commonvoice-de")
80
+ asr_model.transcribe_file("speechbrain/asr-wav2vec2-commonvoice-de/example-de.wav")
81
 
82
  ```
83
  ### Inference on GPU
 
103
  3. Run Training:
104
  ```bash
105
  cd recipes/CommonVoice/ASR/seq2seq
106
+ python train.py hparams/train_de_with_wav2vec.yaml --data_folder=your_data_folder
107
  ```
108
 
109
+ You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/19G2Zm8896QSVDqVfs7PS_W86-K0-5xeC?usp=sharing).
110
 
111
  ### Limitations
112
  The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
 
131
  primaryClass={eess.AS},
132
  note={arXiv:2106.04624}
133
  }
134
+ ```
config.json CHANGED
@@ -1,69 +1,69 @@
1
  {
2
- "speechbrain_interface": "EncoderDecoderASR",
3
- "activation_dropout": 0.1,
4
- "apply_spec_augment": true,
5
- "architectures": [
6
- "Wav2Vec2Model"
7
- ],
8
- "attention_dropout": 0.1,
9
- "bos_token_id": 1,
10
- "conv_bias": true,
11
- "conv_dim": [
12
- 512,
13
- 512,
14
- 512,
15
- 512,
16
- 512,
17
- 512,
18
- 512
19
- ],
20
- "conv_kernel": [
21
- 10,
22
- 3,
23
- 3,
24
- 3,
25
- 3,
26
- 2,
27
- 2
28
- ],
29
- "conv_stride": [
30
- 5,
31
- 2,
32
- 2,
33
- 2,
34
- 2,
35
- 2,
36
- 2
37
- ],
38
- "ctc_loss_reduction": "sum",
39
- "ctc_zero_infinity": false,
40
- "do_stable_layer_norm": true,
41
- "eos_token_id": 2,
42
- "feat_extract_activation": "gelu",
43
- "feat_extract_dropout": 0.0,
44
- "feat_extract_norm": "layer",
45
- "feat_proj_dropout": 0.1,
46
- "final_dropout": 0.1,
47
- "gradient_checkpointing": false,
48
- "hidden_act": "gelu",
49
- "hidden_dropout": 0.1,
50
- "hidden_dropout_prob": 0.1,
51
- "hidden_size": 1024,
52
- "initializer_range": 0.02,
53
- "intermediate_size": 4096,
54
- "layer_norm_eps": 1e-05,
55
- "layerdrop": 0.1,
56
- "mask_feature_length": 10,
57
- "mask_feature_prob": 0.0,
58
- "mask_time_length": 10,
59
- "mask_time_prob": 0.05,
60
- "model_type": "wav2vec2",
61
- "num_attention_heads": 16,
62
- "num_conv_pos_embedding_groups": 16,
63
- "num_conv_pos_embeddings": 128,
64
- "num_feat_extract_layers": 7,
65
- "num_hidden_layers": 24,
66
- "pad_token_id": 0,
67
- "transformers_version": "4.4.0.dev0",
68
- "vocab_size": 32
69
  }
 
1
  {
2
+ "speechbrain_interface": "EncoderASR",
3
+ "activation_dropout": 0.1,
4
+ "apply_spec_augment": true,
5
+ "architectures": [
6
+ "Wav2Vec2Model"
7
+ ],
8
+ "attention_dropout": 0.1,
9
+ "bos_token_id": 1,
10
+ "conv_bias": true,
11
+ "conv_dim": [
12
+ 512,
13
+ 512,
14
+ 512,
15
+ 512,
16
+ 512,
17
+ 512,
18
+ 512
19
+ ],
20
+ "conv_kernel": [
21
+ 10,
22
+ 3,
23
+ 3,
24
+ 3,
25
+ 3,
26
+ 2,
27
+ 2
28
+ ],
29
+ "conv_stride": [
30
+ 5,
31
+ 2,
32
+ 2,
33
+ 2,
34
+ 2,
35
+ 2,
36
+ 2
37
+ ],
38
+ "ctc_loss_reduction": "sum",
39
+ "ctc_zero_infinity": false,
40
+ "do_stable_layer_norm": true,
41
+ "eos_token_id": 2,
42
+ "feat_extract_activation": "gelu",
43
+ "feat_extract_dropout": 0.0,
44
+ "feat_extract_norm": "layer",
45
+ "feat_proj_dropout": 0.1,
46
+ "final_dropout": 0.1,
47
+ "gradient_checkpointing": false,
48
+ "hidden_act": "gelu",
49
+ "hidden_dropout": 0.1,
50
+ "hidden_dropout_prob": 0.1,
51
+ "hidden_size": 1024,
52
+ "initializer_range": 0.02,
53
+ "intermediate_size": 4096,
54
+ "layer_norm_eps": 1e-05,
55
+ "layerdrop": 0.1,
56
+ "mask_feature_length": 10,
57
+ "mask_feature_prob": 0.0,
58
+ "mask_time_length": 10,
59
+ "mask_time_prob": 0.05,
60
+ "model_type": "wav2vec2",
61
+ "num_attention_heads": 16,
62
+ "num_conv_pos_embedding_groups": 16,
63
+ "num_conv_pos_embeddings": 128,
64
+ "num_feat_extract_layers": 7,
65
+ "num_hidden_layers": 24,
66
+ "pad_token_id": 0,
67
+ "transformers_version": "4.21.1",
68
+ "vocab_size": 32
69
  }
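The `conv_kernel` and `conv_stride` lists in this config determine how many acoustic frames the wav2vec 2.0 feature extractor produces per second of audio. The sketch below works this out from the values shown above, assuming unpadded 1-D convolutions; the function names are illustrative and not part of any library.

```python
import math

# Values copied from config.json above.
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE = 16000  # Hz


def total_stride(strides):
    """Number of input samples between two consecutive output frames."""
    return math.prod(strides)


def receptive_field(kernels, strides):
    """Number of input samples seen by one output frame."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump
        jump *= s
    return field


def num_frames(num_samples, kernels, strides):
    """Output length after the conv stack, assuming no padding."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length
```

With these values the total stride is 320 samples (20 ms at 16 kHz) and each frame covers 400 samples (25 ms), so one second of audio yields 49 frames for the CTC layer to label.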
hyperparams.yaml CHANGED
@@ -1,138 +1,91 @@
1
  # ################################
2
- # Model: wav2vec2 + DNN + CTC/Attention
3
  # Augmentation: SpecAugment
4
- # Authors: Titouan Parcollet 2021
5
  # ################################
6
 
7
- sample_rate: 16000
8
- wav2vec2_hub: facebook/wav2vec2-large-lv60
9
-
10
  # BPE parameters
11
- token_type: unigram # ["unigram", "bpe", "char"]
12
  character_coverage: 1.0
13
 
14
  # Model parameters
15
- activation: !name:torch.nn.LeakyReLU
16
- dnn_layers: 2
17
  dnn_neurons: 1024
18
- emb_size: 128
19
- dec_neurons: 1024
20
 
21
  # Outputs
22
- output_neurons: 1000 # BPE size, index(blank/eos/bos) = 0
23
 
24
  # Decoding parameters
25
  # Be sure that the bos and eos index match with the BPEs ones
26
  blank_index: 0
27
  bos_index: 1
28
  eos_index: 2
29
- min_decode_ratio: 0.0
30
- max_decode_ratio: 1.0
31
- beam_size: 10
32
- eos_threshold: 1.5
33
- using_max_attn_shift: True
34
- max_attn_shift: 140
35
- ctc_weight_decode: 0.0
36
- temperature: 1.50
37
-
38
- # enc: !new:speechbrain.lobes.models.VanillaNN.VanillaNN
39
- # input_shape: [null, null, 1024]
40
- # activation: !ref <activation>
41
- # dnn_blocks: !ref <dnn_layers>
42
- # dnn_neurons: !ref <dnn_neurons>
43
 
44
  enc: !new:speechbrain.nnet.containers.Sequential
45
- input_shape: [null, null, 1024]
46
- linear1: !name:speechbrain.nnet.linear.Linear
47
- n_neurons: !ref <dnn_neurons>
48
- bias: True
49
- bn1: !name:speechbrain.nnet.normalization.BatchNorm1d
50
- activation: !new:torch.nn.LeakyReLU
51
-
52
- linear2: !name:speechbrain.nnet.linear.Linear
53
- n_neurons: !ref <dnn_neurons>
54
- bias: True
55
- bn2: !name:speechbrain.nnet.normalization.BatchNorm1d
56
- activation2: !new:torch.nn.LeakyReLU
57
-
58
- linear3: !name:speechbrain.nnet.linear.Linear
59
- n_neurons: !ref <dnn_neurons>
60
- bias: True
61
- bn3: !name:speechbrain.nnet.normalization.BatchNorm1d
62
- activation3: !new:torch.nn.LeakyReLU
63
 
64
  wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
65
- source: !ref <wav2vec2_hub>
66
- output_norm: True
67
- freeze: True
68
- save_path: model_checkpoints
69
-
70
- emb: !new:speechbrain.nnet.embedding.Embedding
71
- num_embeddings: !ref <output_neurons>
72
- embedding_dim: !ref <emb_size>
73
-
74
- dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
75
- enc_dim: !ref <dnn_neurons>
76
- input_size: !ref <emb_size>
77
- rnn_type: gru
78
- attn_type: location
79
- hidden_size: 1024
80
- attn_dim: 1024
81
- num_layers: 1
82
- scaling: 1.0
83
- channels: 10
84
- kernel_size: 100
85
- re_init: True
86
- dropout: 0.0
87
 
88
  ctc_lin: !new:speechbrain.nnet.linear.Linear
89
- input_size: !ref <dnn_neurons>
90
- n_neurons: !ref <output_neurons>
91
-
92
- seq_lin: !new:speechbrain.nnet.linear.Linear
93
- input_size: !ref <dec_neurons>
94
- n_neurons: !ref <output_neurons>
95
 
96
  log_softmax: !new:speechbrain.nnet.activations.Softmax
97
- apply_log: True
98
 
99
  ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
100
- blank_index: !ref <blank_index>
101
-
102
- seq_cost: !name:speechbrain.nnet.losses.nll_loss
103
- label_smoothing: 0.1
104
 
105
  asr_model: !new:torch.nn.ModuleList
106
- - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]
107
 
108
  tokenizer: !new:sentencepiece.SentencePieceProcessor
109
 
110
  encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
111
  wav2vec2: !ref <wav2vec2>
112
  enc: !ref <enc>
113
-
114
- decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
115
- embedding: !ref <emb>
116
- decoder: !ref <dec>
117
- linear: !ref <seq_lin>
118
- ctc_linear: !ref <ctc_lin>
119
- bos_index: !ref <bos_index>
120
- eos_index: !ref <eos_index>
121
- blank_index: !ref <blank_index>
122
- min_decode_ratio: !ref <min_decode_ratio>
123
- max_decode_ratio: !ref <max_decode_ratio>
124
- beam_size: !ref <beam_size>
125
- eos_threshold: !ref <eos_threshold>
126
- using_max_attn_shift: !ref <using_max_attn_shift>
127
- max_attn_shift: !ref <max_attn_shift>
128
- temperature: !ref <temperature>
129
 
130
  modules:
131
- encoder: !ref <encoder>
132
- decoder: !ref <decoder>
133
 
134
  pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
135
- loadables:
136
- wav2vec2: !ref <wav2vec2>
137
- asr: !ref <asr_model>
138
- tokenizer: !ref <tokenizer>
 
1
+ # Generated 2022-08-12 from:
2
+ # /netscratch/sagar/thesis/speechbrain/recipes/CommonVoice_de/ASR/CTC/hparams/train_with_wav2vec.yaml
3
+ # yamllint disable
4
  # ################################
5
+ # Model: wav2vec2 + DNN + CTC
6
  # Augmentation: SpecAugment
7
+ # Authors: Sung-Lin Yeh 2021
8
  # ################################
9
 
10
  # BPE parameters
11
+ token_type: char # ["unigram", "bpe", "char"]
12
  character_coverage: 1.0
13
 
14
  # Model parameters
15
+ # activation: !name:torch.nn.LeakyReLU
 
16
  dnn_neurons: 1024
17
+ wav2vec_output_dim: 1024
18
+ dropout: 0.15
19
+
20
+ sample_rate: 16000
21
+
22
+ wav2vec2_hub: facebook/wav2vec2-large-xlsr-53-german
23
 
24
  # Outputs
25
+ output_neurons: 32 # BPE size, index(blank/eos/bos) = 0
26
 
27
  # Decoding parameters
28
  # Be sure that the bos and eos index match with the BPEs ones
29
  blank_index: 0
30
  bos_index: 1
31
  eos_index: 2
32
 
33
  enc: !new:speechbrain.nnet.containers.Sequential
34
+ input_shape: [null, null, !ref <wav2vec_output_dim>]
35
+ linear1: !name:speechbrain.nnet.linear.Linear
36
+ n_neurons: !ref <dnn_neurons>
37
+ bias: True
38
+ bn1: !name:speechbrain.nnet.normalization.BatchNorm1d
39
+ activation: !new:torch.nn.LeakyReLU
40
+ drop: !new:torch.nn.Dropout
41
+ p: !ref <dropout>
42
+ linear2: !name:speechbrain.nnet.linear.Linear
43
+ n_neurons: !ref <dnn_neurons>
44
+ bias: True
45
+ bn2: !name:speechbrain.nnet.normalization.BatchNorm1d
46
+ activation2: !new:torch.nn.LeakyReLU
47
+ drop2: !new:torch.nn.Dropout
48
+ p: !ref <dropout>
49
+ linear3: !name:speechbrain.nnet.linear.Linear
50
+ n_neurons: !ref <dnn_neurons>
51
+ bias: True
52
+ bn3: !name:speechbrain.nnet.normalization.BatchNorm1d
53
+ activation3: !new:torch.nn.LeakyReLU
54
 
55
  wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
56
+ source: !ref <wav2vec2_hub>
57
+ output_norm: True
58
+ freeze: True
59
+ save_path: wav2vec2_checkpoint
60
 
61
  ctc_lin: !new:speechbrain.nnet.linear.Linear
62
+ input_size: !ref <dnn_neurons>
63
+ n_neurons: !ref <output_neurons>
64
 
65
  log_softmax: !new:speechbrain.nnet.activations.Softmax
66
+ apply_log: True
67
 
68
  ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
69
+ blank_index: !ref <blank_index>
70
 
71
  asr_model: !new:torch.nn.ModuleList
72
+ - [!ref <enc>, !ref <ctc_lin>]
73
 
74
  tokenizer: !new:sentencepiece.SentencePieceProcessor
75
 
76
  encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
77
  wav2vec2: !ref <wav2vec2>
78
  enc: !ref <enc>
79
+ ctc_lin: !ref <ctc_lin>
80
 
81
  modules:
82
+ encoder: !ref <encoder>
83
+
84
+ decoding_function: !name:speechbrain.decoders.ctc_greedy_decode
85
+ blank_id: !ref <blank_index>
86
 
87
  pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
88
+ loadables:
89
+ wav2vec2: !ref <wav2vec2>
90
+ asr: !ref <asr_model>
91
+ tokenizer: !ref <tokenizer>
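
The new `hyperparams.yaml` replaces the attentional beam-search decoder with `speechbrain.decoders.ctc_greedy_decode` and `blank_id` set to `blank_index` (0). The sketch below illustrates the standard greedy CTC rule in plain Python: take the best token per frame, merge consecutive repeats, then drop blanks. It is not the SpeechBrain implementation, and the function names are hypothetical.

```python
from itertools import groupby

BLANK_INDEX = 0  # matches blank_index in hyperparams.yaml


def argmax_per_frame(log_probs):
    """Pick the most likely token index for each frame (list of per-frame score lists)."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_INDEX):
    """Merge consecutive repeated indices, then remove the blank symbol."""
    return [tok for tok, _ in groupby(frame_ids) if tok != blank_id]
```

For instance, the frame sequence `[0, 3, 3, 0, 3, 4, 4, 0]` collapses to `[3, 3, 4]`: the blank between the two runs of `3` is what lets a repeated token survive. The resulting indices would then be mapped back to characters by the tokenizer.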