Commit 8ca127d by TanelAlumae · 1 Parent(s): 9744186

New version of the model

Files changed (3)
  1. README.md +21 -9
  2. config.json +1 -1
  3. pytorch_model.bin +1 -1
README.md CHANGED
@@ -19,10 +19,10 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 11.99
+ value: 12.03
  - name: Test CER
  type: cer
- value: 3.21
+ value: 3.18
  - task:
  name: Automatic Speech Recognition
  type: automatic-speech-recognition
@@ -34,16 +34,16 @@ model-index:
  metrics:
  - name: Test WER
  type: wer
- value: 11.22
+ value: 11.35
  - name: Test CER
  type: cer
- value: 2.813
+ value: 2.75
  ---
 
 
  # Whisper-large-et
 
- This is a Whisper-large-v2 model [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) finetuned on around 800 hours of diverse Estonian data.
+ This is a Whisper-large-v2 model [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) finetuned on around 1200 hours of diverse Estonian data.
 
  ## Model description
  This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
@@ -55,7 +55,15 @@ This model is intended for general-purpose speech recognition, such as broadcast
 
  ## How to use
 
- Use as any other Whisper model via HF transformers, or use a faster decoder like [faster-whisper](https://github.com/guillaumekln/faster-whisper).
+ Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper).
+
+ For example:
+
+ * Convert the HF model to CT2 format:
+
+ `ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16`
+
+ * Decode: `whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3`
 
 
  #### Limitations and bias
@@ -72,12 +80,12 @@ Acoustic training data:
 
  | Type | Amount (h) |
  |-----------------------|:------:|
- | Broadcast speech | 591 |
+ | Broadcast speech | 991 |
  | Spontaneous speech | 53 |
  | Elderly speech corpus | 53 |
  | Talks, lectures | 49 |
  | Parliament speeches | 31 |
- | *Total* | *761* |
+ | *Total* | *1161* |
 
 
 
@@ -87,6 +95,10 @@ Finetuned using Espnet, and then converted to transformers format using [this](h
  Finetuning procedure is similar to [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
  Finetuning was done for 3 epochs, with model averaging at the end of training.
 
+ *Update*: The 2023-10-03 version of the model is trained on long segments (like the original Whisper model) and
+ is therefore especially well suited for use with, e.g., [faster-whisper](https://github.com/guillaumekln/faster-whisper) to
+ transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).
+
  ## Evaluation results
 
  ### WER
@@ -95,5 +107,5 @@ WER results below are obtained using greedy decoding (i.e., beam size 1).
 
  |Dataset | WER |
  |---|---|
- | Common Voice 8.0 | 11.2 |
+ | Common Voice 8.0 | 11.3 |
  | Common Voice 11.0 | 12.0 |
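
The conversion and decoding steps that this commit adds to the README's "How to use" section can also be driven from Python. The following is a minimal sketch, not the author's method: `convert_command` rebuilds the README's `ct2-transformers-converter` call as an argument list, and `transcribe` swaps the `whisper-ctranslate2` CLI for the faster-whisper Python API (`WhisperModel.transcribe`) with the same language, task, and beam size. The helper names and the `device="cuda"` default are assumptions made for illustration.

```python
import shlex

MODEL_ID = "TalTechNLP/whisper-large-et"  # model name taken from the README
CT2_DIR = "whisper-large-et.ct2"          # output directory taken from the README


def convert_command(model_id=MODEL_ID, output_dir=CT2_DIR):
    """Return the README's ct2-transformers-converter call as an argument list."""
    return [
        "ct2-transformers-converter",
        "--model", model_id,
        "--output_dir", output_dir,
        "--copy_files", "tokenizer.json",
        "--quantization", "float16",
    ]


def transcribe(audio_path, model_dir=CT2_DIR, beam_size=5, device="cuda"):
    """Transcribe one file with faster-whisper; returns (start, end, text) tuples."""
    # Imported lazily so the command helper above works without faster-whisper.
    from faster_whisper import WhisperModel

    # Assumption: a GPU with float16 support, matching the float16 conversion.
    model = WhisperModel(model_dir, device=device, compute_type="float16")
    segments, _info = model.transcribe(
        audio_path, language="et", task="transcribe", beam_size=beam_size
    )
    # `segments` is a generator; decoding happens while it is consumed.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]


# Show the conversion command exactly as it would be run in a shell.
print(shlex.join(convert_command()))
```

To use it, run the conversion once with `subprocess.run(convert_command(), check=True)` and then call `transcribe("some_file.mp3")`; both steps require the CTranslate2 converter and faster-whisper to be installed.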
config.json CHANGED
@@ -28,7 +28,7 @@
  "forced_decoder_ids": [
  [
  1,
- 50307
+ 50259
  ],
  [
  2,
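
For context on the `config.json` change: each `forced_decoder_ids` entry pairs a decoder position with a token that generation is forced to emit there, and position 1 normally holds the language token. Assuming the standard multilingual Whisper vocabulary, 50307 should correspond to `<|et|>` and 50259 to `<|en|>`, so the stored default appears to no longer force Estonian after this commit; passing the language explicitly, as the README's `--language et` does, sidesteps that. The sketch below only illustrates the data structure under that assumption. `LANG_TOKEN_IDS` and `force_language` are hypothetical names, not part of any library.

```python
# Assumed token ids in the multilingual Whisper vocabulary (not stated in the commit).
LANG_TOKEN_IDS = {"en": 50259, "et": 50307}


def force_language(forced_decoder_ids, lang):
    """Return a copy of forced_decoder_ids with position 1 set to lang's token."""
    token = LANG_TOKEN_IDS[lang]
    return [[pos, token if pos == 1 else tok] for pos, tok in forced_decoder_ids]


# Only position 1 is visible in the diff, so only that entry is shown here.
after_commit = [[1, 50259]]
print(force_language(after_commit, "et"))  # prints [[1, 50307]]
```

When using HF transformers, `WhisperProcessor.get_decoder_prompt_ids(language=..., task=...)` builds such a list from names, which avoids hard-coding ids.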
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7cf75ef7c93c718da4063d40b765fabe6825c6754d223801a2ebe944661ce0b9
+ oid sha256:8f8edba2e2b8974654d430b0ffe9d6bb1e7a394e84f226fe7a5acaf3bc94d6f3
  size 6173637880