TanelAlumae committed · Commit 8ca127d
Parent(s): 9744186

New version of the model

Files changed:
- README.md +21 -9
- config.json +1 -1
- pytorch_model.bin +1 -1
README.md
CHANGED

```diff
@@ -19,10 +19,10 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value:
+      value: 12.03
     - name: Test CER
       type: cer
-      value: 3.
+      value: 3.18
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -34,16 +34,16 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 11.
+      value: 11.35
    - name: Test CER
       type: cer
-      value: 2.
+      value: 2.75
 ---
 
 
 # Whisper-large-et
 
-This is a Whisper-large-v2 model [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) finetuned on around
+This is a Whisper-large-v2 model [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) finetuned on around 1200 hours of diverse Estonian data.
 
 ## Model description
 This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
@@ -55,7 +55,15 @@ This model is intended for general-purpose speech recognition, such as broadcast
 
 ## How to use
 
-
+Recommended: use [faster-whisper](https://github.com/guillaumekln/faster-whisper).
+
+For example:
+
+* Convert the HF model to CT2 format:
+
+  `ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16`
+
+* Decode: `whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3`
 
 
 #### Limitations and bias
@@ -72,12 +80,12 @@ Acoustic training data:
 
 | Type                  | Amount (h) |
 |-----------------------|:------:|
-| Broadcast speech      |
+| Broadcast speech      | 991 |
 | Spontaneous speech    | 53 |
 | Elderly speech corpus | 53 |
 | Talks, lectures       | 49 |
 | Parliament speeches   | 31 |
-| *Total*               | *
+| *Total*               | *1161* |
 
 
 
@@ -87,6 +95,10 @@ Finetuned using ESPnet, and then converted to transformers format using [this](h
 Finetuning procedure is similar to [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
 Finetuning was done for 3 epochs, with model averaging at the end of training.
 
+*Update*: the 2023-10-03 version of the model is trained on long segments (like the original Whisper model) and
+is therefore especially well suited to be used e.g. with [faster-whisper](https://github.com/guillaumekln/faster-whisper) to
+transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).
+
 ## Evaluation results
 
 ### WER
@@ -95,5 +107,5 @@ WER results below are obtained using greedy decoding (i.e., beam size 1).
 
 |Dataset | WER |
 |---|---|
-| Common Voice 8.0 | 11.
+| Common Voice 8.0 | 11.3 |
 | Common Voice 11.0 | 12.0 |
```
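The convert-then-decode steps added to the "How to use" section can also be driven from Python through faster-whisper's own API instead of the `whisper-ctranslate2` CLI. The sketch below is illustrative and not part of the model card: the `WhisperModel` constructor and `transcribe` parameters follow the faster-whisper project README, and the local path `whisper-large-et.ct2` assumes the `ct2-transformers-converter` step above has already been run.

```python
# Sketch: transcribing with the converted CTranslate2 model via faster-whisper.
# Assumes `pip install faster-whisper` and that the conversion step above
# produced the local directory `whisper-large-et.ct2`.
from faster_whisper import WhisperModel

# float16 mirrors the --quantization float16 used during conversion;
# compute_type="int8" is a common choice for CPU-only machines.
model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# Long recordings can be passed directly; faster-whisper segments them itself,
# which suits the long-segment training of the 2023-10-03 model version.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```

The segment objects carry start/end timestamps in seconds, so the loop above doubles as a simple subtitle dump.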
config.json
CHANGED

```diff
@@ -28,7 +28,7 @@
     "forced_decoder_ids": [
       [
         1,
-
+        50259
       ],
       [
         2,
```
pytorch_model.bin
CHANGED

```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:8f8edba2e2b8974654d430b0ffe9d6bb1e7a394e84f226fe7a5acaf3bc94d6f3
 size 6173637880
```