---
pretty_name: "WhisperKit ASR Evaluation Results"
tags:
- whisper
- whisperkit
- coreml
- asr
- quantized
---

# WhisperKit Evaluation Results

## Dataset: `librispeech`

### WhisperKit + `openai_whisper-large-v3` (+optimized variants)

| Model | WER | QoI (%) | File Size (MB) |
|:---|---:|---:|---:|
| [openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3) | 2.44 | 100.0 | 3100 |
| [openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo) | 2.41 | 99.8 | 3100 |
| [openai_whisper-large-v3_turbo_1307MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1307MB) | 2.60 | 97.7 | 1307 |
| [openai_whisper-large-v3_turbo_1049MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1049MB) | 4.81 | 91.0 | 1049 |
| [openai_whisper-large-v3_1053MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_1053MB) | 4.65 | 90.8 | 1053 |

### Different Projects + `openai_whisper-large-v3`

| Project | WER | Commit Hash | Model Format |
|:---|:---|:---|:---|
| [WhisperKit](https://github.com/argmaxinc/whisperkit) | [2.44](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 0f8b4fe | Core ML |
| [WhisperCpp](https://github.com/ggerganov/whisper.cpp) | [2.36](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/whisper.cpp/openai_whisper-large-v3/librispeech) | e72e415 | Core ML + GGUF |
| [WhisperMLX](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py) | [2.69](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperMLX/openai_whisper-large-v3/librispeech) | 614de66 | MLX (NumPy) |

### Quality-of-Inference (QoI) Certification

We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
Whisper models in production. The current measurements compare reference and optimized
WhisperKit models. We will soon extend the scope of this measurement to other Whisper
implementations so that developers can certify any behavior change caused by using
WhisperKit alongside (or migrating from) these implementations.

In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric than dataset-average WER. A 100% `qoi` preserves perfect
backwards-compatibility on the test distribution and avoids "perceived regressions": the phenomenon
where known per-example behavior changes after a code/model update and causes divergence in
downstream code or breaks the user experience itself (even if dataset averages stay flat
across updates). Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # Per-example no-regression: the optimized model must be at least as
    # accurate as the reference model on this example
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.0
```

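For concreteness, the pseudocode above can be made runnable. The sketch below assumes per-example transcripts are already available as plain strings and uses the third-party `jiwer` package to compute WER; the toy transcripts are invented for illustration, and `jiwer` is our choice here, not necessarily what whisperkittools uses internally.

```python
# Runnable sketch of the qoi metric, assuming transcripts are plain strings.
# `jiwer` (pip install jiwer) is an illustrative WER implementation choice.
import jiwer

# Toy data: ground-truth transcripts plus the outputs of the reference
# (float16) and optimized (compressed) models on the same audio clips.
references = ["hello world", "the quick brown fox"]
reference_outputs = ["hello world", "the quick brown fox"]
optimized_outputs = ["hello world", "the quick brown socks"]

qoi = []
for ref, base_hyp, opt_hyp in zip(references, reference_outputs, optimized_outputs):
    # No regression on this example: optimized WER does not exceed reference WER
    no_regression = jiwer.wer(ref, opt_hyp) <= jiwer.wer(ref, base_hyp)
    qoi.append(no_regression)

qoi = (sum(qoi) / len(qoi)) * 100.0
print(f"QoI: {qoi:.1f}%")  # 50.0 for this toy data: one example regressed
```
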
We define the reference model as the default float16 precision Core ML model generated by
whisperkittools. This reference model matches the accuracy of the original PyTorch model
on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips)
as our testing set for Whisper, and we are actively expanding coverage to `earnings22`
(120 hours of long English audio clips with various accents). We anticipate that developers who use
Whisper in production will have their own Quality Assurance test sets, and whisperkittools offers
the tooling necessary to run the same measurements on such custom test sets. Please see
[Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.

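As a rough illustration of what such a custom QA measurement could look like, here is a hedged sketch. The file paths, expected transcripts, and the `transcribe_reference`/`transcribe_optimized` placeholders are hypothetical stand-ins, not whisperkittools APIs; consult the section linked above for the supported workflow.

```python
# Hedged sketch: per-example QoI over a custom QA test set. The paths,
# transcripts, and transcribe_* functions below are hypothetical placeholders.
import jiwer

custom_test_set = [
    ("clips/meeting_001.wav", "let's move on to the next agenda item"),
    ("clips/support_042.wav", "thanks for calling how can i help you"),
]

def transcribe_reference(audio_path: str) -> str:
    # Placeholder: run the float16 reference model on audio_path
    return "let's move on to the next agenda item"

def transcribe_optimized(audio_path: str) -> str:
    # Placeholder: run the compressed/optimized variant on audio_path
    return "let's move on to the next agenda item"

qoi = []
for audio_path, expected in custom_test_set:
    no_regression = (
        jiwer.wer(expected, transcribe_optimized(audio_path))
        <= jiwer.wer(expected, transcribe_reference(audio_path))
    )
    qoi.append(no_regression)

print(f"Custom-set QoI: {sum(qoi) / len(qoi) * 100.0:.1f}%")
```
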
### Reproducing Results

Results on this page are generated by our cluster of Apple Silicon Macs, which serve as self-hosted runners on
GitHub Actions for our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a
`librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper implementation.
Older Apple Silicon Macs should take less than 1 day to complete the same evaluation.

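To launch such a local run, whisperkittools exposes an evaluation entry point. The command name and flags in the sketch below are assumptions based on the whisperkittools repository at the time of writing (check its README for the current interface); the snippet simply shells out to it from Python.

```python
# Hedged sketch: launching a local evaluation job. The CLI name and flags
# are assumptions; verify them against the whisperkittools README.
import subprocess

subprocess.run(
    [
        "whisperkit-evaluate-model",
        "--model-version", "openai/whisper-large-v3",
        "--output-dir", "evals",  # arbitrary local results directory
        "--evaluation-dataset", "librispeech",
    ],
    check=True,  # raise if the evaluation job fails
)
```
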
Glossary:

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription,
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).

- `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we summarize the compression spec as the resulting total file size,
since this is what matters most to developers in production.