aotrih committed on
Commit 74ce4fc · verified · 1 Parent(s): ebe076a

whisperkittools generated README.md

Files changed (1): README.md (+84 -1)
README.md CHANGED

---
pretty_name: "WhisperKit ASR Evaluation Results"
tags:
- whisper
- whisperkit
- coreml
- asr
- quantized
---
# WhisperKit Evaluation Results

## Dataset: `librispeech`

### WhisperKit + `openai_whisper-large-v3` (+optimized variants)

| Model | WER | QoI (%) | File Size (MB) |
|:-----------------------------------------------------------------------------------------------------------------------------------------------|------:|----------:|-----------------:|
| [openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3) | 2.44 | 100 | 3100 |
| [openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo) | 2.41 | 99.8 | 3100 |
| [openai_whisper-large-v3_turbo_1307MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1307MB) | 2.60 | 97.7 | 1307 |
| [openai_whisper-large-v3_turbo_1049MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1049MB) | 4.81 | 91.0 | 1049 |
| [openai_whisper-large-v3_1053MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_1053MB) | 4.65 | 90.8 | 1053 |
### Different Projects + `openai_whisper-large-v3`

| Project | WER | Commit Hash | Model Format |
|:-------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------|:--------------|:---------------|
| [WhisperKit](https://github.com/argmaxinc/whisperkit) | [2.44](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 0f8b4fe | Core ML |
| [WhisperCpp](https://github.com/ggerganov/whisper.cpp) | [2.36](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/whisper.cpp/openai_whisper-large-v3/librispeech) | e72e415 | Core ML + GGUF |
| [WhisperMLX](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py) | [2.69](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperMLX/openai_whisper-large-v3/librispeech) | 614de66 | MLX (NumPy) |
### Quality-of-Inference (QoI) Certification

We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
Whisper models in production. The current measurements compare reference and optimized
WhisperKit models. We will soon extend the scope of this measurement to other Whisper
implementations so developers can certify the behavior change (if any) caused by
alternating between WhisperKit and these implementations, or by migrating from them.

In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric than dataset-average WER. A 100% `qoi` preserves perfect
backwards-compatibility on the test distribution and avoids "perceived regressions": the phenomenon
where known per-example behavior changes after a code or model update and causes divergence in
downstream code or breaks the user experience itself, even if dataset averages stay flat
across updates. Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # No regression: the optimized model's WER on this example is no worse
    # than the reference model's WER (lower is better)
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.0
```
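
As a concrete illustration, here is a self-contained sketch of the same metric with a minimal word-level WER (edit-distance) implementation. The `wer` helper and the toy transcripts are our own illustration, not the actual whisperkittools implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against the empty ref prefix
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

def qoi(examples) -> float:
    """examples: (ground_truth, reference_output, optimized_output) triples."""
    no_regressions = [
        wer(truth, optimized) <= wer(truth, reference)
        for truth, reference, optimized in examples
    ]
    return 100.0 * sum(no_regressions) / len(no_regressions)

examples = [
    # (ground truth, reference model output, optimized model output)
    ("the quick brown fox", "the quick brown fox", "the quick brown fox"),
    ("hello world again", "hello world again", "hello word again"),
]
print(qoi(examples))  # 50.0: one tie counts as no-regression, one regression
```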

We define the reference model as the default float16-precision Core ML model generated by
whisperkittools. This reference model matches the accuracy of the original PyTorch model
on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips)
as our test set for Whisper, and we are actively expanding coverage to `earnings22`
(120 hours of long English audio clips with various accents). We anticipate that developers who
use Whisper in production will have their own Quality Assurance test sets, and whisperkittools
offers the tooling necessary to run the same measurements on such custom test sets. Please see
[Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.

### Reproducing Results

Results on this page are generated by our cluster of Apple Silicon Macs, which serve as self-hosted runners on
GitHub Actions for our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a
`librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper
implementation, and older Apple Silicon Macs should take less than 1 day to complete the same evaluation.

Glossary:

- `_turbo`: Indicates the presence of additional optimizations (not compression) that unlock streaming transcription,
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).

- `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we summarize the compression spec as the resulting total file size, since this is what matters to developers in production.
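
The mapping from an average-bits-per-parameter spec to the file-size suffix is back-of-the-envelope arithmetic. In the sketch below, the per-module parameter counts are hypothetical placeholders, not the real whisper-large-v3 module sizes:

```python
# Hypothetical sketch: how a mixed-bit compression spec (average bits per
# parameter for each module) rolls up into the single file-size number used
# as the model-name suffix. Parameter counts are illustrative placeholders.
modules = {
    "AudioEncoder": (635_000_000, 5.8),  # (parameters, average bits/param)
    "TextDecoder": (907_000_000, 6.1),
}

total_bytes = sum(params * bits / 8 for params, bits in modules.values())
print(f"~{total_bytes / 1e6:.0f} MB")  # ~1152 MB for these placeholder values
```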