---
pretty_name: "WhisperKit ASR Evaluation Results"
tags:
- whisper
- whisperkit
- coreml
- asr
- quantized
---

# WhisperKit Evaluation Results

## Dataset: `librispeech`

### WhisperKit + `openai_whisper-large-v3` (+optimized variants)

| Model | WER | QoI (%) | File Size (MB) |
|:---|---:|---:|---:|
| [openai_whisper-large-v3](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3) | 2.44 | 100.0 | 3100 |
| [openai_whisper-large-v3_turbo](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo) | 2.41 | 99.8 | 3100 |
| [openai_whisper-large-v3_turbo_1307MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1307MB) | 2.60 | 97.7 | 1307 |
| [openai_whisper-large-v3_turbo_1049MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_turbo_1049MB) | 4.81 | 91.0 | 1049 |
| [openai_whisper-large-v3_1053MB](https://huggingface.co/argmaxinc/whisperkit-coreml-rc1/tree/main/openai_whisper-large-v3_1053MB) | 4.65 | 90.8 | 1053 |

### Different Projects + `openai_whisper-large-v3`

| Project | WER | Commit Hash | Model Format |
|:---|:---|:---|:---|
| [WhisperKit](https://github.com/argmaxinc/whisperkit) | [2.44](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperKit/openai_whisper-large-v3/librispeech) | 0f8b4fe | Core ML |
| [WhisperCpp](https://github.com/ggerganov/whisper.cpp) | [2.36](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/whisper.cpp/openai_whisper-large-v3/librispeech) | e72e415 | Core ML + GGUF |
| [WhisperMLX](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py) | [2.69](https://hf.co/datasets/argmaxinc/whisperkit-evals-rc1/tree/main/WhisperMLX/openai_whisper-large-v3/librispeech) | 614de66 | MLX (NumPy) |

### Quality-of-Inference (QoI) Certification

We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
Whisper models in production. The current measurements compare reference and optimized
WhisperKit models. We will soon extend the scope of this measurement to other Whisper
implementations so that developers can certify any behavior change caused by using
WhisperKit alongside (or migrating from) these implementations.

In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric than dataset-average WER. A 100% `qoi` preserves perfect
backwards-compatibility on the test distribution and avoids "perceived regressions": the phenomenon
where known per-example behavior changes after a code/model update and causes divergence in
downstream code or breaks the user experience itself (even if dataset averages stay flat
across updates). Pseudocode for `qoi`:

```python
qoi = []
for example in dataset:
    # Per-example no-regression: the optimized model must be at least as
    # accurate as the reference model on this example
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
qoi = (sum(qoi) / len(qoi)) * 100.0
```

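For concreteness, the pseudocode above can be made runnable. The sketch below assumes per-example transcripts are already available as plain strings and uses the third-party `jiwer` package to compute WER; the toy transcripts are invented for illustration, and `jiwer` is our choice here, not necessarily what whisperkittools uses internally.

```python
# Runnable sketch of the qoi metric, assuming transcripts are plain strings.
# `jiwer` (pip install jiwer) is an illustrative WER implementation choice.
import jiwer

# Toy data: ground-truth transcripts plus the outputs of the reference
# (float16) and optimized (compressed) models on the same audio clips.
references = ["hello world", "the quick brown fox"]
reference_outputs = ["hello world", "the quick brown fox"]
optimized_outputs = ["hello world", "the quick brown socks"]

qoi = []
for ref, base_hyp, opt_hyp in zip(references, reference_outputs, optimized_outputs):
    # No regression on this example: optimized WER does not exceed reference WER
    no_regression = jiwer.wer(ref, opt_hyp) <= jiwer.wer(ref, base_hyp)
    qoi.append(no_regression)

qoi = (sum(qoi) / len(qoi)) * 100.0
print(f"QoI: {qoi:.1f}%")  # 50.0 for this toy data: one example regressed
```
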
We define the reference model as the default float16 precision Core ML model generated by
whisperkittools. This reference model matches the accuracy of the original PyTorch model
on the specified test sets. We use `librispeech/test.clean` (5 hours of short English audio clips)
as our testing set for Whisper, and we are actively expanding coverage to `earnings22`
(120 hours of long English audio clips with various accents). We anticipate that developers who use
Whisper in production will have their own Quality Assurance test sets, and whisperkittools offers
the tooling necessary to run the same measurements on such custom test sets. Please see
[Model Evaluation on Custom Dataset](#evaluate-on-custom-dataset) for details.

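As a rough illustration of what such a custom QA measurement could look like, here is a hedged sketch. The file paths, expected transcripts, and the `transcribe_reference`/`transcribe_optimized` placeholders are hypothetical stand-ins, not whisperkittools APIs; consult the section linked above for the supported workflow.

```python
# Hedged sketch: per-example QoI over a custom QA test set. The paths,
# transcripts, and transcribe_* functions below are hypothetical placeholders.
import jiwer

custom_test_set = [
    ("clips/meeting_001.wav", "let's move on to the next agenda item"),
    ("clips/support_042.wav", "thanks for calling how can i help you"),
]

def transcribe_reference(audio_path: str) -> str:
    # Placeholder: run the float16 reference model on audio_path
    return "let's move on to the next agenda item"

def transcribe_optimized(audio_path: str) -> str:
    # Placeholder: run the compressed/optimized variant on audio_path
    return "let's move on to the next agenda item"

qoi = []
for audio_path, expected in custom_test_set:
    no_regression = (
        jiwer.wer(expected, transcribe_optimized(audio_path))
        <= jiwer.wer(expected, transcribe_reference(audio_path))
    )
    qoi.append(no_regression)

print(f"Custom-set QoI: {sum(qoi) / len(qoi) * 100.0:.1f}%")
```
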
### Reproducing Results

Results on this page are generated by our cluster of Apple Silicon Macs, which serve as self-hosted runners on
GitHub Actions for our CI infrastructure. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a
`librispeech` + `openai/whisper-large-v3` evaluation in under 1 hour regardless of the Whisper implementation.
Older Apple Silicon Macs should take less than 1 day to complete the same evaluation.

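To launch such a local run, whisperkittools exposes an evaluation entry point. The command name and flags in the sketch below are assumptions based on the whisperkittools repository at the time of writing (check its README for the current interface); the snippet simply shells out to it from Python.

```python
# Hedged sketch: launching a local evaluation job. The CLI name and flags
# are assumptions; verify them against the whisperkittools README.
import subprocess

subprocess.run(
    [
        "whisperkit-evaluate-model",
        "--model-version", "openai/whisper-large-v3",
        "--output-dir", "evals",  # arbitrary local results directory
        "--evaluation-dataset", "librispeech",
    ],
    check=True,  # raise if the evaluation job fails
)
```
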
Glossary:

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription,
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).

- `_*MB`: Indicates the presence of mixed-bit quantization. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits`, we summarize the compression spec as the resulting total file size,
since this is what matters most to developers in production.