---
pretty_name: "WhisperKit ASR Evaluation Results"
viewer: false
library_name: whisperkit
tags:
- whisper
- whisperkit
- coreml
- asr
- quantized
---
# WhisperKit Transcription Quality



## Dataset: `librispeech`
Short-form Audio (<30s/clip) - 5 hours of English audiobook clips

|                                                                                                                               | WER (↓)                                                                                                                               |   QoI (↑) |   File Size (MB) | Code Commit                                                    |
|:------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------|----------:|-----------------:|:---------------------------------------------------------------|
| large-v2 (WhisperOpenAIAPI)                                                                                                   | [2.35](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/librispeech)              |     100   |             3100 | N/A                                                            |
| [large-v2](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2)                                       | [2.77](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2/librispeech)                    |      96.6 |             3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v2_949MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2_949MB)                           | [2.4](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_949MB/librispeech)               |      94.6 |              949 | [Link](https://github.com/argmaxinc/WhisperKit/commit/eca4a2e) |
| [large-v2_turbo](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2_turbo)                           | [2.76](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo/librispeech)              |      96.6 |             3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v2_turbo_955MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v2_turbo_955MB)               | [2.41](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v2_turbo_955MB/librispeech)        |      94.6 |              955 | [Link](https://github.com/argmaxinc/WhisperKit/commit/cf75348) |
| [large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3)                                       | [2.04](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/librispeech)                    |      95.2 |             3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v3_turbo](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3_turbo)                           | [2.03](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo/librispeech)              |      95.4 |             3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [large-v3_turbo_954MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3_turbo_954MB)               | [2.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3_turbo_954MB/librispeech)        |      93.9 |              954 | [Link](https://github.com/argmaxinc/WhisperKit/commit/cf75348) |
| [distil-large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3)                         | [2.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3/librispeech)             |      89.7 |             1510 | [Link](https://github.com/argmaxinc/WhisperKit/commit/cf75348) |
| [distil-large-v3_594MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3_594MB)             | [2.96](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3_594MB/librispeech)       |      85.4 |              594 | [Link](https://github.com/argmaxinc/WhisperKit/commit/508240f) |
| [distil-large-v3_turbo](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3_turbo)             | [2.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3_turbo/librispeech)       |      89.7 |             1510 | [Link](https://github.com/argmaxinc/WhisperKit/commit/508240f) |
| [distil-large-v3_turbo_600MB](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3_turbo_600MB) | [2.78](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3_turbo_600MB/librispeech) |      86.2 |              600 | [Link](https://github.com/argmaxinc/WhisperKit/commit/ae1cf96) |
| [small.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-small.en)                                       | [3.12](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small.en/librispeech)                    |      85.8 |              483 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [small](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-small)                                             | [3.45](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-small/librispeech)                       |      83   |              483 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [base.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-base.en)                                         | [3.98](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/librispeech)                     |      75.3 |              145 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [base](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-base)                                               | [4.97](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base/librispeech)                        |      67.2 |              145 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [tiny.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-tiny.en)                                         | [5.61](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/librispeech)                     |      63.9 |               66 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |
| [tiny](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-tiny)                                               | [7.47](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny/librispeech)                        |      52.5 |               66 | [Link](https://github.com/argmaxinc/WhisperKit/commit/228630c) |

## Dataset: `earnings22`
Long-form Audio (>1hr/clip) - 120 hours of earnings call recordings in English with various accents

|                                                                                                       | WER (↓)                                                                                                                   |   QoI (↑) |   File Size (MB) | Code Commit                                                    |
|:------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------|----------:|-----------------:|:---------------------------------------------------------------|
| large-v2 (WhisperOpenAIAPI)                                                                           | [16.27](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperOpenAIAPI/openai_whisper-large-v2/earnings22)  |     100   |             3100 | N/A                                                            |
| [large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-large-v3)               | [15.17](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-large-v3/earnings22)        |      58.5 |             3100 | [Link](https://github.com/argmaxinc/WhisperKit/commit/2846fd9) |
| [distil-large-v3](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/distil-whisper_distil-large-v3) | [15.28](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/distil-whisper_distil-large-v3/earnings22) |      46.3 |             1510 | [Link](https://github.com/argmaxinc/WhisperKit/commit/508240f) |
| [base.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-base.en)                 | [23.49](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-base.en/earnings22)         |       6.5 |              145 | [Link](https://github.com/argmaxinc/WhisperKit/commit/dda6571) |
| [tiny.en](https://hf.co/argmaxinc/whisperkit-coreml/tree/main/openai_whisper-tiny.en)                 | [28.64](https://hf.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/openai_whisper-tiny.en/earnings22)         |       5.7 |               66 | [Link](https://github.com/argmaxinc/WhisperKit/commit/dda6571) |


### Explanation

We believe that rigorously measuring the quality of inference is necessary for developers and
enterprises to make informed decisions when opting to use optimized or compressed variants of
any machine learning model in production. To contextualize `WhisperKit`, we take the following Whisper
implementations and benchmark them using a consistent evaluation harness:

Server-side:
- `WhisperOpenAIAPI`: [OpenAI's Whisper API](https://platform.openai.com/docs/guides/speech-to-text)

($0.36 per hour of audio as of 02/29/24, 25MB file size limit per request)

On-device:
- `WhisperKit`: Argmax's implementation [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L100) [[Repo]](https://github.com/argmaxinc/WhisperKit)
- `whisper.cpp`: A C++ implementation from ggerganov [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L212) [[Repo]](https://github.com/ggerganov/whisper.cpp)
- `WhisperMLX`: A Python implementation from Apple MLX [[Eval Harness]](https://github.com/argmaxinc/whisperkittools/blob/main/whisperkit/pipelines.py#L338) [[Repo]](https://github.com/ml-explore/mlx-examples/blob/main/whisper/whisper/transcribe.py)

(All on-device implementations are available for free under MIT license as of 03/19/2024)

`WhisperOpenAIAPI` sets the reference, and we assume that it uses the equivalent of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)
in float16 precision along with additional undisclosed optimizations from OpenAI. In all measurements, we care primarily about per-example no-regressions (quantified as `qoi` below),
which is a stricter metric than the dataset-average [Word Error Rate (WER)](https://en.wikipedia.org/wiki/Word_error_rate). A 100% `qoi` preserves perfect backwards compatibility on the test distribution and avoids "perceived regressions", the phenomenon
where known per-example behavior changes after a code/model update and causes divergence in downstream code or breaks the user experience itself (even if dataset averages stay flat
across updates). Pseudocode for `qoi`:

```python
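# `dataset`, `wer`, `optimized_model` and `reference_model` are placeholders.
qoi = []
for example in dataset:
    # An example counts as "no regression" when the optimized model's WER
    # is no worse than the reference model's WER on that same example.
    no_regression = wer(optimized_model(example)) <= wer(reference_model(example))
    qoi.append(no_regression)
# Percentage of examples without a regression.
qoi = (sum(qoi) / len(qoi)) * 100.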
```

Note that the ordering of models with respect to `WER` does not necessarily match the ordering with respect to `QoI`. This is because the reference model is assigned
a QoI of 100% by definition. Any per-example regression by another implementation is penalized, while per-example improvements are not rewarded. `QoI` (higher is better) matters
where the production behavior is established by the reference results and the goal is to not regress when switching to an optimized or compressed model. On the other hand,
`WER` (lower is better) matters when there is no established production behavior and one is picking the best quality versus model size trade-off point.
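
As a rough, self-contained illustration of why the two orderings can disagree, the sketch below computes both metrics on three made-up transcripts. The helper names (`word_error_rate`, `mean_wer`, `qoi_score`) and the toy data are assumptions for this example only; they are not the whisperkittools implementation, which may normalize text and aggregate WER differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / max(len(ref), 1)


def mean_wer(truths, outputs) -> float:
    """Mean of per-example WER (a simplification chosen for this toy example)."""
    return sum(word_error_rate(t, o) for t, o in zip(truths, outputs)) / len(truths)


def qoi_score(truths, reference_outputs, optimized_outputs) -> float:
    """Percentage of examples where the optimized model does not regress."""
    flags = [
        word_error_rate(t, opt) <= word_error_rate(t, ref)
        for t, ref, opt in zip(truths, reference_outputs, optimized_outputs)
    ]
    return 100.0 * sum(flags) / len(flags)


# Hypothetical ground truth and model outputs.
truths = ["the cat sat on the mat", "hello world", "a b c d"]
reference_outputs = ["the cat sat on the mat", "hello word", "a b c d"]
optimized_outputs = ["the cat sat on a mat", "hello world", "a b c d"]

reference_wer = mean_wer(truths, reference_outputs)
optimized_wer = mean_wer(truths, optimized_outputs)
score = qoi_score(truths, reference_outputs, optimized_outputs)
print(f"reference WER={reference_wer:.3f} optimized WER={optimized_wer:.3f} QoI={score:.1f}")
```

In this toy case the optimized model has the lower average WER, yet its QoI is below 100 because it regresses on the first example, and its improvement on the second example earns no credit.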

We anticipate that developers who use Whisper (or similar models) in production have their own Quality Assurance test sets. [whisperkittools](https://github.com/argmaxinc/whisperkittools) offers
the tooling necessary to run the same measurements on such custom test sets; please see [Model Evaluation on Custom Dataset](https://github.com/argmaxinc/whisperkittools) for details.

### Why are there so many Whisper versions?
WhisperKit is an SDK for building speech-to-text features in apps across a wide range of Apple devices. We are working towards abstracting away the model versioning from the developer so WhisperKit
"just works" by deploying the highest-quality model version that a particular device can execute. In the interim, we leave the choice to the developer by providing quality and size trade-offs.
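
Until that abstraction exists, one way to read the tables above is to fix a size budget and pick the best variant that fits. The sketch below does this for a subset of the `librispeech` rows. The `pick_variant` helper and the use of file size as a stand-in for "what a device can execute" are assumptions made for illustration; this is not WhisperKit's selection logic.

```python
from typing import NamedTuple, Optional, Sequence


class Row(NamedTuple):
    name: str
    wer: float    # lower is better
    qoi: float    # higher is better
    size_mb: int


# Subset of the `librispeech` table on this page.
LIBRISPEECH = [
    Row("large-v3_turbo", 2.03, 95.4, 3100),
    Row("large-v2_949MB", 2.4, 94.6, 949),
    Row("large-v2_turbo_955MB", 2.41, 94.6, 955),
    Row("large-v3_turbo_954MB", 2.47, 93.9, 954),
    Row("distil-large-v3_594MB", 2.96, 85.4, 594),
    Row("distil-large-v3_turbo_600MB", 2.78, 86.2, 600),
    Row("small.en", 3.12, 85.8, 483),
    Row("base.en", 3.98, 75.3, 145),
    Row("tiny.en", 5.61, 63.9, 66),
]


def pick_variant(rows: Sequence[Row], budget_mb: int, metric: str = "wer") -> Optional[Row]:
    """Return the best row within the size budget, or None if nothing fits."""
    candidates = [row for row in rows if row.size_mb <= budget_mb]
    if not candidates:
        return None
    if metric == "wer":
        return min(candidates, key=lambda row: row.wer)
    if metric == "qoi":
        return max(candidates, key=lambda row: row.qoi)
    raise ValueError(f"unknown metric: {metric!r}")


for budget in (1000, 500, 50):
    choice = pick_variant(LIBRISPEECH, budget)
    print(budget, choice.name if choice else "no variant fits")
```

Passing `metric="qoi"` instead selects for backwards compatibility with the reference model rather than for average accuracy, which can change the answer for the same budget.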


### Datasets
- [librispeech](https://huggingface.co/datasets/argmaxinc/librispeech): ~5 hours of short English audio clips, tests short-form transcription quality
- [earnings22](https://huggingface.co/datasets/argmaxinc/earnings22): ~120 hours of English audio clips from earnings calls with various accents, tests long-form transcription quality

### Reproducing Results
Benchmark results on this page were automatically generated by [whisperkittools](https://github.com/argmaxinc/whisperkittools) using our cluster of Apple Silicon Macs as self-hosted runners on
GitHub Actions. We periodically recompute these benchmarks as part of our CI pipeline. Due to [security concerns](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#hardening-for-self-hosted-runners),
we are unable to open up the cluster to the public. However, any Apple Silicon Mac (even with 8GB RAM) can be used to
run identical [evaluation jobs](#evaluation) locally. For reference, our M2 Ultra devices complete a `librispeech` + `openai/whisper-large-v3`
evaluation in under 1 hour regardless of the Whisper implementation. The oldest Apple Silicon Macs should take less than 1 day to complete the same evaluation.



### Glossary

- `_turbo`: Indicates the presence of additional optimizations (not compression) to unlock streaming transcription
as described in our [Blog Post](https://www.takeargmax.com/blog/whisperkit).

- `_*MB`: Indicates the presence of model compression. Instead of cluttering the filename with details like
`_AudioEncoder-5.8bits_TextDecoder-6.1bits_QLoRA-rank=16`, we choose to summarize the compression spec as the
resulting total file size since this is what matters to developers in production.
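
The two suffixes above can be read back out of a model folder name mechanically. The following is a minimal sketch that assumes the naming pattern seen in the tables on this page (an optional `_turbo` followed by an optional `_<size>MB`); `Variant` and `parse_variant` are hypothetical helpers, not part of WhisperKit or whisperkittools.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    base: str                         # name with the suffixes removed
    turbo: bool                       # True if the `_turbo` suffix is present
    compressed_size_mb: Optional[int]  # size from `_*MB`, None if uncompressed


_SIZE_SUFFIX = re.compile(r"_(\d+)MB$")


def parse_variant(name: str) -> Variant:
    """Split a folder name such as `openai_whisper-large-v2_turbo_955MB`."""
    size = None
    match = _SIZE_SUFFIX.search(name)
    if match:
        size = int(match.group(1))
        name = name[:match.start()]
    turbo = name.endswith("_turbo")
    if turbo:
        name = name[: -len("_turbo")]
    return Variant(base=name, turbo=turbo, compressed_size_mb=size)


for folder in (
    "openai_whisper-large-v2_turbo_955MB",
    "distil-whisper_distil-large-v3_594MB",
    "openai_whisper-small.en",
):
    print(parse_variant(folder))
```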