---
license: apache-2.0
library_name: transformers.js
language:
- en
base_model:
- hexgrad/Kokoro-82M
pipeline_tag: text-to-speech
---

# Kokoro TTS

Kokoro is a frontier text-to-speech (TTS) model for its size: 82 million parameters, text in and audio out.

## Table of contents

- [Samples](#samples)
- [Usage](#usage)
  - [JavaScript](#javascript)
  - [Python](#python)
- [Quantizations](#quantizations)

## Samples


> Life is like a box of chocolates. You never know what you're gonna get.


| Voice                    | Nationality | Gender | Sample                                                                                                                                  |
|--------------------------|-------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------|
| Default (`af`)           | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/C0_ZUcNSAxvMwpS8QbnKv.wav"></audio> |
| Bella (`af_bella`)       | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/B_q15Z_FXdgBP9-Hk9oKq.wav"></audio> |
| Nicole (`af_nicole`)     | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/sS8U5lQHkhgX7rwTmy-5w.wav"></audio> |
| Sarah (`af_sarah`)       | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/SokkBiqEqwxLLx_pqvf1p.wav"></audio> |
| Sky (`af_sky`)           | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/IzySGHUtl5mYeFxx1oaRf.wav"></audio> |
| Adam (`am_adam`)         | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/9n6myE6--ZsEuF5xDv5eC.wav"></audio> |
| Michael (`am_michael`)   | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/EPFciGtTU1YUXu8MAw7DX.wav"></audio> |
| Emma (`bf_emma`)         | British     | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/AGEsXs-gyJq3dsyo7PjHo.wav"></audio> |
| Isabella (`bf_isabella`) | British     | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/JEzrrXYJSDcmlEzI7tE0c.wav"></audio> |
| George (`bm_george`)     | British     | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/nsv4zKB4MX2TvXRxv504k.wav"></audio> |
| Lewis (`bm_lewis`)       | British     | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/g_mcBl2xTbQl0sbrpZt48.wav"></audio> |


## Usage

### JavaScript

First, install the `kokoro-js` library from [NPM](https://npmjs.com/package/kokoro-js) using:
```bash
npm i kokoro-js
```

You can then generate speech as follows:

```js
import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
await audio.save("audio.wav");
```


### Python

```python
import os
import numpy as np
from onnxruntime import InferenceSession

# Tokens produced by phonemize() and tokenize() in kokoro.py
tokens = [50, 157, 43, 135, 16, 53, 135, 46, 16, 43, 102, 16, 56, 156, 57, 135, 6, 16, 102, 62, 61, 16, 70, 56, 16, 138, 56, 156, 72, 56, 61, 85, 123, 83, 44, 83, 54, 16, 53, 65, 156, 86, 61, 62, 131, 83, 56, 4, 16, 54, 156, 43, 102, 53, 16, 156, 72, 61, 53, 102, 112, 16, 70, 56, 16, 138, 56, 44, 156, 76, 158, 123, 56, 16, 62, 131, 156, 43, 102, 54, 46, 16, 102, 48, 16, 81, 47, 102, 54, 16, 54, 156, 51, 158, 46, 16, 70, 16, 92, 156, 135, 46, 16, 54, 156, 43, 102, 48, 4, 16, 81, 47, 102, 16, 50, 156, 72, 64, 83, 56, 62, 16, 156, 51, 158, 64, 83, 56, 16, 44, 157, 102, 56, 16, 44, 156, 76, 158, 123, 56, 4]

# Context length is 512, but leave room for the pad token 0 at the start & end
assert len(tokens) <= 510, len(tokens)

# Style vector based on len(tokens), ref_s has shape (1, 256)
voices = np.fromfile('./voices/af.bin', dtype=np.float32).reshape(-1, 1, 256)
ref_s = voices[len(tokens)]

# Add the pad ids, and reshape tokens, should now have shape (1, <=512)
tokens = [[0, *tokens, 0]]

# Options: model.onnx, model_fp16.onnx, model_quantized.onnx, model_q8f16.onnx,
# model_uint8.onnx, model_uint8f16.onnx, model_q4.onnx, model_q4f16.onnx
model_name = 'model.onnx'
sess = InferenceSession(os.path.join('onnx', model_name))

audio = sess.run(None, dict(
    input_ids=tokens,
    style=ref_s,
    speed=np.ones(1, dtype=np.float32),
))[0]
```
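
The length check, style-vector lookup, and padding steps above can be grouped into one helper. The following is a minimal sketch, not part of the original example: `prepare_inputs`, `MAX_CONTEXT`, and `PAD_ID` are hypothetical names, and `voices` is assumed to be any sequence indexed by the unpadded token count, such as the array loaded from `./voices/af.bin`. Stand-in data is used so the sketch runs without the model or voice files.

```python
MAX_CONTEXT = 512  # context length stated in the example above
PAD_ID = 0         # pad token placed at the start and end


def prepare_inputs(tokens, voices):
    """Return (input_ids, style) for a single utterance.

    `voices` is any sequence indexed by the unpadded token count.
    """
    max_tokens = MAX_CONTEXT - 2  # reserve two slots for the pad tokens
    if len(tokens) > max_tokens:
        raise ValueError(
            f"got {len(tokens)} tokens, at most {max_tokens} are allowed"
        )
    # The style vector is selected from the token count before padding.
    style = voices[len(tokens)]
    # Batch of one sequence: shape (1, len(tokens) + 2).
    input_ids = [[PAD_ID, *tokens, PAD_ID]]
    return input_ids, style


# Stand-in for the voice array (assumption: strings instead of vectors).
dummy_voices = [f"style_{n}" for n in range(MAX_CONTEXT - 1)]
input_ids, style = prepare_inputs([50, 157, 43], dummy_voices)
print(input_ids, style)
```

Raising `ValueError` instead of using `assert` keeps the length check active when Python runs with optimizations enabled.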

Optionally, save the audio to a file:
```python
import scipy.io.wavfile as wavfile

# 24000 is the sample rate in Hz
wavfile.write('audio.wav', 24000, audio[0])
```
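
If SciPy is not available, the same file can be written with Python's standard library. This is a minimal sketch under stated assumptions: `write_wav_pcm16` is a hypothetical helper, and the samples are assumed to be mono floats in the range [-1.0, 1.0]. It converts them to 16-bit PCM at the same 24000 Hz rate used above, and is demonstrated on a synthetic tone rather than model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # same rate as the scipy example above


def write_wav_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM."""
    frames = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to the valid range
        frames += struct.pack("<h", int(round(x * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Synthetic 440 Hz tone, 0.1 s long, standing in for model output.
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 10)
]
write_wav_pcm16("tone_pcm16.wav", tone)
```

With the `audio` array from the inference example, the call would be `write_wav_pcm16("audio.wav", audio[0])`, provided the float-range assumption holds for the model output.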

## Quantizations

The model is resilient to quantization, enabling efficient high-quality speech synthesis at a fraction of the original model size. 

> How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born.


| Model                                          | Size (MB) | Sample                                                                                                                                  |
|------------------------------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------|
| model.onnx (fp32)                              | 326       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/njexBuqPzfYUvWgs9eQ-_.wav"></audio> |
| model_fp16.onnx (fp16)                         | 163       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/8Ebl44hMQonZs4MlykExt.wav"></audio> |
| model_quantized.onnx (8-bit)                   | 92.4      | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/9SLOt6ETclZ4yRdlJ0VIj.wav"></audio> |
| model_q8f16.onnx (Mixed precision)             | 86        | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/gNDMqb33YEmYMbAIv_Grx.wav"></audio> |
| model_uint8.onnx (8-bit & mixed precision)     | 177       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/tpOWRHIWwEb0PJX46dCWQ.wav"></audio> |
| model_uint8f16.onnx (Mixed precision)          | 114       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/vtZhABzjP0pvGD7dRb5Vr.wav"></audio> |
| model_q4.onnx (4-bit matmul)                   | 305       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/8FVn0IJIUfccEBWq8Fnw_.wav"></audio> |
| model_q4f16.onnx (4-bit matmul & fp16 weights) | 154       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/7DrgWC_1q00s-wUJuG44X.wav"></audio> |
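
To make "a fraction of the original model size" concrete, the short sketch below computes each file's size relative to the fp32 model. The sizes are copied from the table above; the variable names are illustrative only.

```python
# Sizes in MB, copied from the table above.
sizes_mb = {
    "model.onnx": 326,
    "model_fp16.onnx": 163,
    "model_quantized.onnx": 92.4,
    "model_q8f16.onnx": 86,
    "model_uint8.onnx": 177,
    "model_uint8f16.onnx": 114,
    "model_q4.onnx": 305,
    "model_q4f16.onnx": 154,
}

baseline = sizes_mb["model.onnx"]  # fp32 reference
relative = {name: size / baseline for name, size in sizes_mb.items()}

# Print the variants from smallest to largest.
for name, ratio in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} {sizes_mb[name]:>6} MB  {ratio:6.1%} of fp32")
```

By these numbers, the smallest file (`model_q8f16.onnx`) is roughly a quarter of the fp32 size, while `model_q4.onnx` remains above 90% of it.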