Philosophy

#5
by hexgrad - opened

Discord Server

There is now a Discord server for Kokoro at https://discord.gg/QuGxSWBfQy. Consider joining to discuss, collaborate, give feedback, and more.

What is Kokoro?

Kokoro is a series of TTS models that seeks the efficient voice frontier, maximizing Elo rating for a given model size (parameter count).

What is Elo rating?

In essence, Elo rating measures direct head-to-head preference. Given two Elo ratings, one can calculate an estimated winrate for a matchup. Elo is by no means a perfect metric nor is it the only one, but it has a nice interpretation and can be calculated across both open and closed models. Refer to the Wikipedia page for more.
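The winrate estimate mentioned above follows from the standard Elo formula. A minimal sketch (the ratings here are made up for illustration):

```python
def expected_winrate(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Equal ratings imply a coin flip.
print(expected_winrate(1500, 1500))  # 0.5

# A 400-point edge implies roughly a 10-to-1 expected winrate.
print(expected_winrate(1600, 1200))  # ~0.909
```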

Synthetic Data Selection and Contribution

Kokoro's training mix heavily favors synthetic data, and all training data must be permissive/non-copyrighted (refer to the Data section of Training Details). This is a deliberate choice designed to maximize everyone's value out of the permissive Apache 2.0 license. Community contributions to this synthetic data mix are welcomed; more on this in a separate post.

What are Voicepacks?

Voicepacks are inputs to the model that allow you to generate speech in a particular voice. For one model, there are many Voicepacks. If a model is like a phone, then Voicepacks are the apps you download to your phone.

Where is Voice Cloning?

Temper your expectations about voice cloning landing in Kokoro in the near future. The audio encoder is currently unreleased, but more saliently, the encoder struggles to generalize to OOD (out of distribution) voices anyway, so you are already seeing the best performance from the model in the form of Voicepacks.

I believe voice cloning requires training on more data, which is currently difficult for a few reasons. Consider the two objectives for Kokoro outlined above:

  1. Maximize Elo, minimize param count
  2. Training data must be permissive/non-copyrighted

If you cast a wider net and train on more data, you may break the latter constraint.

But if you stay entirely permissive/non-copyrighted by relying on large (but lower-quality) open datasets like Common Voice or LibriLight, you might drop your Elo rating, since the model learns noise and suboptimal audio patterns.

This is a dilemma that is likely best solved by large, synthetic datasets.

There is a third, implicit constraint: compute cost. For now it can be eaten, because both the datasets and the model are small, but it is something to keep an eye on if/when dataset size grows to the thousands of hours. Compute does not grow on trees, but there are reasonable pathways to obtaining more compute should the need arise.

Voice Interpolation

Even though you cannot currently clone voices, you can interpolate between different Voicepacks. The default voice af is an interpolation between Bella and Sarah, as described (with code) in the Evaluation section. This will become more relevant if/when more Voicepacks are released.
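The code for producing af lives in the Evaluation section; as a rough illustration, interpolation amounts to a weighted average of two voicepack embeddings. A minimal sketch, assuming voicepacks load as numeric arrays (the shapes and names below are hypothetical stand-ins, not the released tensors):

```python
import numpy as np

def interpolate_voicepacks(pack_a, pack_b, alpha=0.5):
    """Linear blend of two voicepack embeddings; alpha=0.5 is an even mix."""
    pack_a, pack_b = np.asarray(pack_a, dtype=float), np.asarray(pack_b, dtype=float)
    return alpha * pack_a + (1 - alpha) * pack_b

# Hypothetical stand-ins for the Bella and Sarah voicepack tensors.
bella = np.random.randn(1, 256)
sarah = np.random.randn(1, 256)
blended = interpolate_voicepacks(bella, sarah, alpha=0.5)
```

Varying alpha between 0 and 1 sweeps the blend from one voice to the other.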

These notes are incomplete and may be edited if/when I find more time to write, but the basic philosophy should remain consistent over time. Any large philosophical changes will be signposted appropriately.
