# Priorities

## Owner Priorities
Train the v1 model:
- Detach completely from `espeak-ng`, which requires rebuilding g2p / tokenization from first principles (see the sketch after this list)
- Migrate the training set towards MLLM synthetic audio (4o, Gemini 2 Flash), instead of synthetic audio from traditional TTS models
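For a concrete sense of what rebuilding g2p from first principles could involve, here is a minimal, hypothetical sketch: a lexicon lookup with a naive fallback for out-of-vocabulary words. Nothing here (`LEXICON`, `g2p`) exists in the repo; a real replacement would need a full pronunciation dictionary plus a trained model for unknown words.

```python
# Hypothetical sketch only: a lexicon-based g2p with a naive fallback,
# standing in for the espeak-ng phonemizer. None of this is repo code.
LEXICON = {
    "hello": "həlˈoʊ",
    "world": "wˈɜːld",
}

def g2p(text: str) -> str:
    """Map space-separated words to IPA phoneme strings."""
    phonemes = []
    for word in text.lower().split():
        # Spell out unknown words letter by letter; a real system would
        # use a trained seq2seq model for out-of-vocabulary words.
        phonemes.append(LEXICON.get(word, " ".join(word)))
    return " ".join(phonemes)

print(g2p("hello world"))  # həlˈoʊ wˈɜːld
```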
If both of the above objectives are completed and there is a successful training run, the resulting model would likely earn the v1 name; however, this is not a guarantee, and v1 has no scheduled release date yet. If you want to help with the second objective of growing the synthetic dataset, consider joining the Discord server.
Because my attention is focused on the priorities above, I may not have the bandwidth to address the issues below. Contributions are welcome, including those not listed.
## Top Priority
- Enable FP16 Inference
## Quality of Life
- Long-form inference code, so generations don't truncate after 510 tokens. There is already MIT-licensed long-form inference code in the hosted demo, but I have not had a chance to port it over yet, and it can probably still be improved upon. (A rough sketch of the chunking approach follows this list.)
- pip installable package
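To illustrate the long-form item above, here is a minimal sketch of the usual chunking approach: split text at sentence boundaries, pack sentences into chunks under the token limit, generate each chunk, and concatenate the audio. The `tokenize` and `generate` callables are placeholders for the repo's actual tokenizer and inference call, and the demo's MIT-licensed implementation is likely smarter about prosody at chunk boundaries.

```python
import re
import numpy as np

TOKEN_LIMIT = 510  # generations truncate past this many tokens

def chunk_text(text: str, tokenize) -> list[str]:
    """Greedily pack sentences into chunks that stay under TOKEN_LIMIT."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(tokenize(candidate)) > TOKEN_LIMIT:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than TOKEN_LIMIT will still exceed
    # it; a real implementation would also split within sentences.
    return chunks

def generate_long_form(text: str, tokenize, generate) -> np.ndarray:
    """Run inference per chunk and concatenate the resulting audio."""
    audio = [generate(chunk) for chunk in chunk_text(text, tokenize)]
    return np.concatenate(audio)
```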
## Long Term
- Crowdsourced data collection (more on this in a separate post)
- Explore other architectures. StyleTTS-ZS could warrant attention on the strength of the author's name alone, but it appears to have been abandoned, and I have not had time to unpack the notebooks yet.
## Third Party / Arenas

- `llama.cpp` support, similar to https://github.com/ggerganov/llama.cpp/pull/10784
- https://artificialanalysis.ai/text-to-speech/arena
- https://hf.co/spaces/TTS-AGI/TTS-Arena
- https://hf.co/spaces/Pendrokar/TTS-Spaces-Arena
Again, this list is not comprehensive and will likely evolve over time, so feel free to contribute in ways not specified above if you think it would be helpful.
Hi, any plans to port the code to C++ and make Kokoro.cpp via Georgi Gerganov's GGML lib, for more speed, memory efficiency, and no dependence on bloated Python libs?

Also, model quant support?
I clarified where my own priorities currently lie under "Owner Priorities".
This would be nice, but I do not have the bandwidth to do this:
> Hi, any plans to port the code to C++ and make Kokoro.cpp via Georgi Gerganov's GGML lib, for more speed, memory efficiency, and no dependence on bloated Python libs?
Consider opening an issue in `llama.cpp` if you think it would be appropriate. I saw this was done recently for OuteTTS in https://github.com/ggerganov/llama.cpp/pull/10784, so maybe it can be done for Kokoro models as well. As mentioned elsewhere, the inference code is deliberately thinned relative to the full StyleTTS2 to (hopefully) improve readability.
> Also, model quant support?
Have to walk before you can run. FP32 inference works, but I have not cracked FP16 inference yet, which is the top priority (after v1 model training). I think FP16 inference should be achievable, because as you can see in the linked issue, the generated samples sound fine when we run inference against a half-precision, 160 MB model file.
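For anyone who wants to pick up the FP16 item, here is a minimal sketch of the basic experiment, using a toy module as a stand-in for the actual model and assuming a CUDA device:

```python
import torch
import torch.nn as nn

# Toy stand-in for the model; the point is the precision handling, not
# the architecture.
model = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))
model = model.half().eval().cuda()

x = torch.randn(1, 128, device="cuda", dtype=torch.float16)

# Straight FP16 inference. When this kind of naive .half() run breaks,
# the usual suspects are ops that overflow/underflow in half precision
# (normalization layers, the vocoder's iSTFT); those may need to be
# pinned to FP32 while the rest of the network stays in FP16.
with torch.inference_mode():
    y = model(x)
print(y.dtype)  # torch.float16
```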
> - pip installable package
Instead of that, you could also implement the inference pipeline using the transformers API.
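For reference, transformers ships a generic text-to-speech pipeline. The sketch below shows that usage with an existing TTS model that already has transformers support (`suno/bark-small` is used here only as an example); Kokoro would need a compatible model class and config before it could be loaded this way, so this is purely illustrative of the suggestion.

```python
from transformers import pipeline
import soundfile as sf

# Illustrative only: Bark already has transformers support; Kokoro does
# not, and would need a model class and config contributed upstream.
tts = pipeline("text-to-speech", model="suno/bark-small")

result = tts("Hello from a generic text-to-speech pipeline.")
# The pipeline returns a dict with the waveform and its sampling rate.
sf.write("out.wav", result["audio"].squeeze(), result["sampling_rate"])
```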