[FAQ] Alternatives to Finetuning Kokoro
A very frequently asked question is how to finetune Kokoro, i.e. continue training the checkpoint uploaded in this repository. This is currently not feasible: for a number of reasons, the additional models required to facilitate this have not yet been open-sourced.
However, there are a few alternatives available to you in the meantime (staying in the realm of open source speech models). Please be aware that there are varying degrees of difficulty to each option, and some could be unsuitable depending on your technical abilities and/or requirements.
1. Use a Speech-to-Speech model like RVC
First, generate the speech using Kokoro. Then pipe the TTS output into RVC-Project/Retrieval-based-Voice-Conversion-WebUI or a similar speech-to-speech model (Beatrice v2, see also w-okada/voice-changer).
Pros: There are many pre-trained RVC models readily available (search "rvc models"), and you can also train your own RVC model.
Cons: You have to run a separate model after TTS, which adds latency and increases the inference-time compute footprint. Results will also probably fall short of a true base model finetune.
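The two-stage flow described in option 1 can be sketched roughly as below. This is only an illustration under stated assumptions: `placeholder_tts` and `placeholder_vc` are hypothetical stand-ins, not real Kokoro or RVC APIs, and the 24 kHz mono 16-bit PCM format is assumed for the example. Swap in whatever inference calls your own Kokoro and voice-conversion setup exposes. The per-stage timings are there to make the latency cost mentioned under Cons visible.

```python
import io
import math
import struct
import time
import wave
from typing import Callable, Dict, Tuple

SAMPLE_RATE = 24_000  # assumption: 24 kHz mono 16-bit PCM; match your own models


def placeholder_tts(text: str) -> bytes:
    """Hypothetical stand-in for the Kokoro step: emits 0.1 s of tone per word."""
    n = (SAMPLE_RATE // 10) * max(1, len(text.split()))
    samples = (
        int(8000 * math.sin(2 * math.pi * 220 * i / SAMPLE_RATE)) for i in range(n)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def placeholder_vc(pcm: bytes) -> bytes:
    """Hypothetical stand-in for the RVC step: returns the audio unchanged."""
    return pcm


def synthesize_then_convert(
    text: str,
    tts: Callable[[str], bytes],
    vc: Callable[[bytes], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run TTS, feed its output to voice conversion, and time each stage."""
    t0 = time.perf_counter()
    tts_pcm = tts(text)
    t1 = time.perf_counter()
    converted = vc(tts_pcm)
    t2 = time.perf_counter()
    timings = {"tts_s": t1 - t0, "vc_s": t2 - t1, "total_s": t2 - t0}
    return converted, timings


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container using the stdlib only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    audio, timings = synthesize_then_convert(
        "hello world", placeholder_tts, placeholder_vc
    )
    wav = pcm_to_wav_bytes(audio)
    print(f"{len(wav)} WAV bytes, timings: {timings}")
```

Passing the two stages in as callables keeps the chaining logic independent of any particular model, so the voice-conversion step can be swapped or removed without touching the rest.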
2. Train your own StyleTTS 2 Model
Kokoro v0.19 was trained on relatively little data and transparently uses a StyleTTS 2 architecture, so if you have proficiency in training models and the required compute, you can train your own from the public checkpoints. Here are some resources:
- StyleTTS 2 repo: https://github.com/yl4579/StyleTTS2
- Notes on Finetuning: https://github.com/yl4579/StyleTTS2/discussions/81
- Colab notebooks: https://github.com/yl4579/StyleTTS2/tree/main/Colab
It is also possible to train StyleTTS 2 models in other languages, although this can be more difficult than English because of tokenization & g2p and/or data procurement:
- @Respair has done Japanese: https://huggingface.co/spaces/Respair/Tsukasa_Speech
- @patriotyk has done Ukrainian: https://huggingface.co/spaces/patriotyk/styletts2-ukrainian
- I have heard (of) models in: Korean, German, French, Italian, Spanish, Persian
Pros: Full customizability and ownership over your trained model.
Cons: Requires compute, data, and technical skills.
3. Zero-shot or train a different TTS architecture
In no particular order, here are some links to other open-source TTS models (although not all are permissive):
- XTTS v2: https://hf.co/coqui/XTTS-v2
- MaskGCT: https://hf.co/amphion/MaskGCT
- E2/F5-TTS: https://github.com/SWivid/F5-TTS
- GPT-SoVITS: https://github.com/RVC-Boss/GPT-SoVITS
- Fish Speech: https://hf.co/fishaudio/fish-speech-1.5
- Piper TTS: https://github.com/rhasspy/piper
Pros: Training may not be required, or in some cases if the base models were trained on more data, less training may be required to finetune.
Cons: Licenses, parameter counts, and resulting output quality vary.
"Hi, the styleTTS requires diffusion, while Kokoros does not. Does this indicate that there are some differences in their architectures?"
Regardless of whether the additional models required to facilitate finetuning have been open-sourced, please share the process, so we can make the decision for ourselves about the legal applicability / fair use of the model licenses in our individual jurisdictions, rather than making your own legal judgment on our behalf.
"Hi, the styleTTS requires diffusion, while Kokoros does not. Does this indicate that there are some differences in their architectures?"
@MonolithFoundation Yes, Kokoro quite transparently omits the style diffusion element of StyleTTS2, as I personally do not believe it is worth the ~25M additional parameters, but I could be wrong about that.
Regardless of whether the additional models required to facilitate finetuning have been open-sourced, please share the process, so we can make the decision for ourselves about the legal applicability / fair use of the model licenses in our individual jurisdictions, rather than making your own legal judgment on our behalf.
@erichartford Kokoro's data provenance is already addressed in https://hf.co/hexgrad/Kokoro-82M#training-details. Due to the reasoning in https://hf.co/hexgrad/Kokoro-82M/discussions/21#67814dc92af1d47cdd6ac407 I likely cannot be more specific than that. I am not a lawyer, and I do not make legal judgments on your behalf. I simply provide the facts as they are, to the extent that I can. The model is also licensed under Apache 2.0, and it does not really get more permissive than that. You are always free to not use the model if you have any reservations.
My question is not about licensing.
I am asking you to give us the procedure to finetune the model.
You cite upstream models' licensing concerns as the reason you won't provide information about how to finetune the model.
But those licenses don't have any clauses that prevent you from sharing the information about how to finetune your model.
Yeah, I second the request for the model's training procedure.
I believe that if the training is open-sourced, we can build an even better Kokoro-series model.
One person can go fast, but a group of people can go further.
Does local deployment only support English?