Vokan

Vokan is a StyleTTS2 fine-tune designed for expressiveness.


Vokan features:

  • A diverse dataset for more authentic zero-shot performance
  • Training on 6+ days' worth of audio from 672 diverse and expressive speakers
  • Training on 1x H100 for 300 hours, plus an additional 600 hours on 1x 3090

Audio Examples

Demo Spaces

Coming soon...
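
Until the demo Space is live, Vokan can be run locally like any other StyleTTS2 checkpoint. The sketch below is an unofficial example that assumes the third-party `styletts2` Python package (`pip install styletts2`); the checkpoint, config, and reference-audio filenames are placeholders, so swap in the files from this repository and your own reference clip.

```python
# Unofficial usage sketch: load a Vokan checkpoint through the community
# `styletts2` package (https://github.com/sidharthrajaram/StyleTTS2).
# All file paths below are placeholders, not files shipped with this card.
from styletts2 import tts

vokan = tts.StyleTTS2(
    model_checkpoint_path="path/to/vokan_checkpoint.pth",  # fine-tuned weights
    config_path="path/to/config.yml",                      # matching StyleTTS2 config
)

# Zero-shot cloning: the speaking style is taken from a short reference clip
# of the target speaker; no fine-tuning on that speaker is needed.
vokan.inference(
    "Vokan is a StyleTTS2 fine-tune designed for expressiveness.",
    target_voice_path="path/to/reference_speaker.wav",
    output_wav_file="vokan_output.wav",
)
```

The inference scripts from the original yl4579/StyleTTS2 repository should also work if you point their config at the Vokan weights.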

This model was made possible thanks to:

  • DagsHub, who sponsored us with GPU compute (special thanks to Dean!)
  • camenduru, for assistance with cloud infrastructure and model training

Discord

Citations

@misc{li2023styletts,
      title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
      author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
      year={2023},
      eprint={2306.07691},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@misc{zen2019libritts,
      title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
      author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
      year={2019},
      eprint={1904.02882},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR), University of Edinburgh.

License

MIT

Stay tuned for Vokan V2!