Vokan

Vokan is a StyleTTS2 fine-tune designed for expressiveness.


Vokan features:

  • A diverse dataset for more authentic zero-shot performance
  • Training on 6+ days' worth of audio from 672 diverse and expressive speakers
  • Training on 1x H100 for 300 hours, plus an additional 600 hours on 1x 3090

Audio Examples

Demo Spaces

Coming soon...
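
Until the demo Space is live, Vokan can be run locally like any other StyleTTS2 checkpoint. The sketch below is an unofficial example that assumes the third-party `styletts2` Python package (`pip install styletts2`); the checkpoint, config, and reference-audio filenames are placeholders, so swap in the files from this repository and your own reference clip.

```python
# Unofficial usage sketch: load a Vokan checkpoint through the community
# `styletts2` package (https://github.com/sidharthrajaram/StyleTTS2).
# All file paths below are placeholders, not files shipped with this card.
from styletts2 import tts

vokan = tts.StyleTTS2(
    model_checkpoint_path="path/to/vokan_checkpoint.pth",  # fine-tuned weights
    config_path="path/to/config.yml",                      # matching StyleTTS2 config
)

# Zero-shot cloning: the speaking style is taken from a short reference clip
# of the target speaker; no fine-tuning on that speaker is needed.
vokan.inference(
    "Vokan is a StyleTTS2 fine-tune designed for expressiveness.",
    target_voice_path="path/to/reference_speaker.wav",
    output_wav_file="vokan_output.wav",
)
```

The inference scripts from the original yl4579/StyleTTS2 repository should also work if you point their config at the Vokan weights.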

This model was made possible thanks to:

  • DagsHub, who sponsored us with GPU compute (special thanks to Dean!)
  • camenduru, for assistance with cloud infrastructure and model training

Discord

Citations

@misc{li2023styletts,
      title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
      author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
      year={2023},
      eprint={2306.07691},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@misc{zen2019libritts,
      title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
      author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
      year={2019},
      eprint={1904.02882},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",
The Centre for Speech Technology Research (CSTR), University of Edinburgh.

License

MIT

Stay tuned for Vokan V2!