This VITS model was trained on the KTH/hungarian-single-speaker-tts dataset.

CSS10 Hungarian: Single Speaker Speech Dataset

The corpus features a single speaker, with 4515 segments extracted from a single LibriVox audiobook. It contains about 10 hours of audio data.

Training

The model was trained on a single RTX 3090 GPU for 3 days (200K steps) with a batch size of 16. We saved some checkpoints together with the optimizer states, so the model can be trained further. However, we didn't find any noticeable improvement after step 150K.
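
To put the step counts in perspective, the figures above can be converted into approximate passes over the dataset. This is only a rough sketch: it assumes all 4515 segments were used for training (no held-out validation split), which the description above does not confirm, and the helper name `epochs_seen` is made up for illustration.

```python
# Rough arithmetic from the numbers stated in this card.
STEPS_TOTAL = 200_000      # total training steps
STEPS_PLATEAU = 150_000    # step after which no noticeable effect was seen
BATCH_SIZE = 16
NUM_SEGMENTS = 4515        # ASSUMPTION: every segment is in the training split
TRAINING_DAYS = 3


def epochs_seen(steps: int, batch_size: int = BATCH_SIZE,
                num_segments: int = NUM_SEGMENTS) -> float:
    """Approximate number of passes over the dataset after `steps` steps."""
    return steps * batch_size / num_segments


epochs_total = epochs_seen(STEPS_TOTAL)
epochs_plateau = epochs_seen(STEPS_PLATEAU)
steps_per_second = STEPS_TOTAL / (TRAINING_DAYS * 24 * 60 * 60)

print(f"epochs after {STEPS_TOTAL} steps: {epochs_total:.1f}")
print(f"epochs after {STEPS_PLATEAU} steps: {epochs_plateau:.1f}")
print(f"average throughput: {steps_per_second:.2f} steps/s")
```

Under that assumption, the plateau at step 150K corresponds to several hundred passes over the roughly 10 hours of audio.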

Usage

The model diana_final.pth can be used with JayWalnut's git repo, but you have to modify the text/cleaners.py file so that it contains our hungarian_cleaners function. We provide the necessary files in our repo.
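
For orientation, a text cleaner in this kind of setup is typically a plain function that takes a string and returns a normalized string. The block below is a hypothetical minimal stand-in, not the actual hungarian_cleaners from our repo: it assumes the cleaner only lowercases the text and collapses whitespace, keeping Hungarian accented letters intact instead of transliterating them to ASCII. The real function may do more (for example number or abbreviation expansion), so use the provided files for inference.

```python
import re

# Matches any run of whitespace (spaces, tabs, newlines).
_whitespace_re = re.compile(r"\s+")


def hungarian_cleaners(text: str) -> str:
    """HYPOTHETICAL sketch of a Hungarian cleaner, not the released one.

    Lowercases the input and collapses whitespace, deliberately leaving
    accented characters such as 'á', 'ő', and 'ű' unchanged.
    """
    text = text.lower()
    text = _whitespace_re.sub(" ", text).strip()
    return text


if __name__ == "__main__":
    print(hungarian_cleaners("  ÁRVÍZTŰRŐ   Tükörfúrógép \n"))
```

A function like this would sit next to the existing cleaners in text/cleaners.py. How the cleaner is selected (for instance by name in the model's config file) depends on the repo version, so check the config shipped with the checkpoint.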

