---
license: mit
tags:
- vits
- vits istft
- istft
pipeline_tag: text-to-speech
---

# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

VITS is an end-to-end speech synthesis model that predicts a speech waveform conditioned on an input text sequence. It is a conditional variational autoencoder (VAE) composed of a posterior encoder, a decoder, and a conditional prior.

This repository contains the weights for the official VITS checkpoint trained on the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset.

# VITS ISTFT

The new decoder synthesizes speech as natural as that synthesized by VITS while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than the original VITS. This makes it suitable for real-time and edge-device applications.

| Checkpoint | Train Hours | Speakers |
|------------|-------------|----------|
| [ljspeech_vits_ms_istft](https://huggingface.co/anhnct/ljspeech_vits_ms_istft) | 24 | 1 |
| [ljspeech_vits_mb_istft](https://huggingface.co/anhnct/ljspeech_vits_mb_istft) | 24 | 1 |
| [ljspeech_vits_istft](https://huggingface.co/anhnct/ljspeech_vits_istft) | 24 | 1 |

## Usage

To use this checkpoint, first install the latest versions of `transformers` and `accelerate`:

```
pip install --upgrade transformers accelerate
```

Then run inference with the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np

# trust_remote_code=True is needed because the model class ships with the checkpoint repository
model = AutoModel.from_pretrained("anhnct/ljspeech_vits_mb_istft", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("anhnct/ljspeech_vits_mb_istft")

text = "Hey, it's Hugging Face on the phone"
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    output = model(**inputs).waveform
```

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile  # import the submodule explicitly; a bare `import scipy` may not load it

# Drop the batch dimension so the data is a 1-D array of samples
data_np = output.numpy()
data_np_squeezed = np.squeeze(data_np)
scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=data_np_squeezed)
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

Audio(data_np_squeezed, rate=model.config.sampling_rate)
```

## License

The model is licensed as [**MIT**](https://github.com/jaywalnut310/vits/blob/main/LICENSE).
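
## Measuring the real-time factor

The real-time factor (RTF) quoted above is the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster than real time. The sketch below shows one way it could be estimated on your own hardware. The helper `real_time_factor` and the stand-in `fake_synthesize` are hypothetical names introduced for illustration and are not part of this repository; the figure you obtain will depend on your CPU, the input text, and the number of runs.

```python
import time


def real_time_factor(synthesize, sampling_rate, n_runs=3):
    """Return the mean (synthesis time / audio duration) over n_runs calls.

    `synthesize` is any zero-argument callable returning a 1-D sequence
    of audio samples (hypothetical interface, chosen for this sketch).
    """
    ratios = []
    for _ in range(n_runs):
        start = time.perf_counter()
        samples = synthesize()
        elapsed = time.perf_counter() - start
        audio_seconds = len(samples) / sampling_rate
        ratios.append(elapsed / audio_seconds)
    return sum(ratios) / len(ratios)


# Stand-in synthesizer so the sketch runs without downloading a model:
# it "generates" one second of silence at 22050 Hz.
def fake_synthesize():
    return [0.0] * 22050


rtf = real_time_factor(fake_synthesize, sampling_rate=22050)
print(f"RTF: {rtf:.6f}")
```

To measure the actual checkpoint, one could pass a callable that runs `model(**inputs).waveform` under `torch.no_grad()` and returns the squeezed 1-D array, together with `model.config.sampling_rate`. Running one warm-up call before timing is usually sensible, since the first forward pass tends to be slower.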