Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training

Published on Sep 3, 2024

Authors:

Abstract

One-shot voice conversion(VC) aims to change the timbre of any source speech to match that of the target speaker with only one speech sample. Existing style transfer-based VC methods relied on speech representation disentanglement and suffered from accurately and independently encoding each speech component and recomposing back to converted speech effectively. To tackle this, we proposed Pureformer-VC, which utilizes Conformer blocks to build a disentangled encoder, and Zipformer blocks to build a style transfer decoder as the generator. In the decoder, we used effective styleformer blocks to integrate speaker characteristics effectively into the generated speech. The models used the generative VAE loss for encoding components and triplet loss for unsupervised discriminative training. We applied the styleformer method to Zipformer's shared weights for style transfer. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.

View arXiv page Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2409.01668 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2409.01668 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2409.01668 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.