arxiv:2412.00319

Improving speaker verification robustness with synthetic emotional utterances

Published on Nov 30, 2024
· Submitted by amanchadha on Dec 3, 2024
Abstract

A speaker verification (SV) system offers an authentication service designed to confirm whether a given speech sample originates from a specific speaker. This technology has paved the way for various personalized applications that cater to individual preferences. A noteworthy challenge faced by SV systems is their ability to perform consistently across the emotional spectrum. Most existing models exhibit higher error rates on emotional utterances than on neutral ones, and consequently often miss speech of interest. This issue primarily stems from the limited availability of labeled emotional speech data, which impedes the development of robust speaker representations that encompass diverse emotional states. To address this concern, we propose a novel approach employing the CycleGAN framework as a data augmentation method. This technique synthesizes emotional speech segments for each specific speaker while preserving the unique vocal identity. Our experimental findings underscore the effectiveness of incorporating synthetic emotional data into the training process. Models trained on this augmented dataset consistently outperform the baseline models on the task of verifying speakers in emotional speech scenarios, reducing the equal error rate by as much as 3.64% relative.
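The abstract reports results in terms of equal error rate (EER), the standard SV metric: the operating point at which the false-accept rate (impostors accepted) equals the false-reject rate (genuine speakers rejected). A minimal sketch of how EER can be estimated from verification scores is below; the score values are invented for illustration and are not from the paper.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate EER by sweeping a decision threshold over all observed
    scores and finding where false-accept and false-reject rates meet."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        frr = np.mean(genuine_scores < t)    # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical cosine-similarity scores (higher = more likely same speaker)
genuine = np.array([0.9, 0.8, 0.85, 0.4, 0.95])
impostor = np.array([0.1, 0.3, 0.2, 0.6, 0.15])
print(equal_error_rate(genuine, impostor))  # → 0.2
```

A "3.64% relative" reduction means the EER itself shrinks by that fraction (e.g. 10.0% → 9.64% absolute), not by 3.64 percentage points.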

Community


We leverage the CycleGAN framework to synthesize emotional utterances for data augmentation, improving speaker verification (SV) systems' robustness to emotional speech and reducing error rates by up to 3.64%.

  • Problem and Approach: Speaker verification systems often fail with emotional speech due to limited labeled data. This work uses CycleGAN to transform neutral utterances into synthetic emotional ones while preserving speaker identity.
  • Key Results: Incorporating synthetic data into training reduced the equal error rate (EER) on emotional utterances by up to 3.64% relative, narrowing the performance gap between neutral and emotional speech.
  • Contributions: This is the first application of CycleGAN for SV tasks, demonstrating its potential for handling emotional variability in biometric systems and promoting fairness and robustness.
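The key property the summary highlights is that the neutral-to-emotional mapping must preserve speaker identity, which CycleGAN enforces through a cycle-consistency loss: mapping neutral features to emotional and back should reconstruct the input. The sketch below illustrates that loss with toy linear "generators" on feature vectors; the real system would use learned neural generators on speech features, and all names and dimensions here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generators": linear maps between neutral and emotional feature
# spaces. In a CycleGAN these are learned networks; here G_en is the exact
# inverse of G_ne purely so the cycle loss is near zero for illustration.
dim = 8
G_ne = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))  # neutral -> emotional
G_en = np.linalg.inv(G_ne)                              # emotional -> neutral

def cycle_consistency_loss(x_neutral):
    """L1 cycle loss: neutral -> emotional -> neutral should reconstruct
    the input, which is what keeps the speaker's identity intact."""
    x_emotional = G_ne @ x_neutral        # synthesize emotional features
    x_reconstructed = G_en @ x_emotional  # map back to the neutral domain
    return np.mean(np.abs(x_reconstructed - x_neutral))

x = rng.normal(size=dim)  # a hypothetical neutral feature vector
print(cycle_consistency_loss(x))  # ~0 here, since G_en inverts G_ne exactly
```

In training, this cycle term is minimized jointly with adversarial losses so the synthesized emotional utterances both sound emotional and still belong to the same speaker.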


