Voice Detection AI - Real vs AI Audio Classifier

Model Overview

VoiceGUARD is a fine-tuned Wav2Vec2-based audio classifier that distinguishes real human voices from AI-generated voices. It was trained on a dataset containing real human recordings and samples produced by various TTS models.


Model Details

  • Architecture: Wav2Vec2ForSequenceClassification
  • Fine-tuned on: Custom dataset with real and AI-generated audio
  • Classes:
    1. Real Human Voice
    2. AI-generated Voice (e.g., MelGAN, DiffWave)
  • Input Requirements:
    • Audio format: .wav, .mp3, or any other format that can be decoded to a waveform
    • Sample rate: 16 kHz (resample audio recorded at other rates)
    • Max duration: 10 seconds (longer clips are truncated, shorter ones are padded)
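
The duration rule above can be sketched in plain Python. This is only an illustration, not the model's actual preprocessing code: the helper name `fix_length` is hypothetical, and zero-padding at the end of the clip is an assumption about how short clips are padded.

```python
SAMPLE_RATE = 16_000   # Hz, from the input requirements
MAX_SECONDS = 10       # clips are cut or padded to this length


def fix_length(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Return a copy of `samples` with exactly sample_rate * max_seconds values.

    Hypothetical helper: longer clips keep only their first samples,
    shorter clips are zero-padded at the end (assumed padding scheme).
    """
    target = sample_rate * max_seconds
    clipped = list(samples[:target])          # truncate if too long
    return clipped + [0.0] * (target - len(clipped))  # pad if too short
```

With a 1-D torch tensor, slicing followed by `torch.nn.functional.pad` would achieve the same effect.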

Performance

  • Robustness: Correctly classifies audio produced by multiple AI-generation models.
  • Limitations: Less reliable on some AI-generation models not seen during training (e.g., ElevenLabs).

How to Use

1. Install Dependencies

Make sure you have transformers, torch, and torchaudio installed:

pip install transformers torch torchaudio

2. Usage

Here's how to use VoiceGUARD for audio classification:

import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor

# Load model and processor
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)

# Load audio; torchaudio returns a (channels, samples) tensor
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")

# Downmix to mono so stereo files also yield a 1-D signal
waveform = waveform.mean(dim=0)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Keep at most the first 10 seconds, matching the input requirements
waveform = waveform[: 16000 * 10]

# Preprocess
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

# Inference
with torch.no_grad():
    logits = model(**inputs).logits
    predicted_id = torch.argmax(logits, dim=-1).item()

# Map class index to label
labels = ["Real Human Voice", "AI-generated"]
print(f"Prediction: {labels[predicted_id]}")
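
The snippet above prints only the winning label. If a confidence score is also useful, the logits can be turned into probabilities with a softmax. The following is a minimal, dependency-free sketch: the helper names `softmax` and `describe` are hypothetical, the logits are made up, and the label order is assumed to match the list used above.

```python
import math

LABELS = ["Real Human Voice", "AI-generated"]  # assumed same order as the usage snippet


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits for illustration; real values would come from
# model(**inputs).logits[0].tolist()
label, confidence = describe([-1.2, 2.3])
print(f"{label} ({confidence:.1%})")
```

When working directly with tensors, `torch.softmax(logits, dim=-1)` computes the same probabilities.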

Training Procedure

  • Data Collection: Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
  • Preprocessing: Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
  • Fine-Tuning: Utilized the Wav2Vec2 architecture for sequence classification, training for 3 epochs with a learning rate of 1e-5.

Evaluation

  • Metrics: Accuracy, Precision, Recall
  • Results: Achieved 99.8% accuracy on the held-out evaluation set.
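
For readers who want to reproduce the metrics listed above on their own data, here is a minimal sketch of how accuracy, precision, and recall are computed for a two-class problem. The function name `binary_metrics` is hypothetical, treating class 1 (AI-generated) as the positive class is an assumption, and the toy labels are not the model's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for two-class labels.

    `positive` marks the class treated as the detection target
    (assumed here to be 1 = AI-generated).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels for illustration only; these are not the model's results.
scores = binary_metrics([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
```

Precision answers "of the clips flagged as AI-generated, how many really were?", while recall answers "of the AI-generated clips, how many were flagged?".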

Limitations and Future Work

  • While VoiceGUARD performs robustly across known AI-generation models, it may encounter challenges with novel or unseen models.
  • Future work includes expanding the training dataset with samples from emerging TTS technologies to enhance generalization.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

  • Special thanks to the developers of the Wav2Vec2 model and the contributors to the datasets used in this project.
  • View the complete project on GitHub