Voice Detection AI - Real vs AI Audio Classifier
Model Overview
This model is a fine-tuned Wav2Vec2-based audio classifier capable of distinguishing between real human voices and AI-generated voices. It has been trained on a dataset containing samples from various TTS models and real human audio recordings.
Model Details
- Architecture: Wav2Vec2ForSequenceClassification
- Fine-tuned on: Custom dataset with real and AI-generated audio
- Classes:
- Real Human Voice
- AI-generated (e.g., MelGAN, DiffWave)
- Input Requirements:
- Audio format: .wav, .mp3, etc.
- Sample rate: 16 kHz
- Max duration: 10 seconds (longer audio is truncated, shorter audio is padded)
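The duration rule above (truncate beyond 10 seconds, pad below it) can be illustrated with a small dependency-free sketch. The helper name `fit_to_max_duration` is hypothetical, and padding with trailing zeros is an assumption: the card does not say which padding value or side is used.

```python
from typing import List, Sequence

TARGET_SAMPLE_RATE = 16_000  # Hz, from the input requirements
MAX_SECONDS = 10             # seconds, from the input requirements


def fit_to_max_duration(
    samples: Sequence[float],
    sample_rate: int = TARGET_SAMPLE_RATE,
    max_seconds: float = MAX_SECONDS,
) -> List[float]:
    """Return exactly sample_rate * max_seconds samples of mono audio."""
    target_len = int(sample_rate * max_seconds)
    # Truncate anything beyond the maximum duration.
    clipped = list(samples[:target_len])
    # Assumption: shorter clips are padded with trailing zeros (silence).
    clipped.extend([0.0] * (target_len - len(clipped)))
    return clipped
```

With a waveform tensor, the same function could be applied as `fit_to_max_duration(waveform.squeeze(0).tolist())` after resampling to 16 kHz.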
Performance
- Robustness: Classifies audio correctly across the multiple AI-generation models it was trained on.
- Limitations: Struggles with certain unseen AI-generation models (e.g., ElevenLabs).
How to Use
1. Install Dependencies
Make sure you have transformers, torch, and torchaudio installed:
pip install transformers torch torchaudio
Usage
Here's how to use VoiceGUARD for audio classification:
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
# Load model and processor
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)
# Load audio; torchaudio returns a (channels, samples) tensor
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")
# Mix multi-channel audio down to mono so the processor receives a 1-D signal
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
# Resample if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)
# Preprocess
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
# Inference
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
# Map to label
labels = ["Real Human Voice", "AI-generated"]
prediction = labels[predicted_ids.item()]
print(f"Prediction: {prediction}")
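The usage snippet reports only the top class. Where a confidence score is also useful, the logits can be turned into per-class probabilities with a softmax. The following is a minimal sketch in plain Python; `softmax` and `label_probabilities` are illustrative helper names, the label order is copied from the snippet above, and the resulting probabilities are not guaranteed to be calibrated.

```python
import math
from typing import Dict, List, Sequence

# Assumption: index 0 is real speech and index 1 is AI-generated,
# matching the label list in the usage snippet.
LABELS = ["Real Human Voice", "AI-generated"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(
    logits: Sequence[float], labels: Sequence[str] = LABELS
) -> Dict[str, float]:
    """Pair each label with its softmax probability."""
    return dict(zip(labels, softmax(logits)))
```

For a single clip, the model output can be passed in as `label_probabilities(logits.squeeze(0).tolist())`.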
Training Procedure
- Data Collection: Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
- Preprocessing: Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
- Fine-Tuning: Utilized the Wav2Vec2 architecture for sequence classification, training for 3 epochs with a learning rate of 1e-5.
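The fine-tuning settings above could be expressed with the Hugging Face `Trainer` API roughly as follows. This is a sketch under assumptions, not the author's training script: the card does not say whether `Trainer` was used, the output directory is hypothetical, and the batch size is not stated anywhere in the card. Only the epoch count, the learning rate, and the base checkpoint come from the card.

```python
from transformers import TrainingArguments, Wav2Vec2ForSequenceClassification

# Base checkpoint listed for this model; two output classes (real vs. AI-generated).
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="voiceguard-finetune",  # hypothetical output path
    num_train_epochs=3,                # stated in the training procedure
    learning_rate=1e-5,                # stated in the training procedure
    per_device_train_batch_size=8,     # assumption: batch size is not stated
)
```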
Evaluation
- Metrics: Accuracy, Precision, Recall
- Results: Achieved 99.8% accuracy on the held-out evaluation set.
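For reference, the three metrics listed above can be computed from true and predicted class indices as in the sketch below. The function name `binary_metrics` is illustrative, and treating index 1 (AI-generated) as the positive class is an assumption; the card does not say which class was treated as positive.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, precision, and recall for a two-class problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        # Precision: of the clips flagged as positive, how many truly are.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: of the truly positive clips, how many were flagged.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Reporting precision and recall alongside accuracy matters here because a detector can score high accuracy on a balanced set while still missing a specific class of synthetic voices.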
Limitations and Future Work
- While VoiceGUARD performs robustly across known AI-generation models, it may encounter challenges with novel or unseen models.
- Future work includes expanding the training dataset with samples from emerging TTS technologies to enhance generalization.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgements
- Base model: facebook/wav2vec2-base