wav2vec2-emotion-recognition
This model is a fine-tuned Wav2Vec2 model for speech emotion recognition. It classifies speech into 8 emotions and returns a confidence score for each.
Model Description
- Model Architecture: Wav2Vec2 with sequence classification head
- Language: English
- Task: Speech Emotion Recognition
- Fine-tuned from: facebook/wav2vec2-base
- Datasets: Combined emotion datasets
  - TESS
  - CREMA-D
  - SAVEE
  - RAVDESS
Performance Metrics
- Accuracy: 79.57%
- F1 Score: 79.43%
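As a reminder of what these two numbers measure, here is a minimal sketch of accuracy and a support-weighted F1 score in plain Python. The card does not say which F1 averaging was used, so the weighted average is an assumption, and the toy labels are invented for illustration only (they are not model outputs).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support.

    Assumption: weighted averaging; the model card does not state
    which averaging produced its reported F1 score.
    """
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(y_true)


# Toy labels, invented purely for illustration.
y_true = ["angry", "sad", "happy", "sad"]
y_pred = ["angry", "sad", "sad", "sad"]

print(f"accuracy:    {accuracy(y_true, y_pred):.2f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.2f}")
```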
Supported Emotions
- Angry
- Calm
- Disgust
- Fearful
- Happy
- Neutral
- Sad
- Surprised
Training Details
The model was trained with the following configuration:
- Epochs: 15
- Batch Size: 16
- Learning Rate: 5e-5
- Optimizer: AdamW
- Weight Decay: 0.03
- Gradient Accumulation Steps: 2
- Mixed Precision: fp16
For the detailed training process, see the Fine-tuning Notebook.
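The configuration above could map onto the Hugging Face `Trainer` API roughly as follows. This is a sketch, not the notebook's actual code: it assumes the listed batch size is per device, that AdamW means the PyTorch implementation (`optim="adamw_torch"`), and a single training device. The output directory name and both helper functions are hypothetical.

```python
# Hyperparameters from the list above, keyed by the matching
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "num_train_epochs": 15,
    "per_device_train_batch_size": 16,  # assumption: "Batch Size" is per device
    "learning_rate": 5e-5,
    "optim": "adamw_torch",  # assumption: PyTorch AdamW
    "weight_decay": 0.03,
    "gradient_accumulation_steps": 2,
    "fp16": True,  # mixed precision; needs a CUDA-capable GPU
}


def effective_batch_size(kwargs, num_devices=1):
    """Samples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_arguments(output_dir="./wav2vec2-emotion"):
    """Build TrainingArguments lazily so transformers is only needed here."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)


print(effective_batch_size(training_kwargs))
```

With gradient accumulation over 2 steps, each optimizer update sees 16 x 2 = 32 samples on a single device.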
Limitations
Audio Requirements:
- Sampling rate: 16 kHz (other rates are resampled to 16 kHz before inference)
- Maximum duration: 1 minute
- Clear speech with minimal background noise recommended
Performance Considerations:
- Best results with clear speech audio
- Performance may vary with different accents
- Background noise can affect accuracy
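The audio requirements above can be checked before running inference. The following is a minimal sketch in plain Python; the function and constant names are hypothetical, and it only inspects sample count and sampling rate, not noise or speech quality.

```python
TARGET_SAMPLE_RATE = 16_000   # Hz, as listed under Audio Requirements
MAX_DURATION_SECONDS = 60.0   # 1 minute, as listed under Audio Requirements


def check_audio(num_samples, sampling_rate):
    """Report duration and whether a clip needs resampling or is too long."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    duration = num_samples / sampling_rate
    return {
        "duration_seconds": duration,
        "needs_resampling": sampling_rate != TARGET_SAMPLE_RATE,
        "too_long": duration > MAX_DURATION_SECONDS,
    }


def max_samples_after_resampling():
    """Largest number of 16 kHz samples that fits in the duration limit."""
    return int(MAX_DURATION_SECONDS * TARGET_SAMPLE_RATE)


# A 90-second clip recorded at 44.1 kHz fails both checks.
print(check_audio(num_samples=44_100 * 90, sampling_rate=44_100))
```

With `torchaudio.load`, the sample count is the last dimension of the returned tensor (`speech_array.shape[-1]`). How to handle a clip that is too long (truncate, split, or reject) is left to the caller.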
Demo
https://huggingface.co/spaces/Dpngtm/Audio-Emotion-Recognition
Contact
- GitHub: DGautam11
- LinkedIn: Deepan Gautam
- Hugging Face: @Dpngtm
For issues and questions, feel free to:
- Open an issue on the Model Repository
- Comment on the Demo Space
Usage
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
import torch
import torchaudio
# Load model and processor
model = Wav2Vec2ForSequenceClassification.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
processor = Wav2Vec2Processor.from_pretrained("Dpngtm/wav2vec2-emotion-recognition")
# Load and preprocess audio
speech_array, sampling_rate = torchaudio.load("path_to_audio.wav")
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)
# Convert to mono if stereo
if speech_array.shape[0] > 1:
    speech_array = torch.mean(speech_array, dim=0, keepdim=True)
speech_array = speech_array.squeeze().numpy()
# Process through model
inputs = processor(speech_array, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Get predicted emotion
emotion_labels = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
predicted_emotion = emotion_labels[predictions.argmax().item()]
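The snippet above returns only the top label. To report the confidence scores as well, the logits can be turned into a ranked list. The sketch below uses plain Python so it runs without the model; `rank_emotions` is a hypothetical helper and the example logits are invented. It assumes the label order shown in the usage code. If the checkpoint's `model.config.id2label` holds meaningful names, reading the labels from there would be safer than hard-coding them.

```python
import math

# Label order copied from the usage code above.
EMOTION_LABELS = ["angry", "calm", "disgust", "fearful",
                  "happy", "neutral", "sad", "surprised"]


def rank_emotions(logits, labels=EMOTION_LABELS, top_k=3):
    """Softmax the logits and return the top_k (label, probability) pairs."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Invented logits for illustration; not real model output.
example_logits = [0.1, -1.2, 0.3, 0.0, 2.5, 0.4, -0.5, 1.1]

for label, prob in rank_emotions(example_logits):
    print(f"{label}: {prob:.1%}")
```

With the real model, the input would be `outputs.logits[0].tolist()` for a single clip.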