Emotion Detection From Speech

This model is a fine-tuned version of DistilHuBERT that classifies emotions from audio input.

Approach

  1. Dataset: The RAVDESS dataset, comprising 1,440 audio files labeled with 8 emotions: calm, happy, sad, angry, fearful, surprise, neutral, and disgust.
  2. Model Fine-Tuning: The DistilHuBERT model was fine-tuned for 7 epochs with a learning rate of 5e-5, achieving an accuracy of 98% on the test dataset.

Data Preprocessing

  • Sampling Rate: Audio files were resampled to 16 kHz to match the model's requirements.
  • Padding: Audio clips shorter than 30 seconds were zero-padded to that length.
  • Train-Test Split: 80% of the samples were used for training, and 20% for testing.
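
The padding and split steps above can be sketched with NumPy alone. This is a minimal illustration, not the author's preprocessing code: the helper names `pad_clip` and `train_test_indices`, the random seed, and the choice to truncate clips longer than 30 seconds are all assumptions. Resampling to 16 kHz is not shown; it is usually delegated to an audio library.

```python
import numpy as np

SAMPLE_RATE = 16_000                     # 16 kHz, as stated in the list above
MAX_SECONDS = 30                         # clips shorter than this are zero-padded
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS  # 480,000 samples


def pad_clip(waveform: np.ndarray, max_samples: int = MAX_SAMPLES) -> np.ndarray:
    """Zero-pad a 1-D waveform on the right up to max_samples.

    Truncating longer clips is an assumption; the model card only
    describes padding.
    """
    if len(waveform) >= max_samples:
        return waveform[:max_samples]
    return np.pad(waveform, (0, max_samples - len(waveform)))


def train_test_indices(n_samples: int, test_fraction: float = 0.2, seed: int = 0):
    """Shuffle sample indices and split them into train and test parts."""
    rng = np.random.default_rng(seed)    # seed is a hypothetical choice
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


# Stand-in for a 3-second clip already resampled to 16 kHz.
clip = np.ones(3 * SAMPLE_RATE, dtype=np.float32)
padded = pad_clip(clip)

# 1,440 files split 80/20.
train_idx, test_idx = train_test_indices(1440)
```

With 1,440 files, an 80/20 split leaves 1,152 clips for training and 288 for testing.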

Model Architecture

  • DistilHuBERT: A lightweight variant of HuBERT, fine-tuned for emotion classification.
  • Fine-Tuning Setup:
    • Optimizer: AdamW
    • Loss Function: Cross-Entropy
    • Learning Rate: 5e-5
    • Warm-up Ratio: 0.1
    • Epochs: 7
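
To make the warm-up ratio concrete, the sketch below computes a learning-rate schedule from the settings listed above. Several details are assumptions rather than facts from this card: the batch size of 8 is hypothetical, and the linear decay after warm-up is one common choice that the card does not confirm.

```python
import math

PEAK_LR = 5e-5         # learning rate from the setup above
WARMUP_RATIO = 0.1     # warm-up ratio from the setup above
EPOCHS = 7             # epochs from the setup above
TRAIN_SAMPLES = 1152   # 80% of the 1,440 files
BATCH_SIZE = 8         # hypothetical; the card does not state a batch size

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = total_steps - step
    return PEAK_LR * max(0.0, remaining / (total_steps - warmup_steps))


for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, learning_rate(step))
```

Under these assumptions training runs for 1,008 optimizer steps, the first 101 of which ramp the learning rate up from zero to 5e-5.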

Results

  • Accuracy: 0.98 on the test dataset
  • Loss: 0.10 on the test dataset

Usage

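from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
pipe = pipeline(
    "audio-classification",
    model="BilalHasan/distilhubert-finetuned-ravdess",
)

# Replace this placeholder with the path to your own audio file.
path_to_your_audio = "path/to/audio.wav"

# Returns a list of {"label", "score"} dicts, highest score first.
emotion = pipe(path_to_your_audio)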
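
The audio-classification pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one way to reduce that list to a single predicted emotion. The helper `top_emotion` is hypothetical, and the example scores are made up for illustration; real labels and scores come from the model.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def top_emotion(predictions: List[Prediction]) -> str:
    """Return the label with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return str(best["label"])


# Hypothetical pipeline output, invented for this example only.
example_output = [
    {"label": "happy", "score": 0.91},
    {"label": "surprise", "score": 0.05},
    {"label": "calm", "score": 0.04},
]

print(top_emotion(example_output))
```

In practice the call would be `top_emotion(pipe(path_to_your_audio))`.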

Demo

You can access the live demo of the app on Hugging Face Spaces.
