---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- speech
- swedish
- telephonic
- transformers
datasets:
- WMRNORDIC/swedish-telephonic-dataset
metrics:
- wer
base_model: openai/whisper-small
base_model_relation: finetune
license: apache-2.0
language:
- sv
- en
model-index:
- name: whisper-swedish-telephonic
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Swedish Telephonic Dataset
      type: custom
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 0.170
    - name: Base Model WER (Comparison)
      type: wer
      value: 0.888
---
# whisper-swedish-telephonic
## Model Overview
**`whisper-swedish-telephonic`** is a fine-tuned version of OpenAI's Whisper-Small model, specifically designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.
### Key Features:
- **Language:** Swedish (primary), with limited support for minor English segments.
- **Audio Types:** Telephonic conversations, customer support recordings, and general low-bandwidth audio.
- **Sample Rate:** Built for 8 kHz telephony audio, which must be resampled to 16 kHz before it is passed to the model.
- **Special Tokens:** Supports conversational markers, disfluencies, and speaker-specific tags.
- **Performance:** Reaches a WER of 0.170 on the telephonic test set, compared with 0.888 for the base Whisper-Small model.
---
## Dataset
The model was fine-tuned using the **Swedish Telephonic Dataset**, consisting of:
- **Duration:** ~97 hours of annotated audio.
- **Domains:** Call center recordings, customer service conversations.
- **Annotations:**
  - Speaker IDs and timestamps.
  - Conversational tags: `(())`, `~`, `<overlap>`.
  - Language switching: `<lang:English>...</lang:English>`.
### Preprocessing:
- **Audio:** Resampled to 16kHz.
- **Segmentation:** Audio segments are aligned with the annotation timestamps.
- **Special Tokens:** Includes non-speech sounds like `[cough]`, `[laugh]`.
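
Downstream tools such as search or sentiment analysis often need plain text without the markup listed above. The following is a minimal sketch of one way to handle the language-switch, overlap, and non-speech tags. The function names are hypothetical, and the handling is an assumption about how these tags appear in transcripts, not the dataset's official tooling. The `(())` and `~` markers are left untouched because their exact semantics are not specified here.

```python
import re

# Matches <lang:English>...</lang:English>; the closing tag must name the same language.
LANG_SPAN = re.compile(r"<lang:(\w+)>(.*?)</lang:\1>", re.DOTALL)
# Matches bracketed non-speech events such as [cough] or [laugh].
NON_SPEECH = re.compile(r"\[(\w+)\]")


def extract_language_spans(text: str) -> list[tuple[str, str]]:
    """Return (language, content) pairs for every language-switch span."""
    return [(m.group(1), m.group(2).strip()) for m in LANG_SPAN.finditer(text)]


def strip_markup(text: str) -> str:
    """Drop tags but keep the spoken words inside language-switch spans."""
    text = LANG_SPAN.sub(lambda m: m.group(2), text)
    text = text.replace("<overlap>", "")
    text = NON_SPEECH.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace


sample = "ja <lang:English>thank you</lang:English> [laugh] hej <overlap>"
print(extract_language_spans(sample))  # [('English', 'thank you')]
print(strip_markup(sample))            # ja thank you hej
```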
---
## Model Performance
### Word Error Rate (WER) Evaluation
The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.
| Metric | Fine-Tuned Model | Base Whisper-Small |
|----------|------------------|--------------------|
| **WER** | 0.170 | 0.888 |
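
The gap between the two figures in the table above can also be expressed as an absolute and a relative WER reduction. The short calculation below uses only the two reported values; the function name is illustrative.

```python
def wer_reduction(base_wer: float, tuned_wer: float) -> tuple[float, float]:
    """Return (absolute, relative) WER reduction of the tuned model over the base."""
    absolute = base_wer - tuned_wer
    relative = absolute / base_wer
    return absolute, relative


absolute, relative = wer_reduction(base_wer=0.888, tuned_wer=0.170)
print(f"Absolute WER reduction: {absolute:.3f}")  # 0.718
print(f"Relative WER reduction: {relative:.1%}")  # 80.9%
```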
### Key Observations:
- **Fine-Tuned Model:**
  - Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
  - Handles speaker-specific annotations and conversational markers effectively.
- **Base Model:**
  - Struggles with Swedish syntax and domain-specific vocabulary.
  - Outputs nonsensical transcriptions for longer or complex sentences.
---
## Example Transcriptions
| Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
|---------|---------------------------------------------|------------------------------------------|----------------------|------------------|------------|
| 1 | så nu | så nu | so, no | 0.000 | 1.000 |
| 2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
| 3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
| 5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
| 14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |
**Note:** Full segment-wise evaluation logs are available in the repository.
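
For reference, per-segment WER is conventionally the word-level edit distance between hypothesis and ground truth, divided by the number of ground-truth words. The sketch below implements that definition with whitespace tokenization and no text normalization. The exact normalization used for the reported figures is not stated here, so this is an assumption and may not reproduce every row.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Segment 2, base model: one substitution and one deletion over five words.
print(word_error_rate("nu record du båda va", "nu rekordar du båda"))  # 0.4
```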
---
## Audio Example
This audio file demonstrates the model's transcription abilities:
- **File:** [trimmed_resampled_audio.wav](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic/blob/main/trimmed_resampled_audio.wav)
- **Content:** *Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?*
- **Audio Type:** Telephonic conversation.
- **Sample Rate:** 16kHz (resampled).
- **Purpose:** Showcasing the model's capabilities in transcribing Swedish telephonic speech.
---
## Intended Use
This model is designed for:
- **Customer Support Automation:** Transcription and analysis of call center recordings.
- **Telephony Analytics:** Sentiment analysis, compliance monitoring, and business intelligence.
- **Swedish Language Research:** Study of conversational patterns and colloquial expressions.
### Limitations:
- **Language Support:** Primarily Swedish; limited support for English.
- **Audio Quality:** Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
- **Preprocessing Requirement:** All input audio, including 8 kHz telephony recordings, must be resampled to 16 kHz before it is passed to the processor.
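
Before transcribing a batch of recordings, it can help to check which files are not yet at 16 kHz. The sketch below reads the WAV header with Python's standard `wave` module. The helper names are hypothetical, and `wave` only handles PCM-encoded WAV files, so A-law or mu-law telephony recordings would need a different reader.

```python
import wave

TARGET_RATE = 16_000  # sampling rate expected by the Whisper feature extractor


def wav_info(path: str) -> tuple[int, int, float]:
    """Return (sample_rate, channels, duration_seconds) from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def needs_resampling(path: str, target_rate: int = TARGET_RATE) -> bool:
    """True when the file's sampling rate differs from the target rate."""
    rate, _, _ = wav_info(path)
    return rate != target_rate


# Example usage (the file path is a placeholder):
# if needs_resampling("path_to_audio.wav"):
#     print("Resample to 16 kHz before calling the processor")
```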
---
## Try the Model
You can test the model using the Hugging Face Playground or the dedicated endpoint:
- **Playground:** [Test the Model](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic)
- **Dedicated Endpoint:** [Endpoint URL](https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud)
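
A dedicated endpoint of this kind is typically called by POSTing the raw audio bytes with a bearer token. The sketch below uses only the standard library. Several parts are assumptions rather than confirmed details: that the endpoint is running, that it requires a Hugging Face access token, and that it returns the default speech-recognition response of the form `{"text": "..."}`. `build_request` and `transcribe` are illustrative names.

```python
import json
import urllib.request

ENDPOINT_URL = "https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud"


def build_request(audio_bytes: bytes, token: str,
                  content_type: str = "audio/wav") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {token}",  # assumed: token-protected endpoint
            "Content-Type": content_type,
        },
        method="POST",
    )


def transcribe(path: str, token: str) -> str:
    """Send an audio file to the endpoint and return the transcribed text."""
    with open(path, "rb") as audio_file:
        request = build_request(audio_file.read(), token)
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["text"]  # assumed response shape: {"text": "..."}


# Example usage (placeholder path and token):
# print(transcribe("path_to_audio.wav", token="hf_..."))
```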
---
## How to Use
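```python
import librosa  # used here instead of soundfile so the audio is resampled on load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load audio as mono and resample it to the rate the processor expects.
# The Whisper processor raises an error for any rate other than 16 kHz,
# so 8 kHz telephony recordings must be resampled first.
target_rate = processor.feature_extractor.sampling_rate  # 16000 for Whisper
audio, sample_rate = librosa.load("path_to_audio.wav", sr=target_rate, mono=True)
inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

# Transcribe (by default the processor truncates input to Whisper's 30-second window)
generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
```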