---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- whisper
- speech
- swedish
- telephonic
- transformers
datasets:
- WMRNORDIC/swedish-telephonic-dataset
metrics:
- wer
base_model: openai/whisper-small
base_model_relation: finetune
license: apache-2.0
language:
- sv
- en
model-index:
- name: whisper-swedish-telephonic
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Swedish Telephonic Dataset
      type: custom
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 0.170
    - name: Base Model WER (Comparison)
      type: wer
      value: 0.888
---

# whisper-swedish-telephonic

## Model Overview
**`whisper-swedish-telephonic`** is a fine-tuned version of OpenAI's Whisper-Small model for transcribing Swedish telephonic audio. It is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.

### Key Features:
- **Language:** Swedish (primary), with limited support for short English segments.
- **Audio Types:** Telephonic conversations, customer support recordings, and general low-bandwidth audio.
- **Sample Rate:** 8 kHz source audio, resampled to 16 kHz before inference.
- **Special Tokens:** Supports conversational markers, disfluencies, and speaker-specific tags.
- **Performance:** WER of 0.170 on the Swedish telephonic test set, compared with 0.888 for the base Whisper-Small model.

---

## Dataset
The model was fine-tuned on the **Swedish Telephonic Dataset**, which consists of:

- **Duration:** ~97 hours of annotated audio.
- **Domains:** Call center recordings, customer service conversations.
- **Annotations:**
  - Speaker IDs and timestamps.
  - Conversational tags: `(())`, `~`, `<overlap>`.
  - Language switching: `<lang:English>...</lang:English>`.

### Preprocessing:
- **Audio:** Resampled to 16 kHz.
- **Segmentation:** Audio segments aligned with the annotated timestamps.
- **Special Tokens:** Non-speech sounds are marked with tokens such as `[cough]` and `[laugh]`.
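
If downstream analysis needs plain text, the annotation markers listed above can be stripped from a transcript. The following is a minimal sketch, assuming the markers appear literally as written in the lists above; the helper names and the sample line are hypothetical, and the dataset's exact annotation conventions may differ (for instance, `(())` may wrap text rather than stand alone).

```python
import re

# Assumption: markers appear literally in the transcript text, as listed above.
LANG_SPAN = re.compile(r"<lang:(\w+)>(.*?)</lang:\1>", re.DOTALL)
NON_SPEECH = re.compile(r"\[(?:cough|laugh)\]")
CONVERSATIONAL = ("<overlap>", "(())", "~")


def language_spans(text: str) -> list[tuple[str, str]]:
    """Return (language, words) pairs for each language-switched span."""
    return [(m.group(1), m.group(2).strip()) for m in LANG_SPAN.finditer(text)]


def strip_annotations(text: str) -> str:
    """Remove annotation markers and keep only the spoken words."""
    text = LANG_SPAN.sub(lambda m: m.group(2), text)  # keep the inner words
    text = NON_SPEECH.sub(" ", text)
    for marker in CONVERSATIONAL:
        text = text.replace(marker, " ")
    return " ".join(text.split())  # collapse runs of whitespace


# Hypothetical sample line, not taken from the dataset.
sample = "ja [laugh] det är <lang:English>okay</lang:English> ~ tack"
print(language_spans(sample))     # [('English', 'okay')]
print(strip_annotations(sample))  # ja det är okay tack
```
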

---

## Model Performance
### Word Error Rate (WER) Evaluation
The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model on a Swedish telephonic test set of 207 labeled speech segments.

| Metric   | Fine-Tuned Model | Base Whisper-Small |
|----------|------------------|--------------------|
| **WER**  | 0.170            | 0.888              |
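
WER is the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch, assuming whitespace tokenization and no text normalization; the evaluation script behind the table may normalize text differently, so treat this as an illustration rather than the scoring code used for this card. The sample strings come from the example transcriptions in this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("till frankrike", "thank you"))       # 1.0
print(word_error_rate("till frankrike", "till frankrike"))  # 0.0
```

For the aggregate figures above, the relative WER reduction is (0.888 - 0.170) / 0.888, or roughly 81%.
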

### Key Observations:
- **Fine-Tuned Model:**
  - Transcribes colloquial Swedish, domain-specific terminology, and long utterances accurately.
  - Handles speaker-specific annotations and conversational markers effectively.
- **Base Model:**
  - Struggles with Swedish syntax and domain-specific vocabulary.
  - Outputs nonsensical transcriptions for longer or complex sentences, and sometimes falls back to English (for example, "thank you" for "till frankrike").

---

## Example Transcriptions

| Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
|---------|--------------|------------------|------------|------------------|------------|
| 1 | så nu | så nu | so, no | 0.000 | 1.000 |
| 2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
| 3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
| 5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
| 14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |

**Note:** Full segment-wise evaluation logs are available in the repository.

---

## Audio Example
This audio file demonstrates the model's transcription abilities:

- **File:** [trimmed_resampled_audio.wav](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic/blob/main/trimmed_resampled_audio.wav)
- **Content:** *Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?*
- **Audio Type:** Telephonic conversation.
- **Sample Rate:** 16 kHz (resampled).
- **Purpose:** Showcasing the model's capabilities in transcribing Swedish telephonic speech.

---

## Intended Use
This model is designed for:
- **Customer Support Automation:** Transcription and analysis of call center recordings.
- **Telephony Analytics:** Sentiment analysis, compliance monitoring, and business intelligence.
- **Swedish Language Research:** Study of conversational patterns and colloquial expressions.

### Limitations:
- **Language Support:** Primarily Swedish; limited support for English.
- **Audio Quality:** Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
- **Preprocessing Requirement:** Input audio must be resampled to 16 kHz before it is passed to the processor. The Whisper feature extractor expects 16 kHz input, so 8 kHz telephone recordings need upsampling first.
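
Before transcription, it can help to check whether a recording already matches the 16 kHz mono input that Whisper expects. The following is a minimal sketch using only the standard-library `wave` module, assuming uncompressed PCM WAV files (the `wave` module rejects compressed formats such as μ-law). `inspect_wav` and the keys it returns are hypothetical names, and the resampling itself is left to an audio library of your choice.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # input rate expected by the Whisper feature extractor


def inspect_wav(path: str) -> dict:
    """Read the WAV header and report what preprocessing the file needs."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        frames = wav_file.getnframes()
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": frames / sample_rate,
        "needs_resampling": sample_rate != TARGET_SAMPLE_RATE,
        "needs_downmix": channels > 1,
    }


# Example call (the path is a placeholder):
# inspect_wav("path_to_audio.wav")["needs_resampling"]
```
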

---

## Try the Model
You can test the model using the Hugging Face Playground or the dedicated endpoint:

- **Playground:** [Test the Model](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic)
- **Dedicated Endpoint:** [Endpoint URL](https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud)
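
For programmatic access, a dedicated endpoint of this kind is typically called with an authenticated HTTP POST of the raw audio bytes. The sketch below assumes that the endpoint uses the default Hugging Face speech-recognition handler, that it is still running, and that an access token is available in a hypothetical `HF_TOKEN` environment variable. None of these details are confirmed by this card, and the function names are illustrative.

```python
import json
import os
import urllib.request

ENDPOINT_URL = "https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud"


def build_transcription_request(audio_bytes: bytes, token: str) -> urllib.request.Request:
    """Build a POST request carrying raw WAV bytes (assumed payload format)."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "audio/wav",
        },
        method="POST",
    )


def transcribe_file(path: str) -> dict:
    """Send the file to the endpoint and return the decoded JSON response."""
    with open(path, "rb") as audio_file:
        request = build_transcription_request(audio_file.read(), os.environ["HF_TOKEN"])
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```
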

---

## How to Use
```python
import librosa
import soundfile as sf
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load model and processor
model_name = "WMRNORDIC/whisper-swedish-telephonic"
model = WhisperForConditionalGeneration.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

# Load audio, downmix to mono, and resample to the 16 kHz Whisper expects
# (the processor raises an error if given any other sampling rate)
audio, sample_rate = sf.read("path_to_audio.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
if sample_rate != 16000:
    audio = librosa.resample(audio, orig_sr=sample_rate, target_sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Transcribe without tracking gradients
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Transcription:", transcription)
```