	---
	library_name: transformers
	pipeline_tag: automatic-speech-recognition
	tags:
	- whisper
	- speech
	- swedish
	- telephonic
	- transformers
	datasets:
	- WMRNORDIC/swedish-telephonic-dataset
	metrics:
	- wer
	base_model: openai/whisper-small
	base_model_relation: finetune
	license: apache-2.0
	language:
	- sv
	- en
	model-index:
	- name: whisper-swedish-telephonic
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Swedish Telephonic Dataset
	      type: custom
	      split: test
	    metrics:
	    - name: Word Error Rate (WER)
	      type: wer
	      value: 0.170
	    - name: Base Model WER (Comparison)
	      type: wer
	      value: 0.888
	---

	# whisper-swedish-telephonic

	## Model Overview
	`whisper-swedish-telephonic` is a fine-tuned version of OpenAI's Whisper-Small model, specifically designed for transcribing Swedish telephonic audio. The model is optimized for low-bandwidth, multi-speaker conversations such as call center interactions.

	### Key Features:
	- Language: Swedish (primary), with limited support for minor English segments.
	- Audio Types: Telephonic conversations, customer support recordings, and general low-bandwidth audio.
	- Sample Rate: 8 kHz source audio, resampled to 16 kHz for the model.
	- Special Tokens: Supports conversational markers, disfluencies, and speaker-specific tags.
	- Performance: Demonstrates significantly improved transcription accuracy over the base model for telephonic speech.

	---

	## Dataset
	The model was fine-tuned using the Swedish Telephonic Dataset, consisting of:

	- Duration: ~97 hours of annotated audio.
	- Domains: Call center recordings, customer service conversations.
	- Annotations:
	  - Speaker IDs and timestamps.
	  - Conversational tags: `(())`, `~`, `<overlap>`.
	  - Language switching: `<lang:English>...</lang:English>`.
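
For scoring or downstream text processing it can be useful to strip these markers from a transcript. The sketch below is one way to do that. The helper name `strip_annotations` is hypothetical, and because this card does not define what each marker means, every rule in it is an assumption: it simply drops the markers and keeps the words they wrap.

```python
import re

# Hypothetical helper: the card lists these markers but does not define
# their semantics, so each pattern below is an assumption.
_LANG_TAG = re.compile(r"</?lang:[^>]+>")   # <lang:English> ... </lang:English>
_OVERLAP_TAG = re.compile(r"</?overlap>")   # <overlap> (and </overlap>, if used)
_NON_SPEECH = re.compile(r"\[[^\]]+\]")     # bracketed events such as [cough], [laugh]
_MARKERS = re.compile(r"\(\(|\)\)|~")       # the (( )) and ~ markers


def strip_annotations(text: str) -> str:
    """Remove annotation markers while keeping the words they wrap."""
    for pattern in (_LANG_TAG, _OVERLAP_TAG, _NON_SPEECH, _MARKERS):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removed markers.
    return " ".join(text.split())


print(strip_annotations("ja <lang:English>okay</lang:English> [laugh] det är ~ bra"))
# prints: ja okay det är bra
```

Whether bracketed non-speech events should be dropped or kept depends on the use case; adjust the pattern list accordingly.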

	### Preprocessing:
	- Audio: Resampled to 16kHz.
	- Segmentations: Aligned with timestamps.
	- Special Tokens: Includes non-speech sounds like `[cough]`, `[laugh]`.
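
The resampling step can be sketched as follows. This is a minimal illustration, not the pipeline used to build the dataset: the card does not say which resampler was used, so polyphase resampling via `scipy.signal.resample_poly` is an assumption, and `resample_to_16k` is a hypothetical helper name.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sample rate expected by Whisper's feature extractor


def resample_to_16k(audio: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz using polyphase filtering."""
    if source_rate == TARGET_RATE:
        return audio
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(TARGET_RATE, source_rate)
    up = TARGET_RATE // divisor
    down = source_rate // divisor
    return resample_poly(audio, up, down)


# One second of a 440 Hz tone at the 8 kHz telephony rate.
t = np.arange(8_000) / 8_000
tone = np.sin(2 * np.pi * 440 * t)
resampled = resample_to_16k(tone, 8_000)
print(len(tone), len(resampled))  # 8000 16000
```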

	---

	## Model Performance
	### Word Error Rate (WER) Evaluation
	The fine-tuned model was benchmarked against OpenAI's base Whisper-Small model using a Swedish telephonic test dataset containing 207 labeled speech segments.

	| Metric | Fine-Tuned Model | Base Whisper-Small |
	|--------|------------------|--------------------|
	| WER    | 0.170            | 0.888              |
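
To put the two figures in proportion, the relative WER reduction can be computed directly from the table. The snippet below uses only the two reported values; the variable names are illustrative.

```python
fine_tuned_wer = 0.170  # from the table above
base_wer = 0.888        # from the table above

# Relative reduction: the share of the base model's errors that is removed.
relative_reduction = (base_wer - fine_tuned_wer) / base_wer
print(f"Relative WER reduction: {relative_reduction:.1%}")  # 80.9%
```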

	### Key Observations:
	- Fine-Tuned Model:
	  - Excellent transcription accuracy for colloquial Swedish, domain-specific terminology, and long utterances.
	  - Handles speaker-specific annotations and conversational markers effectively.
	- Base Model:
	  - Struggles with Swedish syntax and domain-specific vocabulary.
	  - Outputs nonsensical transcriptions for longer or complex sentences.

	---

	## Example Transcriptions

	| Segment | Ground Truth | Fine-Tuned Model | Base Model | WER (Fine-Tuned) | WER (Base) |
	|---------|--------------|------------------|------------|------------------|------------|
	| 1 | så nu | så nu | so, no | 0.000 | 1.000 |
	| 2 | nu record du båda va | nu record du båda va | nu rekordar du båda | 0.000 | 0.400 |
	| 3 | ja jag kommer inte ihåg | ja jag kommer inte ihåg | i am coming to you | 0.000 | 1.000 |
	| 5 | sen när då, sen alltid... inga gäster | sen när då, sen alltid... inga gäster | sen då, sen alltid... ingen gest | 0.000 | 0.250 |
	| 14 | till frankrike | till frankrike | thank you | 0.000 | 1.000 |

	Note: Full segment-wise evaluation logs are available in the repository.
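
For readers who want to reproduce a per-segment score, here is a minimal sketch of word-level WER: the word edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes plain whitespace tokenization with no text normalization; the card does not state which scoring tool or normalization produced the reported numbers, so results on other segments may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Segments 1 and 2 from the table, scored against the base model output.
print(word_error_rate("så nu", "so, no"))                             # 1.0
print(word_error_rate("nu record du båda va", "nu rekordar du båda"))  # 0.4
```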

	---

	## Audio Example
	This audio file demonstrates the model's transcription abilities:

	- File: [trimmed_resampled_audio.wav](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic/blob/main/trimmed_resampled_audio.wav)
	- Content: Hej du har kommit till Dressmann. Du pratar med Isabelle. Vad kan jag hjälpa dig?
	- Audio Type: Telephonic conversation.
	- Sample Rate: 16kHz (resampled).
	- Purpose: Showcasing the model's capabilities in transcribing Swedish telephonic speech.

	---

	## Intended Use
	This model is designed for:
	- Customer Support Automation: Transcription and analysis of call center recordings.
	- Telephony Analytics: Sentiment analysis, compliance monitoring, and business intelligence.
	- Swedish Language Research: Study of conversational patterns and colloquial expressions.

	### Limitations:
	- Language Support: Primarily Swedish; limited support for English.
	- Audio Quality: Optimized for telephonic audio; performance may degrade with studio-quality or highly noisy audio.
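	- Preprocessing Requirement: Audio must be at 16 kHz when it is passed to the processor, so 8 kHz telephony recordings (and audio at any other sample rate) need to be resampled first.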

	---

	## Try the Model
	You can test the model using the Hugging Face Playground or the dedicated endpoint:

	- Playground: [Test the Model](https://huggingface.co/WMRNORDIC/whisper-swedish-telephonic)
	- Dedicated Endpoint: [Endpoint URL](https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud)
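
A request to the dedicated endpoint might look like the sketch below. Several details are assumptions rather than facts from this card: that the endpoint requires a bearer token, that it accepts raw WAV bytes with an `audio/wav` content type, and that the JSON response carries the transcription under a `text` key. The function names are hypothetical, and the endpoint may not be running at any given time.

```python
import json
import urllib.request

ENDPOINT_URL = "https://zckhajpu2q8h0sjw.us-east-1.aws.endpoints.huggingface.cloud"


def build_request(audio_bytes: bytes, token: str) -> urllib.request.Request:
    """Build a POST request carrying raw audio bytes (assumed request format)."""
    return urllib.request.Request(
        ENDPOINT_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {token}",  # assumption: token-protected endpoint
            "Content-Type": "audio/wav",         # assumption: raw WAV payload
        },
        method="POST",
    )


def transcribe(path: str, token: str) -> str:
    """Send a WAV file to the endpoint and return the transcription."""
    with open(path, "rb") as audio_file:
        request = build_request(audio_file.read(), token)
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON of the form {"text": "..."}.
        return json.loads(response.read())["text"]
```

Calling `transcribe("path_to_audio.wav", token)` with a valid access token would then return the transcription as a string, provided the assumptions above hold.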

	---

	## How to Use
	```python
	import librosa
	from transformers import WhisperForConditionalGeneration, WhisperProcessor

	# Load model and processor
	model_name = "WMRNORDIC/whisper-swedish-telephonic"
	model = WhisperForConditionalGeneration.from_pretrained(model_name)
	processor = WhisperProcessor.from_pretrained(model_name)

	# Load the audio as mono and resample it to 16 kHz; the Whisper feature
	# extractor rejects any other sampling rate, including 8 kHz telephony audio.
	audio, sample_rate = librosa.load("path_to_audio.wav", sr=16000)
	inputs = processor(audio, sampling_rate=sample_rate, return_tensors="pt")

	# Transcribe
	generated_ids = model.generate(inputs.input_features)
	transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

	print("Transcription:", transcription)
	```