metadata
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- speaker-diarization
license: mit
language:
- en
Configuration
This model card outlines the setup of a speaker diarization model fine-tuned on synthetic medical audio data.
Before starting, please ensure the requirements are met:
- Install pyannote.audio 3.1 with pip install pyannote.audio
- Accept pyannote/segmentation-3.0 user conditions
- Accept pyannote/speaker-diarization-3.1 user conditions
- Create access token at hf.co/settings/tokens.
- Download pytorch_model.bin and config.yaml files into your local directory.
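Before loading anything, it can help to confirm that both downloaded files are where the later steps expect them. The following is a minimal sketch using only the standard library. The helper name missing_model_files is hypothetical (it is not part of pyannote.audio), and the directory models/pyannote_sd_normal is the path used in the Usage section.

```python
from pathlib import Path

# File names taken from the requirements list above.
REQUIRED_FILES = ("pytorch_model.bin", "config.yaml")


def missing_model_files(model_dir):
    """Return the required file names that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    # Same local directory as model_path in the Usage section.
    missing = missing_model_files("models/pyannote_sd_normal")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files found.")
```

An empty list means the directory is ready to be used as model_path.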
Usage
Load trained segmentation model
import torch
from pyannote.audio import Model
# Load the original architecture. use_auth_token=True uses your locally saved Hugging Face token; pass your token string instead if you have not logged in
model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token=True)
# Path to the downloaded pytorch model
model_path = "models/pyannote_sd_normal"
# Load fine-tuned weights from the pytorch_model.bin file (mapped to CPU so this also works without a GPU)
model.load_state_dict(torch.load(model_path + "/pytorch_model.bin", map_location="cpu"))
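The comments above ask you to supply your own auth token. One way to do that without hard-coding it is to read it from an environment variable. This is a sketch under assumptions: the helper resolve_auth_token is hypothetical, and the variable name HF_TOKEN is an assumption about your environment, so adjust it to match your setup. When the variable is unset, the helper returns True, which asks the loader to use the token saved by a previous Hugging Face login.

```python
import os


def resolve_auth_token(env_var="HF_TOKEN"):
    """Return the token stored in env_var, or True to fall back to the saved login.

    env_var is an assumed variable name; change it to match your environment.
    """
    token = os.environ.get(env_var, "").strip()
    return token if token else True


# Intended use (requires pyannote.audio, so it is left commented out here):
# model = Model.from_pretrained(
#     "pyannote/segmentation-3.0", use_auth_token=resolve_auth_token()
# )
```

The same value can be passed to Pipeline.from_pretrained in the next step.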
Load fine-tuned speaker diarization pipeline
from pyannote.audio import Pipeline
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.audio.pipelines import SpeakerDiarization
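# Initialize the pretrained pyannote pipeline (replace use_auth_token=True with your own token if needed)
pretrained_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=True,
)
# Build a new pipeline that reuses the pretrained embedding and clustering settings
# but swaps in the fine-tuned segmentation model
finetuned_pipeline = SpeakerDiarization(
    segmentation=model,
    embedding=pretrained_pipeline.embedding,
    embedding_exclude_overlap=pretrained_pipeline.embedding_exclude_overlap,
    # Not a typo: the pipeline stores the clustering method name as `klustering`
    clustering=pretrained_pipeline.klustering,
)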
# Load fine-tuned params into the pipeline
finetuned_pipeline.load_params(model_path + "/config.yaml")
GPU usage
if torch.cuda.is_available():
    gpu = torch.device("cuda")
    finetuned_pipeline.to(gpu)
    print("gpu: ", torch.cuda.get_device_name(gpu))
Visualise diarization output
diarization = finetuned_pipeline("path/to/audio.wav")
diarization
View speaker turns, speaker ID, and time
for speech_turn, track, speaker in diarization.itertracks(yield_label=True):
    print(f"{speech_turn.start:4.1f} {speech_turn.end:4.1f} {speaker}")
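The turns printed above can also be summarised per speaker. The sketch below sums speaking time from plain (start, end, speaker) tuples, so it runs without pyannote.audio. The helper speaking_time and the example turns are hypothetical and made up for illustration; they are not output from this model. With a real result, the tuples could be built as [(t.start, t.end, s) for t, _, s in diarization.itertracks(yield_label=True)].

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum the duration in seconds of each speaker's turns.

    turns: iterable of (start, end, speaker) tuples, with times in seconds.
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical turns for illustration only (not real model output).
example_turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.5, 4.0, "SPEAKER_01"),
    (4.0, 7.0, "SPEAKER_00"),
]

for speaker, seconds in sorted(speaking_time(example_turns).items()):
    print(f"{speaker} {seconds:5.1f}")
```

Overlapping speech is counted once for each speaker involved, so the totals can add up to more than the length of the recording.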
Citations
@inproceedings{Plaquet23,
author={Alexis Plaquet and Hervé Bredin},
title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
@inproceedings{Bredin23,
author={Hervé Bredin},
title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}