This repository contains the AV_MossFormer2_TSE_16K model weights for 16 kHz audio-visual target speaker extraction, used in the ClearerVoice-Studio repo.
The model is trained on large-scale open-source datasets.
It extracts each speaker's voice from a multi-speaker video using facial recognition.