
Model Information

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on facebook/wav2vec2-xls-r-300m and was pre-trained on a subset of the Common Voice Corpus 10.0 transcribed with eSpeak NG.

| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| --- | --- | --- | --- | --- |
| Multitask | 45.62% | 19.44% | 34.34% | 8.36% |
| Hierarchical | 46.09% | 19.18% | 34.35% | 8.56% |
| Multitask Shared | 46.05% | 19.52% | 41.20% | 8.88% |
| Baseline Shared | 48.25% | - | 45.35% | - |
| Baseline | 57.01% | - | 46.95% | - |

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
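
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how an error rate such as PER is commonly computed: the edit distance between the reference and the recognized sequence, divided by the reference length. It assumes the usual Levenshtein-based definition; the helper names `edit_distance` and `error_rate` are illustrative and not part of the allophant package, and the evaluation behind the reported numbers may differ in detail.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    # previous[j] holds the distance between reference[:i - 1] and hypothesis[:j]
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = ["o", "l", "a"]
hypothesis = ["o", "l", "e", "a"]  # one inserted phoneme
print(f"{error_rate(reference, hypothesis):.2%}")  # 33.33%
```

The same function applies to sequences of phonetic feature values instead of phonemes, which is presumably how the AER columns should be read.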

Usage

Install the allophant package:

pip install allophant

A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
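
When a custom inventory (option 3) is assembled from several phone lists, duplicates can slip in. The sketch below merges plain Python lists while keeping the first occurrence of each phone. `merge_inventories` is a hypothetical helper and not part of allophant; whether the library tolerates duplicate phones is not stated here, so deduplicating first is simply a cautious default.

```python
from typing import Iterable, List


def merge_inventories(*inventories: Iterable[str]) -> List[str]:
    """Concatenate phone lists, keeping the first occurrence of each phone."""
    # dict preserves insertion order, so this deduplicates without reordering
    return list(dict.fromkeys(phone for inventory in inventories for phone in inventory))


vowels = ["a", "e", "o", "ai̯", "au̯"]
consonants = ["b", "f", "ɡ", "l", "ʎ", "b"]  # "b" is repeated on purpose
custom_inventory = merge_inventories(vowels, consonants)
print(custom_inventory)
```

The resulting list can be used wherever `inventory` appears in the remaining steps.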

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
  batch.to(device),
  attribute_indexer.composition_feature_matrix(inventory).to(device)
)

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
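
To illustrate why the token indices are shifted by one, here is a simplified greedy (best-path) CTC decoder in plain Python. It is only a sketch: it assumes the `<BLANK>` token sits at index 0, and the decoders returned by `predictions.feature_decoders` may use a different search strategy. The function `greedy_ctc_decode` and the toy inventory are made up for this example.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the <BLANK> token


def greedy_ctc_decode(log_probs: Sequence[Sequence[float]]) -> List[int]:
    """Best-path CTC decoding for one utterance given per-frame log probabilities."""
    # Pick the most likely class in every frame
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in log_probs]
    # Merge consecutive repeats, then drop blanks
    collapsed = [token for token, _ in groupby(best_path)]
    return [token for token in collapsed if token != BLANK]


toy_inventory = ["a", "b", "e"]  # hypothetical three-phone inventory
# Five frames over four classes: <BLANK>, "a", "b", "e"
log_probs = [
    [-2.0, -0.1, -3.0, -3.0],  # a
    [-2.0, -0.2, -3.0, -3.0],  # a (repeat, merged with the previous frame)
    [-0.1, -2.0, -3.0, -3.0],  # <BLANK>
    [-3.0, -0.3, -2.0, -3.0],  # a (new token, separated by the blank)
    [-3.0, -3.0, -0.2, -2.0],  # b
]
tokens = greedy_ctc_decode(log_probs)
# Shift by one to map decoder indices back into the inventory
phones = [toy_inventory[token - 1] for token in tokens]
print(phones)  # ['a', 'a', 'b']
```

Because index 0 is reserved for the blank, every real class index is one higher than its position in the inventory, which is what the `- 1` in the decoding loop compensates for.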

Citation

@inproceedings{glocker2023allophant,
    title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
    author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
    year={2023},
    booktitle={{Proc. Interspeech 2023}},
    month={8}}
