metadata
license: apache-2.0
language:
- en
Model Details
This is a lightweight audio captioning model with an EfficientNet-B2 encoder and a two-layer Transformer decoder. The model is trained on Clotho and on unlabeled audio from Freesound.
Dependencies
Install the following dependencies to run the model:
pip install numpy torch torchaudio einops transformers efficientnet_pytorch
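To confirm that the packages above are importable before loading the model, a small check along these lines may help. It is only a sketch: the helper name `missing_packages` is hypothetical, and it assumes each package's import name matches the name given to `pip install`.

```python
import importlib.util

# Import names assumed to match the pip package names listed above.
REQUIRED = [
    "numpy",
    "torch",
    "torchaudio",
    "einops",
    "transformers",
    "efficientnet_pytorch",
]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```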
Usage
import torch
from transformers import AutoModel, PreTrainedTokenizerFast
import torchaudio
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# use the model trained on Clotho
model = AutoModel.from_pretrained(
    "wsntxxn/effb2-trm-clotho-captioning",
    trust_remote_code=True
).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "wsntxxn/clotho-simple-tokenizer"
)
# inference on a single audio clip
wav, sr = torchaudio.load("/path/to/file.wav")
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
if wav.size(0) > 1:
    # downmix multi-channel audio to mono, keeping the shape (1, samples)
    wav = wav.mean(0).unsqueeze(0)
with torch.no_grad():
    word_idxs = model(
        audio=wav.to(device),  # keep the input on the same device as the model
        audio_length=[wav.size(1)],
    )
caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
print(caption)
# inference on a batch
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]
wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)
with torch.no_grad():
    word_idxs = model(
        audio=wav_batch.to(device),  # keep the input on the same device as the model
        audio_length=[wav1.size(0), wav2.size(0)],
    )
captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
print(captions)
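The batch example above repeats the same downmix step for each file. For more than two clips, that step and the padding could be folded into one helper, sketched below. The function name `make_mono_batch` is hypothetical and not part of the model's API, and the sketch assumes every waveform has already been resampled to `model.config.sample_rate` and has shape `(channels, samples)`, as returned by `torchaudio.load`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def make_mono_batch(waveforms):
    """Downmix (channels, samples) tensors to mono and zero-pad them into one batch.

    Returns the padded (batch, max_samples) tensor and the unpadded lengths.
    """
    # Average the channels of multi-channel clips; drop the channel axis otherwise.
    mono = [w.mean(0) if w.size(0) > 1 else w[0] for w in waveforms]
    # Record the true lengths before padding so they can be passed as audio_length.
    lengths = [m.size(0) for m in mono]
    batch = pad_sequence(mono, batch_first=True)
    return batch, lengths
```

With this helper, the batch call might read `wav_batch, lengths = make_mono_batch([wav1, wav2])`, where `wav1` and `wav2` are the resampled tensors before downmixing, followed by `model(audio=wav_batch, audio_length=lengths)`.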