---
license: apache-2.0
language:
- en
---

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
[![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)

# Model Details

This is a lightweight audio captioning model with an EfficientNet-B2 encoder and a two-layer Transformer decoder. The model is trained on Clotho and unlabeled Freesound data.

# Dependencies

Install the required dependencies to run the model:

```bash
pip install numpy torch torchaudio einops transformers efficientnet_pytorch
```
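
To confirm that the installation worked before loading the model, a quick import check can help. The snippet below is a minimal sketch using only the standard library; `REQUIRED` and `missing_packages` are hypothetical names introduced here for illustration, and the list assumes each package's import name matches its `pip` name above.

```python
import importlib.util

# Import names of the packages installed above (assumed to match the pip names).
REQUIRED = [
    "numpy",
    "torch",
    "torchaudio",
    "einops",
    "transformers",
    "efficientnet_pytorch",
]


def missing_packages(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```

This only checks that the modules can be located; it does not verify versions or GPU support.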

# Usage

```python
import torch
import torchaudio
from transformers import AutoModel, PreTrainedTokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# use the model trained on Clotho
model = AutoModel.from_pretrained(
    "wsntxxn/effb2-trm-clotho-captioning",
    trust_remote_code=True
).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "wsntxxn/clotho-simple-tokenizer"
)

# inference on a single audio clip
wav, sr = torchaudio.load("/path/to/file.wav")
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
# downmix multi-channel audio to mono, keeping shape (1, num_samples)
if wav.size(0) > 1:
    wav = wav.mean(0).unsqueeze(0)

with torch.no_grad():
    word_idxs = model(
        audio=wav.to(device),  # keep the input on the same device as the model
        audio_length=[wav.size(1)],
    )

caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
print(caption)

# inference on a batch
# each clip is resampled and reduced to a 1-D mono tensor of shape (num_samples,)
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# zero-pad the shorter clip so both fit in one (batch, max_samples) tensor
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)

with torch.no_grad():
    word_idxs = model(
        audio=wav_batch.to(device),  # keep the input on the same device as the model
        audio_length=[wav1.size(0), wav2.size(0)],  # unpadded lengths
    )

captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
print(captions)
```
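
The batch example repeats the same downmix-and-pad steps for every clip. As a minimal sketch, those steps could be wrapped in a small helper. `collate_waveforms` is a hypothetical name, not part of the model's API, and it assumes each input is a `(channels, num_samples)` tensor that has already been resampled to `model.config.sample_rate`. Synthetic tensors stand in for loaded audio so the snippet runs without any files.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_waveforms(waveforms):
    """Downmix (channels, num_samples) tensors to mono and zero-pad them into a batch.

    Hypothetical helper, not part of the model's API. Returns the padded batch
    of shape (batch, max_samples) and the list of unpadded lengths.
    """
    # Averaging over the channel axis; a single-channel clip keeps its values.
    mono = [wav.mean(0) for wav in waveforms]
    lengths = [wav.size(0) for wav in mono]
    # Shorter clips are right-padded with zeros up to the longest clip.
    batch = pad_sequence(mono, batch_first=True)
    return batch, lengths


# Synthetic stand-ins for two loaded clips: one stereo, one mono.
clips = [torch.randn(2, 32000), torch.randn(1, 16000)]
wav_batch, audio_length = collate_waveforms(clips)
print(wav_batch.shape, audio_length)
```

Assuming `model` and `device` are set up as in the usage snippet, the results could then be passed as `model(audio=wav_batch.to(device), audio_length=audio_length)`.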