|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
datasets: |
|
- jp1924/AudioCaps |
|
language: |
|
- en |
|
pipeline_tag: audio-classification |
|
--- |
|
|
|
[![arXiv](https://img.shields.io/badge/arXiv-2401.02584-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2401.02584) |
|
|
|
# Model Details |
|
|
|
This is a text-to-audio grounding model. |
|
Given an audio clip and a text prompt describing a sound event, the model predicts the probability that the event is present at each frame, with a time resolution of 40 ms (25 frames per second).
|
|
|
It is trained on [AudioCaps](https://github.com/cdjkim/audiocaps). |
|
The architecture is simple: a Cnn8Rnn audio encoder combined with a single-embedding-layer text encoder.
|
|
|
# Usage |
|
```python
import torch
import torchaudio
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-w2vmean-audiocaps-grounding",
    trust_remote_code=True
).to(device)

# Load each clip, resample to the model's sample rate, and convert to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Pad the clips to the same length and batch them
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)

# One text prompt per audio clip
text = ["a man speaks", "a dog is barking"]

with torch.no_grad():
    output = model(
        audio=wav_batch,
        audio_len=[wav1.size(0), wav2.size(0)],
        text=text
    )
    # output: (2, n_seconds * 25) frame-level probabilities, 25 frames per second
```
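To turn the frame-level probabilities into time-stamped detections, you can threshold the output and merge consecutive active frames. The sketch below is a minimal post-processing example, not part of the model repository: `probs_to_segments` is a hypothetical helper, the 0.5 threshold is an arbitrary choice, and the 0.04 s frame shift follows from the 25 frames-per-second output described above. It assumes `output` is the `(batch, frames)` probability tensor from the snippet above.

```python
def probs_to_segments(frame_probs, threshold=0.5, frame_shift=0.04):
    """Convert per-frame probabilities into (onset, offset) pairs in seconds."""
    active = (frame_probs > threshold).tolist()
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            # segment begins at this frame
            start = i
        elif not is_active and start is not None:
            # segment ends just before this frame
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        # segment runs to the end of the clip
        segments.append((start * frame_shift, len(active) * frame_shift))
    return segments

# e.g. segments for the first (audio, text) pair in the batch
print(probs_to_segments(output[0].cpu()))
```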
|
|
|
# Citation |
|
```bibtex
@article{xu2024towards,
  title={Towards Weakly Supervised Text-to-Audio Grounding},
  author={Xu, Xuenan and Ma, Ziyang and Wu, Mengyue and Yu, Kai},
  journal={arXiv preprint arXiv:2401.02584},
  year={2024}
}
```