---
library_name: transformers
license: apache-2.0
datasets:
- jp1924/AudioCaps
language:
- en
pipeline_tag: audio-classification
---
[![arXiv](https://img.shields.io/badge/arXiv-2401.02584-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2401.02584)
# Model Details
This is a text-to-audio grounding model.
Given an audio clip and a text prompt describing a sound event, the model predicts frame-level probabilities of the event's presence, with a time resolution of 40 ms.
It is trained on [AudioCaps](https://github.com/cdjkim/audiocaps).
It uses a simple architecture: a Cnn8Rnn audio encoder and a single-embedding-layer text encoder.
# Usage
```python
import torch
import torchaudio
from transformers import AutoModel
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-w2vmean-audiocaps-grounding",
    trust_remote_code=True
).to(device)
# load each waveform, resample to the model's sample rate, and convert to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]
# pad the clips into a batch; each clip is paired with its own text query
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)
text = ["a man speaks", "a dog is barking"]
with torch.no_grad():
    output = model(
        audio=wav_batch,
        audio_len=[wav1.size(0), wav2.size(0)],
        text=text
    )
# output: frame-level probabilities with shape (2, n_seconds * 25), i.e. one value per 40 ms frame
```
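The frame-level probabilities can be turned into time-stamped detections with simple post-processing. Below is a minimal sketch that continues from the snippet above; the `probs_to_segments` helper and the 0.5 threshold are illustrative choices, not part of the model's API:
```python
# Minimal sketch (assumed post-processing, not part of the model's API):
# binarize the frame probabilities and merge consecutive positive frames
# into (onset, offset) segments in seconds. Each frame covers 40 ms.
def probs_to_segments(probs, threshold=0.5, frame_shift=0.04):
    segments, onset = [], None
    for i, p in enumerate(probs.tolist()):
        if p >= threshold and onset is None:
            onset = i
        elif p < threshold and onset is not None:
            segments.append((onset * frame_shift, i * frame_shift))
            onset = None
    if onset is not None:
        segments.append((onset * frame_shift, len(probs) * frame_shift))
    return segments

for i, query in enumerate(text):
    # note: frames beyond a clip's true duration come from batch padding and can be ignored
    print(query, probs_to_segments(output[i]))
```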
# Citation
```bibtex
@article{xu2024towards,
  title={Towards Weakly Supervised Text-to-Audio Grounding},
  author={Xu, Xuenan and Ma, Ziyang and Wu, Mengyue and Yu, Kai},
  journal={arXiv preprint arXiv:2401.02584},
  year={2024}
}
```