|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
datasets: |
|
- jp1924/AudioCaps |
|
language: |
|
- en |
|
pipeline_tag: audio-classification |
|
--- |
|
|
|
[![arXiv](https://img.shields.io/badge/arXiv-2401.02584-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2401.02584) |
|
|
|
# Model Details |
|
|
|
This is a text-to-audio grounding model. |
|
Given an audio clip and a text prompt describing a sound event, the model predicts the probability that the event is present at each frame, with a time resolution of 40 ms (25 frames per second).
|
|
|
It is trained on [AudioCaps](https://github.com/cdjkim/audiocaps). |
|
The architecture is simple: a Cnn8Rnn audio encoder combined with a single-embedding-layer text encoder.
|
|
|
# Usage |
|
```python
import torch
import torchaudio
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-w2vmean-audiocaps-grounding",
    trust_remote_code=True
).to(device)

# Load each clip, resample to the model's sample rate, and convert to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Pad the clips to the same length and batch them
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)

# One text prompt per audio clip
text = ["a man speaks", "a dog is barking"]

with torch.no_grad():
    output = model(
        audio=wav_batch,
        audio_len=[wav1.size(0), wav2.size(0)],
        text=text
    )
    # output: (2, n_seconds * 25) frame-level probabilities, 25 frames per second
```
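To turn the frame-level probabilities into time-stamped detections, you can threshold the output and merge consecutive active frames. The sketch below is a minimal post-processing example, not part of the model repository: `probs_to_segments` is a hypothetical helper, the 0.5 threshold is an arbitrary choice, and the 0.04 s frame shift follows from the 25 frames-per-second output described above. It assumes `output` is the `(batch, frames)` probability tensor from the snippet above.

```python
def probs_to_segments(frame_probs, threshold=0.5, frame_shift=0.04):
    """Convert per-frame probabilities into (onset, offset) pairs in seconds."""
    active = (frame_probs > threshold).tolist()
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            # segment begins at this frame
            start = i
        elif not is_active and start is not None:
            # segment ends just before this frame
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        # segment runs to the end of the clip
        segments.append((start * frame_shift, len(active) * frame_shift))
    return segments

# e.g. segments for the first (audio, text) pair in the batch
print(probs_to_segments(output[0].cpu()))
```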
|
|
|
# Citation |
|
```bibtex
@article{xu2024towards,
  title={Towards Weakly Supervised Text-to-Audio Grounding},
  author={Xu, Xuenan and Ma, Ziyang and Wu, Mengyue and Yu, Kai},
  journal={arXiv preprint arXiv:2401.02584},
  year={2024}
}
```