wsntxxn/effb2-trm-clotho-captioning


	---
	license: apache-2.0
	language:
	- en
	---
	[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
	[![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)

	# Model Details
	This is a lightweight audio captioning model with an EfficientNet-B2 encoder and a two-layer Transformer decoder. It is trained on Clotho and on unlabeled audio from Freesound.

	# Dependencies
	Install the following dependencies to run the model:
	```bash
	pip install numpy torch torchaudio einops transformers efficientnet_pytorch
	```
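Before loading the model, it can help to confirm that every package from the command above is importable. The following is a minimal sketch, assuming each package's import name matches its pip name; `REQUIRED` and `missing_packages` are hypothetical names used only for illustration and are not part of this repository.

```python
import importlib.util

# Import names of the packages listed in the pip command above
# (assumption: import name == pip name for each of them).
REQUIRED = [
    "numpy",
    "torch",
    "torchaudio",
    "einops",
    "transformers",
    "efficientnet_pytorch",
]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check only looks up each top-level module without importing it, so it stays fast even for heavy packages such as `torch`.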

	# Usage
	```python
	import torch
	from transformers import AutoModel, PreTrainedTokenizerFast
	import torchaudio


	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# use the model trained on Clotho
	model = AutoModel.from_pretrained(
	    "wsntxxn/effb2-trm-clotho-captioning",
	    trust_remote_code=True
	).to(device)
	tokenizer = PreTrainedTokenizerFast.from_pretrained(
	    "wsntxxn/clotho-simple-tokenizer"
	)

	# inference on a single audio clip
	wav, sr = torchaudio.load("/path/to/file.wav")
	wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
	if wav.size(0) > 1:
	    wav = wav.mean(0).unsqueeze(0)

	with torch.no_grad():
	    word_idxs = model(
	        audio=wav,
	        audio_length=[wav.size(1)],
	    )

	caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
	print(caption)

	# inference on a batch
	wav1, sr1 = torchaudio.load("/path/to/file1.wav")
	wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
	wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

	wav2, sr2 = torchaudio.load("/path/to/file2.wav")
	wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
	wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

	wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)

	with torch.no_grad():
	    word_idxs = model(
	        audio=wav_batch,
	        audio_length=[wav1.size(0), wav2.size(0)],
	    )

	captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
	print(captions)
	```
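The batch example above repeats the same steps for each file: downmix to mono, pad to a common length, and record the unpadded lengths. For more than two clips, those steps can be wrapped in small helpers. This is a sketch under stated assumptions, not part of the model's API: `to_mono`, `pad_batch`, and `caption_batch` are hypothetical names, and the waveforms are assumed to be already resampled to `model.config.sample_rate`.

```python
import torch


def to_mono(wav):
    """Collapse a [channels, samples] waveform to a 1-D [samples] tensor."""
    return wav.mean(0) if wav.size(0) > 1 else wav[0]


def pad_batch(wavs):
    """Zero-pad 1-D waveforms to a common length.

    Returns the padded [batch, max_samples] tensor and the original lengths.
    """
    lengths = [w.size(0) for w in wavs]
    batch = torch.nn.utils.rnn.pad_sequence(wavs, batch_first=True)
    return batch, lengths


def caption_batch(model, tokenizer, wavs):
    """Caption a list of [channels, samples] waveforms.

    Assumes every waveform is already at model.config.sample_rate and that
    `model` and `tokenizer` were loaded as in the usage example above.
    """
    batch, lengths = pad_batch([to_mono(w) for w in wavs])
    with torch.no_grad():
        word_idxs = model(audio=batch, audio_length=lengths)
    return tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
```

Passing the unpadded lengths as `audio_length` mirrors the batch example above, where each clip's true length is supplied alongside the zero-padded tensor.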