---
language:
- en
---
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wsntxxn/efficient_audio_captioning)
[![arXiv](https://img.shields.io/badge/arXiv-2407.14329-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2407.14329)

# Model Details

This is a lightweight audio captioning model with an EfficientNet-B2 encoder and a two-layer Transformer decoder. The model is trained on AudioCaps and unlabeled AudioSet.

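If you want to verify how lightweight the checkpoint is, one option is to count its parameters after loading it. This is a quick sketch using standard PyTorch; the checkpoint name is the same one used in the Usage section below:

```python
from transformers import AutoModel

# load the checkpoint used throughout this README
model = AutoModel.from_pretrained(
    "wsntxxn/effb2-trm-audiocaps-captioning",
    trust_remote_code=True,
)
# sum the sizes of all parameter tensors
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```
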
# Dependencies

Install the following dependencies to run the model:
```bash
pip install numpy torch torchaudio einops transformers efficientnet_pytorch
```

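If the installation succeeded, the following imports should run without errors (a quick environment check, not part of the original instructions):

```python
# verify the dependencies import cleanly and print the key versions
import numpy
import torch
import torchaudio
import einops
import transformers
import efficientnet_pytorch

print(torch.__version__, torchaudio.__version__, transformers.__version__)
```
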
# Usage

```python
import torch
import torchaudio
from transformers import AutoModel, PreTrainedTokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# use the model trained on AudioCaps
model = AutoModel.from_pretrained(
    "wsntxxn/effb2-trm-audiocaps-captioning",
    trust_remote_code=True
).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "wsntxxn/audiocaps-simple-tokenizer"
)

# inference on a single audio clip
wav, sr = torchaudio.load("/path/to/file.wav")
# resample to the rate the model was trained on
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
if wav.size(0) > 1:
    # downmix multi-channel audio to a single mono channel
    wav = wav.mean(0).unsqueeze(0)
with torch.no_grad():
    word_idxs = model(
        audio=wav.to(device),
        audio_length=[wav.size(1)],
    )
caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
print(caption)

# inference on a batch
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# pad the shorter clip so both fit into one batch tensor;
# the true lengths are passed separately via audio_length
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)
with torch.no_grad():
    word_idxs = model(
        audio=wav_batch.to(device),
        audio_length=[wav1.size(0), wav2.size(0)],
    )
captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
print(captions)
```
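
For many files, the batched path above can be wrapped in a small helper. The sketch below is illustrative, not part of the model's API: the function name `caption_files` and the batching strategy are our own, and it assumes the `model`, `tokenizer`, and `device` objects created above:

```python
from typing import List

def caption_files(paths: List[str], batch_size: int = 8) -> List[str]:
    """Caption a list of audio files; hypothetical helper, not part of the model API."""
    captions = []
    for start in range(0, len(paths), batch_size):
        wavs = []
        for path in paths[start:start + batch_size]:
            wav, sr = torchaudio.load(path)
            wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
            # downmix to mono, dropping the channel dimension
            wavs.append(wav.mean(0) if wav.size(0) > 1 else wav[0])
        batch = torch.nn.utils.rnn.pad_sequence(wavs, batch_first=True)
        with torch.no_grad():
            word_idxs = model(
                audio=batch.to(device),
                audio_length=[w.size(0) for w in wavs],
            )
        captions.extend(tokenizer.batch_decode(word_idxs, skip_special_tokens=True))
    return captions

print(caption_files(["/path/to/file1.wav", "/path/to/file2.wav"]))
```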