Descript Audio Codec (DAC)
DAC is a state-of-the-art audio tokenizer that improves on earlier tokenizers such as SoundStream and EnCodec.
This model card provides an easy-to-use API for a pretrained DAC [1] for 16 kHz audio; the backbone and pretrained weights come from the original repository. With this API, encoding and decoding each take a single line of code, on either CPU or GPU. It also supports chunk-based processing, which keeps memory usage low and matters most on GPU.
Model variations
Three model variants are available, one for each input audio sampling rate.
Model | Input audio sampling rate [kHz] |
---|---|
hance-ai/descript-audio-codec-44khz | 44.1 |
hance-ai/descript-audio-codec-24khz | 24 |
hance-ai/descript-audio-codec-16khz | 16 |
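To pick a variant programmatically, a small lookup keyed by sampling rate is one option. The sketch below is illustrative only: the repository IDs are taken from the table above, while the `DAC_REPOS` mapping and the `repo_for_sample_rate` helper are hypothetical names that are not part of this model's API.

```python
# Repository IDs as listed in the table above, keyed by sampling rate in Hz.
# The mapping and helper are hypothetical conveniences, not part of the model API.
DAC_REPOS = {
    44100: "hance-ai/descript-audio-codec-44khz",
    24000: "hance-ai/descript-audio-codec-24khz",
    16000: "hance-ai/descript-audio-codec-16khz",
}


def repo_for_sample_rate(sample_rate_hz: int) -> str:
    """Return the repository ID of the variant matching the given sampling rate."""
    try:
        return DAC_REPOS[sample_rate_hz]
    except KeyError:
        supported = ", ".join(str(sr) for sr in sorted(DAC_REPOS))
        raise ValueError(
            f"No DAC variant for {sample_rate_hz} Hz; supported rates: {supported}"
        ) from None
```

The returned string can be passed as the first argument to `AutoModel.from_pretrained`. Audio at any other sampling rate would need to be resampled first, or handled by the model if it resamples internally, which this card does not state.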
Dependency
See requirements.txt.
Usage
Load
from transformers import AutoModel
# device setting
device = 'cpu' # or 'cuda:0'
# load
model = AutoModel.from_pretrained('hance-ai/descript-audio-codec-16khz', trust_remote_code=True)
model.to(device)
Encode
audio_filename = 'path/example_audio.wav'
zq, s = model.encode(audio_filename)
zq holds the discrete embeddings, with shape (1, num_RVQ_codebooks, token_length), and s is the token sequence, with shape (1, num_RVQ_codebooks, token_length).
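As a quick sanity check on an encoding, the shape of the token sequence can be turned into a few summary numbers. This is a minimal sketch that assumes only the shape stated above; `token_stats` is a hypothetical helper, and the audio duration has to be supplied by the caller because this card does not describe a way to query it from the model.

```python
def token_stats(s, duration_seconds: float) -> dict:
    """Summarize a token tensor shaped (1, num_RVQ_codebooks, token_length).

    `s` only needs a `.shape` attribute, such as the token tensor returned by
    model.encode(). `duration_seconds` is the length of the source audio.
    """
    _, num_codebooks, token_length = s.shape
    frames_per_second = token_length / duration_seconds
    return {
        "num_codebooks": num_codebooks,
        "token_length": token_length,
        # One frame is one time step; each frame carries one token per codebook.
        "frames_per_second": frames_per_second,
        "tokens_per_second": frames_per_second * num_codebooks,
    }
```

For example, `token_stats(s, duration_seconds=10.0)` reports how many token frames the codec emits per second of audio for a 10-second clip.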
Decode
# decoding from `zq`
waveform = model.decode(zq=zq) # (1, 1, audio_length); the output has a mono channel.
# decoding from `s`
waveform = model.decode(s=s) # (1, 1, audio_length); the output has a mono channel.
Save a waveform as an audio file
model.waveform_to_audiofile(waveform, 'out.wav')
Save and load tokens
model.save_tensor(s, 'tokens.pt')
loaded_s = model.load_tensor('tokens.pt')
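The steps above can be chained into one round trip: encode a file, store its tokens, then rebuild the audio from the stored tokens alone. The sketch below uses only the methods shown in this card (`encode`, `save_tensor`, `load_tensor`, `decode`, `waveform_to_audiofile`); the wrapper function `tokenize_and_reconstruct` and its parameter names are hypothetical.

```python
def tokenize_and_reconstruct(model, audio_in, tokens_path, audio_out):
    """Encode a file, persist its tokens, then rebuild audio from the saved tokens."""
    # encode() returns (zq, s); only the token sequence `s` is kept here.
    _, s = model.encode(audio_in)
    model.save_tensor(s, tokens_path)

    # Reload from disk so the decode step depends only on the saved tokens.
    loaded_s = model.load_tensor(tokens_path)
    waveform = model.decode(s=loaded_s)  # (1, 1, audio_length); mono output
    model.waveform_to_audiofile(waveform, audio_out)
    return loaded_s


# Example call, assuming `model` was loaded as shown in the Load section:
# tokenize_and_reconstruct(model, 'path/example_audio.wav', 'tokens.pt', 'out.wav')
```

Storing s rather than zq keeps the saved file to the token sequence only, and the decode step above shows that s alone is enough to reconstruct the waveform.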
References
[1] Kumar, Rithesh, et al. "High-fidelity audio compression with improved RVQGAN." Advances in Neural Information Processing Systems 36 (2024).