---
license: cc-by-nc-nd-4.0
---

# AudioLDM 2 Large

AudioLDM 2 is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input.
It is available in the 🧨 Diffusers library from v0.21.0 onwards.

# Model Details

AudioLDM 2 was proposed in the paper [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al.

AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects,
human speech and music.

# Checkpoint Details

This is the original, **large** version of the AudioLDM 2 model, also referred to as **audioldm2-full-large-1150k**.

There are three official AudioLDM 2 checkpoints. Two of these checkpoints are applicable to the general task of text-to-audio
generation. The third checkpoint is trained exclusively on text-to-music generation. All checkpoints share the same
model size for the text encoders and VAE. They differ in the size and depth of the UNet. See the table below for details on
the three official checkpoints:

| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h |
|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------|
| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 350M | 1.1B | 1150k |
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M | 1.5B | 1150k |
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M | 1.1B | 665k |

## Model Sources

- [**Original Repository**](https://github.com/haoheliu/audioldm2)
- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2)
- [**Paper**](https://arxiv.org/abs/2308.05734)
- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)

# Usage

First, install the required packages (`scipy` is used below to save the generated audio):

```
pip install --upgrade diffusers transformers scipy
```

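AudioLDM 2 requires Diffusers v0.21.0 or later, as noted above. As a minimal sanity check, the installed version can be inspected before running the examples:

```python
import diffusers

# AudioLDM 2 support is available from v0.21.0 onwards
print(diffusers.__version__)
```
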
## Text-to-Audio

For text-to-audio generation, the [AudioLDM2Pipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2) can be
used to load pre-trained weights and generate text-conditional audio outputs:

```python
from diffusers import AudioLDM2Pipeline
import torch

repo_id = "cvssp/audioldm2-large"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "The sound of a hammer hitting a wooden surface"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
```

The resulting audio output can be saved as a .wav file:
```python
import scipy

scipy.io.wavfile.write("hammer.wav", rate=16000, data=audio)
```

Or displayed in a Jupyter Notebook / Google Colab:
```python
from IPython.display import Audio

Audio(audio, rate=16000)
```

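The other two checkpoints from the table above can be loaded the same way; only the repository id passed to `from_pretrained` changes. A minimal sketch, swapping in the text-to-music checkpoint (the prompt string here is purely illustrative):

```python
# load the text-to-music checkpoint with the same pipeline class
music_pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-music", torch_dtype=torch.float16)
music_pipe = music_pipe.to("cuda")

# illustrative music prompt
music_audio = music_pipe(
    "An upbeat electronic track with a driving drum beat",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]
```
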
## Tips

Prompts:
* Descriptive prompt inputs work best: you can use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.

Inference:
* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.

When evaluating generated waveforms:

* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.

The following example demonstrates how to construct a good audio generation using the aforementioned tips:

```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

# load the pipeline
repo_id = "cvssp/audioldm2-large"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "The sound of a hammer hitting a wooden surface"
negative_prompt = "Low quality."

# set the seed
generator = torch.Generator("cuda").manual_seed(0)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("hammer.wav", rate=16000, data=audio[0])
```
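Since `num_waveforms_per_prompt=3`, `audio` holds three candidate waveforms ranked from best to worst by the automatic scoring described above. As a minimal sketch (file names are illustrative), the remaining candidates can also be written out for manual comparison:

```python
# save every ranked candidate; index 0 is the highest-scoring waveform
for i, waveform in enumerate(audio):
    scipy.io.wavfile.write(f"hammer_candidate_{i}.wav", rate=16000, data=waveform)
```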

# Citation

**BibTeX:**
```
@article{liu2023audioldm2,
  title={AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining},
  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
  journal={arXiv preprint arXiv:2308.05734},
  year={2023}
}
```