aiqcamp committed (verified)
Commit 82a93d6 · 1 Parent(s): 21eb743

Update README.md

Files changed (1)
  1. README.md +3 -156
README.md CHANGED
@@ -1,163 +1,10 @@
  ---
- title: MMAudio — generating synchronized audio from video/text
- emoji: 🔊
  colorFrom: blue
  colorTo: indigo
  sdk: gradio
  app_file: app.py
  pinned: false
  ---
-
-
- # [Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis](https://hkchengrex.github.io/MMAudio)
-
- [Ho Kei Cheng](https://hkchengrex.github.io/), [Masato Ishii](https://scholar.google.co.jp/citations?user=RRIO1CcAAAAJ), [Akio Hayakawa](https://scholar.google.com/citations?user=sXAjHFIAAAAJ), [Takashi Shibuya](https://scholar.google.com/citations?user=XCRO260AAAAJ), [Alexander Schwing](https://www.alexander-schwing.de/), [Yuki Mitsufuji](https://www.yukimitsufuji.com/)
-
- University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation
-
-
- [[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
-
-
- **Note: This repository is still under construction. Single-example inference should work as expected. The training code will be added. Code is subject to non-backward-compatible changes.**
-
- ## Highlight
-
- MMAudio generates synchronized audio given video and/or text inputs.
- Our key innovation is multimodal joint training, which allows training on a wide range of audio-visual and audio-text datasets.
- Moreover, a synchronization module aligns the generated audio with the video frames.
-
-
- ## Results
-
- (All audio generated by our algorithm, MMAudio)
-
- Videos from Sora:
-
- https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
-
-
- Videos from MovieGen/Hunyuan Video/VGGSound:
-
- https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
-
- For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
-
- ## Installation
-
- We have only tested this on Ubuntu.
-
- ### Prerequisites
-
- We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
-
- - Python 3.8+
- - PyTorch **2.5.1+** and the corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/)
- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies); you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`)
-
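A minimal sketch of one way to satisfy these prerequisites in a fresh miniforge environment. The environment name, Python version, and CUDA build below are illustrative assumptions, not values from the README; pick the exact PyTorch install command for your CUDA version from https://pytorch.org/.

```bash
# Illustrative environment setup; names and versions are assumptions.
conda create -n mmaudio python=3.10 -y
conda activate mmaudio

# ffmpeg<7 is required by torchaudio (see the prerequisite above).
conda install -c conda-forge 'ffmpeg<7' -y

# Install PyTorch 2.5.1+ with torchvision/torchaudio.
# Replace the index URL with the one pytorch.org suggests for your CUDA version.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```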
- **Clone our repository:**
-
- ```bash
- git clone https://github.com/hkchengrex/MMAudio.git
- ```
-
- **Install with pip:**
-
- ```bash
- cd MMAudio
- pip install -e .
- ```
-
- (If you encounter a `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
-
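As a quick sanity check after the editable install, you can try importing the package. The `mmaudio` module name is an assumption based on the `mmaudio/utils/download_utils.py` path mentioned below, not something the README states explicitly.

```bash
# Hypothetical sanity check: confirm the editable install is importable.
python -c "import mmaudio; print(mmaudio.__file__)"
```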
- **Pretrained models:**
-
- The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
-
- | Model | Download link | File size |
- | -------- | ------- | ------- |
- | Flow prediction network, small 16kHz | <a href="https://databank.illinois.edu/datafiles/k6jve/download" download="mmaudio_small_16k.pth">mmaudio_small_16k.pth</a> | 601M |
- | Flow prediction network, small 44.1kHz | <a href="https://databank.illinois.edu/datafiles/864ya/download" download="mmaudio_small_44k.pth">mmaudio_small_44k.pth</a> | 601M |
- | Flow prediction network, medium 44.1kHz | <a href="https://databank.illinois.edu/datafiles/pa94t/download" download="mmaudio_medium_44k.pth">mmaudio_medium_44k.pth</a> | 2.4G |
- | Flow prediction network, large 44.1kHz **(recommended)** | <a href="https://databank.illinois.edu/datafiles/4jx76/download" download="mmaudio_large_44k.pth">mmaudio_large_44k.pth</a> | 3.9G |
- | 16kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-16.pth">v1-16.pth</a> | 655M |
- | 16kHz BigVGAN vocoder | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/best_netG.pt">best_netG.pt</a> | 429M |
- | 44.1kHz VAE | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth">v1-44.pth</a> | 1.2G |
- | Synchformer visual encoder | <a href="https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth">synchformer_state_dict.pth</a> | 907M |
-
- The 44.1kHz vocoder will be downloaded automatically.
-
- The expected directory structure (full):
-
- ```bash
- MMAudio
- ├── ext_weights
- │ ├── best_netG.pt
- │ ├── synchformer_state_dict.pth
- │ ├── v1-16.pth
- │ └── v1-44.pth
- ├── weights
- │ ├── mmaudio_small_16k.pth
- │ ├── mmaudio_small_44k.pth
- │ ├── mmaudio_medium_44k.pth
- │ └── mmaudio_large_44k.pth
- └── ...
- ```
-
- The expected directory structure (minimal, for the recommended model only):
-
- ```bash
- MMAudio
- ├── ext_weights
- │ ├── synchformer_state_dict.pth
- │ └── v1-44.pth
- ├── weights
- │ └── mmaudio_large_44k.pth
- └── ...
- ```
-
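If you prefer to fetch the recommended checkpoints by hand instead of relying on the automatic download, a sketch like the following reproduces the minimal layout above. The URLs come from the table; the exact MD5 values to compare against must be taken from `mmaudio/utils/download_utils.py`.

```bash
# Manual download sketch for the minimal (large_44k) setup.
cd MMAudio
mkdir -p weights ext_weights

wget -O weights/mmaudio_large_44k.pth https://databank.illinois.edu/datafiles/4jx76/download
wget -O ext_weights/v1-44.pth https://github.com/hkchengrex/MMAudio/releases/download/v0.1/v1-44.pth
wget -O ext_weights/synchformer_state_dict.pth https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth

# Compare against the MD5 checksums listed in mmaudio/utils/download_utils.py.
md5sum weights/mmaudio_large_44k.pth ext_weights/v1-44.pth ext_weights/synchformer_state_dict.pth
```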
- ## Demo
-
- By default, these scripts use the `large_44k` model.
- In our experiments, inference takes only around 6GB of GPU memory (in 16-bit mode), so it should fit on most modern GPUs.
-
- ### Command-line interface
-
- With `demo.py`:
- ```bash
- python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
- ```
- The output (audio in `.flac` format and video in `.mp4` format) will be saved in `./output`.
- See the file for more options.
- Simply omit the `--video` option for text-to-audio synthesis.
- The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
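For example, a text-only invocation (following the note above about omitting `--video`) looks like this; the prompt string is only an illustration.

```bash
# Text-to-audio synthesis: no --video argument, prompt only.
python demo.py --duration=8 --prompt "waves crashing against a rocky shore"
```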
-
-
- ### Gradio interface
-
- Supports video-to-audio and text-to-audio synthesis.
-
- ```bash
- python gradio_demo.py
- ```
-
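To expose the Gradio demo on a specific host and port (useful on a remote machine), Gradio's standard environment variables should work, assuming `gradio_demo.py` does not hard-code its own server settings.

```bash
# Assumes the script relies on Gradio's default launch behavior.
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python gradio_demo.py
```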
- ### Known limitations
-
- 1. The model sometimes generates undesired, unintelligible human speech-like sounds.
- 2. The model sometimes generates undesired background music.
- 3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfire" but not "RPG firing".
-
- We believe all three of these limitations can be addressed with more high-quality training data.
-
- ## Training
- Work in progress.
-
- ## Evaluation
- Work in progress.
-
- ## Acknowledgement
- Many thanks to:
- - [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model
- - [BigVGAN](https://github.com/NVIDIA/BigVGAN)
- - [Synchformer](https://github.com/v-iashin/Synchformer)
-
 
  ---
+ title: Dokdo.1
+ emoji: 🔊✨
  colorFrom: blue
  colorTo: indigo
  sdk: gradio
  app_file: app.py
  pinned: false
+ short_description: automated video and sound synthesis from images
  ---