<!-- ## **HunyuanVideo** -->

<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/logo.png" height=100>
</p>

# HunyuanVideo: A Systematic Framework For Large Video Generation Model Training

-----

This repo contains the weights of the HunyuanVideo-PromptRewrite model. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com).
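The weights can be fetched with `huggingface_hub`, for example (a minimal sketch; the repo id is taken from the model link at the end of this card, and the local directory is an arbitrary choice):

```python
from huggingface_hub import snapshot_download

# Download the HunyuanVideo-PromptRewrite weights into a local folder.
# "ckpts/hunyuanvideo-promptrewrite" is just an example path.
local_dir = snapshot_download(
    repo_id="Tencent/HunyuanVideo-PromptRewrite",
    local_dir="ckpts/hunyuanvideo-promptrewrite",
)
print(f"Weights downloaded to {local_dir}")
```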
> [**HunyuanVideo: A Systematic Framework For Large Video Generation Model Training**](https://github.com/Tencent/HunyuanVideo/blob/main/assets/hunyuanvideo.pdf) <br>

Due to the limitations of the GitHub page, the video is compressed; the original video can be downloaded from [here](https://aivideo.hunyuan.tencent.com/download/HunyuanVideo/material/demo.mov).
## 🔥🔥🔥 News!!
* Dec 3, 2024: 🤗 We release the inference code and model weights of HunyuanVideo.

## 📑 Open-source Plan

- HunyuanVideo (Text-to-Video Model)
  - [x] Inference
  - [x] Checkpoints
  - [ ] Penguin Video Benchmark
  - [ ] Web Demo (Gradio)
  - [ ] ComfyUI
  - [ ] Diffusers
- HunyuanVideo (Image-to-Video Model)
  - [ ] Inference
  - [ ] Checkpoints

## Contents
- [HunyuanVideo: A Systematic Framework For Large Video Generation Model Training](#hunyuanvideo--a-systematic-framework-for-large-video-generation-model-training)
  - [🔥🔥🔥 News!!](#-news)
  - [📑 Open-source Plan](#-open-source-plan)
  - [Contents](#contents)
  - [**Abstract**](#abstract)
  - [**HunyuanVideo Overall Architecture**](#hunyuanvideo-overall-architecture)
  - [🎉 **HunyuanVideo Key Features**](#-hunyuanvideo-key-features)
    - [**Unified Image and Video Generative Architecture**](#unified-image-and-video-generative-architecture)
    - [**MLLM Text Encoder**](#mllm-text-encoder)
    - [**3D VAE**](#3d-vae)
    - [**Prompt Rewrite**](#prompt-rewrite)
  - [📈 Comparisons](#-comparisons)
  - [🔗 BibTeX](#-bibtex)
  - [Acknowledgements](#-acknowledgements)

---
## **Abstract**

## **HunyuanVideo Overall Architecture**

… encoded using a large language model and used as the condition. Gaussian noise and the condition are taken as input; our generation model generates an output latent, which is decoded to images or videos through the 3D VAE decoder.
<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/overall.png" height=300>
</p>
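Written out as code, the flow in the figure above looks roughly like the sketch below. It is only an illustration; every function and object name here is hypothetical and is not the repository's actual API.

```python
import torch

def generate_video(prompt, text_encoder, generator, vae_decoder, latent_shape, num_steps=50):
    """Hypothetical sketch of the generation flow described above."""
    # 1. Encode the text prompt with a large language model; use it as the condition.
    condition = text_encoder(prompt)

    # 2. Start from Gaussian noise in the spatio-temporally compressed latent space.
    latent = torch.randn(latent_shape)

    # 3. The generative model turns noise plus condition into an output latent
    #    (iterative denoising in the latent space).
    for step in reversed(range(num_steps)):
        latent = generator.denoise(latent, condition, step)

    # 4. Decode the latent to pixel-space images or video with the 3D VAE decoder.
    return vae_decoder(latent)
```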
## 🎉 **HunyuanVideo Key Features**

### **Unified Image and Video Generative Architecture**

… tokens and feed them into subsequent Transformer blocks for effective multimodal information fusion. This design captures complex interactions between visual and semantic information, enhancing overall model performance.
<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/backbone.png" height=350>
</p>
### **MLLM Text Encoder**
Some previous text-to-video models typically use pretrained CLIP and T5-XXL as text encoders … (ii) Compared with CLIP, MLLM has demonstrated superior ability in image detail description and complex reasoning; (iii) MLLM can act as a zero-shot learner by following system instructions prepended to user prompts, helping text features pay more attention to key information. In addition, MLLM is based on causal attention, while T5-XXL utilizes bidirectional attention, which produces better text guidance for diffusion models; we therefore introduce an extra bidirectional token refiner to enhance the text features.
<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/text_encoder.png" height=275>
</p>
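A minimal sketch of that idea (illustrative only, not the repository's actual refiner; all dimensions are made-up example values): hidden states from the causal-attention text model are passed through a small unmasked, i.e. bidirectional, Transformer encoder so that every prompt token can attend to the full context.

```python
import torch
import torch.nn as nn

class BidirectionalTokenRefiner(nn.Module):
    """Illustrative token refiner: bidirectional self-attention over causal-LM text features."""

    def __init__(self, hidden_dim=1024, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads, batch_first=True)
        # nn.TransformerEncoder applies unmasked (bidirectional) self-attention by default.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_features, padding_mask=None):
        # text_features: (batch, seq_len, hidden_dim) hidden states from the causal MLLM.
        return self.encoder(text_features, src_key_padding_mask=padding_mask)

# Example: refine features for a batch of 2 prompts, 77 tokens each.
features = torch.randn(2, 77, 1024)
refined = BidirectionalTokenRefiner()(features)  # same shape, contextually refined
```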
### **3D VAE**
HunyuanVideo trains a 3D VAE with CausalConv3D to compress pixel-space videos and images into a compact latent space. We set the compression ratios of video length, space and channel to 4, 8 and 16, respectively. This can significantly reduce the number of tokens for the subsequent diffusion transformer model, allowing us to train videos at the original resolution and frame rate.
<p align="center">
  <img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo/main/assets/3dvae.png" height=150>
</p>
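As a back-of-the-envelope illustration of what those ratios imply (the exact rounding and first-frame handling of the causal VAE may differ), the latent grid for a clip can be estimated as:

```python
def latent_shape(num_frames, height, width,
                 time_ratio=4, space_ratio=8, latent_channels=16):
    """Approximate latent size implied by the 4x / 8x / 16-channel compression above."""
    return (latent_channels,
            num_frames // time_ratio,
            height // space_ratio,
            width // space_ratio)

# Example: a 129-frame 720x1280 clip maps to roughly a (16, 32, 90, 160) latent,
# i.e. about 32 * 90 * 160 = 460,800 latent positions for the diffusion transformer.
print(latent_shape(129, 720, 1280))
```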
### **Prompt Rewrite**
To address the variability in linguistic style and length of user-provided prompts, we fine-tune the [Hunyuan-Large model](https://github.com/Tencent/Tencent-Hunyuan-Large) as our prompt rewrite model to adapt the original user prompt to a model-preferred prompt.

We provide two rewrite modes: Normal mode and Master mode, which can be called using different prompts. The Normal mode is designed to enhance the video generation model's comprehension of user intent, facilitating a more accurate interpretation of the instructions provided. The Master mode enhances the description of aspects such as composition, lighting, and camera movement, which leans towards generating videos with a higher visual quality. However, this emphasis may occasionally result in the loss of some semantic details.

The Prompt Rewrite Model can be directly deployed and inferred using the [Hunyuan-Large original code](https://github.com/Tencent/Tencent-Hunyuan-Large). We release the weights of the Prompt Rewrite Model [here](https://huggingface.co/Tencent/HunyuanVideo-PromptRewrite).
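For a rough idea of how the released weights could be called with `transformers` (a sketch under assumptions: the chat template and the official Normal/Master rewrite prompts ship with the Hunyuan-Large inference code, so the system message below is only a hypothetical placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tencent/HunyuanVideo-PromptRewrite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

messages = [
    # Hypothetical placeholder instruction; in practice, use the official Normal / Master
    # mode prompts from the Hunyuan-Large inference code.
    {"role": "system", "content": "Rewrite the user's input into a detailed video-generation prompt."},
    {"role": "user", "content": "a cat running in the rain"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```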