---
license: mit
license_link: https://github.com/microsoft/VidTok/blob/main/LICENSE

tags:
- video-to-video
- tokenization
- video generation
- world model
- vae
- fsq
---

# VidTok
A Family of Versatile and State-Of-The-Art Video Tokenizers

<img src="./assets/radar.png" width="95%" alt="radar" align="center">

VidTok is a cutting-edge family of video tokenizers that delivers state-of-the-art performance in both continuous and discrete tokenizations with various compression rates. VidTok incorporates several key advancements over existing approaches:
 * ⚡️ **Efficient Architecture**. Separate spatial and temporal sampling reduces computational complexity without sacrificing quality.
 * 🔥 **Advanced Quantization**. Finite Scalar Quantization (FSQ) addresses training instability and codebook collapse in discrete tokenization (sketched below).
 * 💥 **Enhanced Training**. A two-stage strategy—pre-training on low-res videos and fine-tuning on high-res—boosts efficiency. Reduced frame rates improve motion dynamics representation.

VidTok, trained on a large-scale video dataset, outperforms previous models across all metrics, including PSNR, SSIM, LPIPS, and FVD.
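The FSQ regularizer used by the discrete models replaces a learned codebook with per-channel rounding onto a fixed grid, which is what removes the codebook-collapse failure mode. Below is a minimal PyTorch sketch of the idea, not VidTok's actual implementation; the level counts are an illustrative assumption (six channels of 8 levels give 8^6 = 262,144 implicit codes, matching the largest codebook in the table below).

```python
import torch

def fsq(z: torch.Tensor, levels=(8, 8, 8, 8, 8, 8)) -> torch.Tensor:
    """Simplified FSQ: bound each latent channel, then round it onto a fixed
    grid. The implicit codebook size is prod(levels) -- here 8**6 = 262,144."""
    n = torch.tensor(levels, dtype=z.dtype, device=z.device)
    z_bounded = torch.sigmoid(z) * (n - 1)  # each channel squashed into (0, L-1)
    z_q = torch.round(z_bounded)            # snap to the integers {0, ..., L-1}
    # straight-through estimator: quantized values forward, smooth gradients backward
    return z_bounded + (z_q - z_bounded).detach()

# toy usage: a batch of latent vectors with channels in the last dimension
z = torch.randn(4, 6)
print(fsq(z))
```

Because the grid is fixed, there is no codebook to learn and hence nothing to collapse; training reduces to ordinary gradient descent through the straight-through estimator.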

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/619b7b1cab4c7b7f16a7d59e/4v2I2YAZJeWSnd7iqntGX.mp4"></video>

Resources and technical documentation:

+ [GitHub](https://github.com/microsoft/VidTok)
+ [arXiv](https://arxiv.org/abs/2412.13061)
+ [Hugging Face paper page](https://huggingface.co/papers/2412.13061)

## Model Performance

The following table reports performance on 30 test videos from the [MCL-JCV](https://mcl.usc.edu/mcl-jcv-dataset/) dataset, sampled at 30 fps. The input size is `17x256x256` for causal models and `16x256x256` for non-causal models. `VCR` is the video compression ratio in the form `TxHxW`: at `4x8x8`, for example, a `16x256x256` clip compresses to a `4x32x32` latent grid, while a causal model treats the first frame separately, so 17 input frames yield 5 latent frames. A loading sketch follows the table.

| Model  | Regularizer | Causal  | VCR  | PSNR  | SSIM  | LPIPS  | FVD |
|------|------|------|------|------|------|------|------|
| [vidtok_kl_causal_488_4chn](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_kl_causal_488_4chn.ckpt)  | KL-4chn  | ✔️   | 4x8x8  | 29.64 | 0.852| 0.114| 194.2|
| [vidtok_kl_causal_488_8chn](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_kl_causal_488_8chn.ckpt)  |  KL-8chn |  ✔️   |4x8x8  | 31.83 | 0.897| 0.083| 109.3|
| [vidtok_kl_causal_488_16chn](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_kl_causal_488_16chn.ckpt)  | KL-16chn  | ✔️  | 4x8x8   | 35.04 |0.942 |0.047 | 78.9| 
| [vidtok_kl_causal_41616_4chn](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_kl_causal_41616_4chn.ckpt)  | KL-4chn  | ✔️  | 4x16x16   | 25.05  | 0.711| 0.228| 549.1|
| [vidtok_kl_noncausal_488_4chn](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_kl_noncausal_488_4chn.ckpt)  | KL-4chn  |  ✖️ | 4x8x8   | 30.60  | 0.876 | 0.098| 157.9| 
| [vidtok_kl_noncausal_41616_4chn](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_kl_noncausal_41616_4chn.ckpt)  |  KL-4chn | ✖️  | 4x16x16   | 26.06  | 0.751 | 0.190|423.2 |
| [vidtok_fsq_causal_488_262144](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_fsq_causal_488_262144.ckpt)  | FSQ-262,144  | ✔️  |  4x8x8  | 29.82  | 0.867 |0.106 | 160.1|
| [vidtok_fsq_causal_488_32768](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_fsq_causal_488_32768.ckpt)  | FSQ-32,768  | ✔️  | 4x8x8   | 29.16  | 0.854 | 0.117| 196.9|
| [vidtok_fsq_causal_488_4096](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_fsq_causal_488_4096.ckpt)  | FSQ-4096  | ✔️  | 4x8x8   | 28.36 | 0.832 | 0.133| 218.1|
| [vidtok_fsq_causal_41616_262144](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_fsq_causal_41616_262144.ckpt)  |  FSQ-262,144 |  ✔️ |  4x16x16 | 25.38 | 0.738 |0.206 | 430.1|
| [vidtok_fsq_noncausal_488_262144](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_fsq_noncausal_488_262144.ckpt)  |  FSQ-262,144 | ✖️  | 4x8x8 | 30.78 | 0.889| 0.091| 132.1|
| [vidtok_fsq_noncausal_41616_262144](https://huggingface.co/microsoft/VidTok/blob/main/checkpoints/vidtok_fsq_noncausal_41616_262144.ckpt)  |  FSQ-262,144 | ✖️  |  4x16x16  | 26.37 | 0.772| 0.171| 357.0|
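
To sanity-check a downloaded checkpoint, a minimal reconstruction sketch is shown below. It assumes the config layout and the `load_model_from_config` helper from the [GitHub repo](https://github.com/microsoft/VidTok), and that the forward pass returns the reconstruction as its second output; consult the repo for the exact API.

```python
import torch
from scripts.inference_evaluate import load_model_from_config  # helper from the VidTok repo

cfg_path = "configs/vidtok_kl_causal_488_4chn.yaml"      # assumed config path
ckpt_path = "checkpoints/vidtok_kl_causal_488_4chn.ckpt"
is_causal = True

device = torch.device("cuda")
model = load_model_from_config(cfg_path, ckpt_path).to(device).eval()

# causal models take 17 frames, non-causal models take 16 (see the table above)
num_frames = 17 if is_causal else 16
x = (torch.rand(1, 3, num_frames, 256, 256) * 2 - 1).to(device)  # [B, C, T, H, W] in [-1, 1]

with torch.no_grad():
    _, x_recon, _ = model(x)  # assumed return order: (latent, reconstruction, extra)
assert x_recon.shape == x.shape
```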

## Training
### Training Data

The training data of VidTok is divided into two sets based on video quality:
1. Training Set 1 consists of approximately 400K low-resolution videos (e.g., 480p).
2. Training Set 2 consists of approximately 10K high-resolution videos (e.g., 1080p).

Both sets contain natural videos with diverse lighting, motion, and scenarios.

### Training Procedure

Please refer to the [paper](https://arxiv.org/pdf/2412.13061) and [code](https://github.com/microsoft/VidTok) for detailed training instructions.

## Evaluation

Please refer to the [paper](https://arxiv.org/pdf/2412.13061) and [code](https://github.com/microsoft/VidTok) for detailed evaluation instructions.

## Intended Uses

We are sharing our model with the research community to foster further research in this area: 
* Training your own video tokenizers for research purposes.
* Video tokenization with various compression rates.

## Downstream Uses

Our model is designed to accelerate video-centric research and to serve as a building block for the following applications (a token-packing sketch follows the list):
* Video generation on the continuous / discrete latent tokens.
* World modelling on the continuous / discrete latent tokens.
* Generative games on the continuous / discrete latent tokens.
* Video understanding from the latent tokens.
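
For the FSQ variants, these applications typically consume the quantized latents as integer token ids. A hypothetical packing sketch, using the 0..L-1 channel convention from the FSQ sketch above (the level counts and tensor layout are assumptions, not VidTok's API):

```python
import torch

def fsq_ids(z_q: torch.Tensor, levels=(8, 8, 8, 8, 8, 8)) -> torch.Tensor:
    """Pack per-channel FSQ digits (integers 0..L-1) from a [B, C, T, H, W]
    latent grid into one token id per cell, shape [B, T, H, W]."""
    radix = torch.ones(len(levels), dtype=torch.long, device=z_q.device)
    for i in range(1, len(levels)):
        radix[i] = radix[i - 1] * levels[i - 1]          # mixed-radix place values
    return (z_q.long() * radix.view(1, -1, 1, 1, 1)).sum(dim=1)

# e.g. a causal 4x8x8 model on 17x256x256 input yields a 5x32x32 latent grid
z_q = torch.randint(0, 8, (1, 6, 5, 32, 32)).float()
ids = fsq_ids(z_q)
print(ids.shape, int(ids.max()))  # torch.Size([1, 5, 32, 32]), ids < 262,144
```

The resulting id grid can be flattened into a sequence for autoregressive generation or kept spatial, depending on the downstream model.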

## Out-of-scope Uses

Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of video tokenizers (e.g., performance degradation on out-of-domain data) as they select use cases, and evaluate and mitigate for privacy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. 

Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case. 

## Risks and Limitations 

Some of the limitations of this model to be aware of include:
* VidTok may lose fine-grained detail in the reconstructed content.
* VidTok inherits any biases, errors, or omissions characteristic of its training data.
* VidTok was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.

## Recommendations

Some recommendations for alleviating potential limitations include:
* A lower compression rate provides higher reconstruction quality.
* For domain-specific video tokenization, fine-tune the model on domain-specific videos, as sketched below.
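
As a rough illustration of the second recommendation, the loop below fine-tunes on stand-in domain clips with a plain L1 reconstruction loss. The `load_model_from_config` helper, paths, and return signature are the same assumptions as in the sketch above; the real training recipe (perceptual and adversarial losses, schedules) is documented in the paper and repo.

```python
import torch
import torch.nn.functional as F
from scripts.inference_evaluate import load_model_from_config  # VidTok repo helper

model = load_model_from_config("configs/vidtok_kl_causal_488_4chn.yaml",
                               "checkpoints/vidtok_kl_causal_488_4chn.ckpt").cuda().train()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# stand-in for a real dataloader over domain-specific clips scaled to [-1, 1]
loader = [torch.rand(1, 3, 17, 256, 256) * 2 - 1 for _ in range(4)]

for clip in loader:
    clip = clip.cuda()
    _, recon, _ = model(clip)      # assumed (latent, reconstruction, extra)
    loss = F.l1_loss(recon, clip)  # reconstruction-only toy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```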

## License

The model is released under the [MIT license](https://github.com/microsoft/VidTok/blob/main/LICENSE).

## Contact

We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at [email protected].

## Contributing

This project welcomes contributions and suggestions.  Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [[email protected]](mailto:[email protected]) with any additional questions or comments.


## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft 
trademarks or logos is subject to and must follow 
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.