Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
Abstract
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to maintain long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance, which leverages information from all previous cleaner frames at the front of the queue to guide the denoising of the noisier frames at the tail, fostering rich and contextual global information interaction. Extensive experiments on long video generation with the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
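To make the queue mechanism concrete, below is a rough PyTorch sketch of a FIFO denoising loop combined with a low-frequency tail-sampling step in the spirit described above. It is not the authors' released implementation; the `denoise_step` signature, the frequency cutoff, and the queue layout (index 0 = cleanest, last index = noisiest) are illustrative assumptions.

```python
# Hypothetical sketch of FIFO denoising with structure-preserving tail sampling.
import torch
import torch.fft as fft

def lowpass(latent, cutoff=0.25):
    """Keep only the low spatial frequencies of a latent tensor (..., H, W)."""
    freq = fft.fftshift(fft.fft2(latent), dim=(-2, -1))
    h, w = latent.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(latent.dtype)
    return fft.ifft2(fft.ifftshift(freq * mask, dim=(-2, -1))).real

def enqueue_tail_latent(prev_tail_latent, cutoff=0.25):
    """New tail sample: low-frequency structure from the preceding (noisiest)
    latent plus high-frequency detail from fresh Gaussian noise."""
    noise = torch.randn_like(prev_tail_latent)
    return lowpass(prev_tail_latent, cutoff) + (noise - lowpass(noise, cutoff))

def fifo_generation(denoise_step, init_latents, num_frames_out):
    """FIFO loop: the queue holds latents at progressively increasing noise
    levels. Each iteration denoises every slot one step (denoise_step is an
    assumed callable taking a latent and its queue position), dequeues the
    clean frame at the head, and enqueues a structurally aligned tail latent."""
    queue = list(init_latents)          # length == number of diffusion steps
    outputs = []
    for _ in range(num_frames_out):
        queue = [denoise_step(z, t) for t, z in enumerate(queue)]
        outputs.append(queue.pop(0))    # fully denoised frame leaves the head
        queue.append(enqueue_tail_latent(queue[-1]))
    return outputs
```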
Community
A novel approach for generating consistent long videos with diffusion models by reusing low-frequency components from previous frames and applying subject-aware cross-frame attention. Shows improved results on the VBench benchmark.
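The cross-frame attention idea can be pictured roughly as follows. This is a simplified sketch rather than the authors' SACFA code: `subject_masks` (e.g., produced by a segmentation model) is an assumed input, and each frame's attention is simply extended with the subject tokens of all frames in a short segment.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v, subject_masks):
    """q, k, v: (num_frames, num_tokens, dim) per-frame attention tensors.
    subject_masks: (num_frames, num_tokens) booleans marking subject tokens.
    Each frame attends to its own tokens plus the subject tokens gathered
    from every frame in the segment, encouraging a shared subject appearance."""
    subj_k = k[subject_masks]           # (num_subject_tokens, dim)
    subj_v = v[subject_masks]
    outputs = []
    for f in range(q.shape[0]):
        k_f = torch.cat([k[f], subj_k], dim=0)
        v_f = torch.cat([v[f], subj_v], dim=0)
        attn = F.softmax(q[f] @ k_f.T / k_f.shape[-1] ** 0.5, dim=-1)
        outputs.append(attn @ v_f)
    return torch.stack(outputs)
```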
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion (2025)
- Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss (2025)
- Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising (2025)
- Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation (2024)
- Latent-Reframe: Enabling Camera Control for Video Diffusion Model without Training (2024)
- SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models (2024)
- Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment (2025)