Video diffusion models are a class of generative models used to synthesize videos from text descriptions. Although generative AI has advanced rapidly in neighboring domains, with ChatGPT for text and Midjourney for images, video generation models still often struggle with temporal consistency and natural dynamics. To address this, researchers from S-Lab at Nanyang Technological University have developed FreeInit, an inference-time sampling method designed to bridge the gap between the training and inference phases of video diffusion models and thereby improve video quality.
FreeInit works by adjusting noise initialization, a crucial step in video generation. Conventional models use Gaussian noise in both the training and inference stages. However, the noise a model starts from at inference does not match what it saw during training, particularly in its spatial-temporal frequency distribution, and this mismatch leads to videos that lack temporal consistency. FreeInit addresses the issue by iteratively refining the spatial-temporal low-frequency components of the initial noise. The method requires no additional training and no learnable parameters, so it can be added to existing video diffusion models at inference time.
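To make "spatial-temporal low-frequency components" concrete, the following is a minimal NumPy sketch of splitting a noise tensor into low- and high-frequency parts with a 3D Fourier transform. It is an illustration only, not the researchers' code: the function names, the Gaussian filter shape, the `cutoff` value, and the small (frames, height, width) tensor are all assumptions made for the example.

```python
import numpy as np

AXES = (-3, -2, -1)  # (frames, height, width)


def gaussian_low_pass(shape, cutoff=0.25):
    """Gaussian low-pass mask over (frames, height, width).

    Frequencies are normalized to [-1, 1) and centered, so the mask lines up
    with an fftshift-ed spectrum. `cutoff` is an illustrative value.
    """
    freqs = [np.fft.fftshift(np.fft.fftfreq(n)) * 2.0 for n in shape]
    t, h, w = np.meshgrid(*freqs, indexing="ij")
    return np.exp(-(t**2 + h**2 + w**2) / (2.0 * cutoff**2))


def split_frequencies(x, mask):
    """Return (low, high) parts of x such that low + high == x."""
    spectrum = np.fft.fftshift(np.fft.fftn(x, axes=AXES), axes=AXES)
    low = np.fft.ifftn(np.fft.ifftshift(spectrum * mask, axes=AXES), axes=AXES).real
    return low, x - low


rng = np.random.default_rng(0)
noise = rng.standard_normal((16, 8, 8))  # toy latent: 16 frames of 8x8
low, high = split_frequencies(noise, gaussian_low_pass(noise.shape))
print("variance of full noise:", noise.var())
print("variance of low-frequency part:", low.var())
```

The low-frequency part changes slowly across both space and time, which is why it is the natural place to look for (or inject) consistency between frames.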
The core technique of FreeInit is to reinitialize the noise so as to narrow the training-inference gap. It starts with independent Gaussian noise, which is run through the denoising process to yield a clean video latent. The generated latent is then subjected to forward diffusion, producing noisy latents with improved temporal consistency. The low-frequency components of these noisy latents are combined with the high-frequency components of fresh random Gaussian noise to create the reinitialized noise, which serves as the starting point for the next sampling iteration. Repeating this process improves the temporal consistency and visual appearance of the generated videos.
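The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the researchers' implementation: `toy_denoise` is a hypothetical stand-in for a full sampling pass of a real video diffusion model, `alpha_bar_T`, `cutoff`, and `num_iters` are illustrative values, the Gaussian filter is one possible choice, and the forward diffusion step here reuses the noise that started the current pass.

```python
import numpy as np

AXES = (-3, -2, -1)  # (frames, height, width)


def low_pass_mask(shape, cutoff=0.25):
    """Centered Gaussian low-pass mask (filter shape and cutoff are assumptions)."""
    freqs = [np.fft.fftshift(np.fft.fftfreq(n)) * 2.0 for n in shape]
    t, h, w = np.meshgrid(*freqs, indexing="ij")
    return np.exp(-(t**2 + h**2 + w**2) / (2.0 * cutoff**2))


def mix_frequencies(low_src, high_src, mask):
    """Take low frequencies from low_src and high frequencies from high_src."""
    f_low = np.fft.fftshift(np.fft.fftn(low_src, axes=AXES), axes=AXES)
    f_high = np.fft.fftshift(np.fft.fftn(high_src, axes=AXES), axes=AXES)
    mixed = f_low * mask + f_high * (1.0 - mask)
    return np.fft.ifftn(np.fft.ifftshift(mixed, axes=AXES), axes=AXES).real


def toy_denoise(noise):
    """Hypothetical stand-in for a diffusion sampler: repeats the mean frame."""
    mean_frame = noise.mean(axis=0, keepdims=True)
    return np.broadcast_to(mean_frame, noise.shape).copy()


def reinit_sampling(denoise, shape, num_iters=3, alpha_bar_T=0.005,
                    cutoff=0.25, seed=0):
    rng = np.random.default_rng(seed)
    mask = low_pass_mask(shape, cutoff)
    noise = rng.standard_normal(shape)       # independent Gaussian noise
    for i in range(num_iters):
        clean = denoise(noise)               # sampling pass -> clean latent
        if i == num_iters - 1:
            break
        # Forward diffusion of the clean latent back to the final timestep.
        noisy = np.sqrt(alpha_bar_T) * clean + np.sqrt(1.0 - alpha_bar_T) * noise
        # Keep the low frequencies of the noisy latent and take the high
        # frequencies from fresh Gaussian noise.
        fresh = rng.standard_normal(shape)
        noise = mix_frequencies(noisy, fresh, mask)
    return clean


video_latent = reinit_sampling(toy_denoise, shape=(16, 8, 8))
print(video_latent.shape)
```

In practice `denoise` would wrap the sampler of a model such as AnimateDiff. Because only the initial noise changes between passes, the model's weights are untouched, at the cost of one extra sampling pass per iteration.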
To validate FreeInit, the researchers ran extensive experiments applying it to several text-to-video models, including AnimateDiff, ModelScope, and VideoCrafter. The reported temporal consistency metrics improved by 2.92 to 8.62. Qualitative and quantitative gains were evident across a range of text prompts, demonstrating that FreeInit is effective across different video generation models.
The researchers have made FreeInit openly available, encouraging its wider use and further development. Integrating FreeInit into current video generation models could help close a training-inference gap that has long been a challenge in this domain.