We have witnessed the rise of generative AI models in the last couple of months. They went from generating low-resolution face-like images to generating high-resolution photo-realistic images quite quickly. It is now possible to obtain unique, photo-realistic images by describing what we want to see. Moreover, maybe more impressive is the fact that we can even use diffusion models to generate videos for us.
The key contributor to generative AI is the diffusion models. They take a text prompt and generate an output that matches that description. They do this by gradually transforming a set of random numbers into an image or video while adding more details to make it look like the description. These models learn from datasets with millions of samples, so they can generate new visuals that look similar to the ones they’ve seen before. Though, the dataset can be the key problem sometimes.
It is almost always not feasible to train a diffusion model for video generation from scratch. They require extremely large datasets and also equipment to feed their needs. Constructing such datasets is only possible for a couple of institutes around the world, as accessing and collecting those data is out of reach for most people due to the cost. We have to go with existing models and try to make them work for our use case.
Even if somehow you manage to prepare a text-video dataset with millions, if not billions, of pairs, you still need to find a way to obtain the hardware power required to feed those large-scale models. Therefore, the high cost of video diffusion models makes it difficult for many users to customize these technologies for their own needs.
What if there was a way to bypass this requirement? Could we have a way to reduce the cost of training video diffusion models? Time to meet Text2Video-Zero
Text2Video-Zero is a zero-shot text-to-video generative model, which means it does not require any training to be customized. It uses pre-trained text-to-image models and converts them into a temporally consistent video generation model. In the end, the video displays a sequence of images in a quick manner to stimulate the movement. The idea of using them consecutively to generate the video is a straightforward solution.
Though, we cannot just use an image generation model hundreds of times and combine the outputs at the end. This will not work because there is no way to ensure the models draw the same objects all the time. We need a way to ensure temporal consistency in the model.
To enforce temporal consistency, Text2Video-Zero uses two lightweight modifications.
First, it enriches the latent vectors of generated frames with motion information to keep the global scene and the background time consistent. This is done by adding motion information to the latent vectors instead of just randomly sampling them. However, these latent vectors do not have sufficient restrictions to depict specific colors, shapes, or identities, resulting in temporal inconsistencies, particularly for the foreground object. Therefore, a second modification is required to tackle this issue.
The second modification is about the attention mechanism. To leverage the power of cross-frame attention and at the same time exploit a pre-trained diffusion model without retraining, each self-attention layer is replaced with cross-frame attention, and the attention for each frame is focused on the first frame. This helps Text2Video-Zero to preserve the context, appearance, and identity of the foreground object throughout the entire sequence.
Experiments show that these modifications lead to high-quality and time-consistent video generation, even though it does not require training on large-scale video data. Furthermore, it is not limited to text-to-video synthesis but is also applicable to conditional and specialized video generation, as well as video editing by textual instruction.
Check out the Paper and Github. Don’t forget to join our 19k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
🚀 Check Out 100’s AI Tools in AI Tools Club
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.