Lumiere introduces a text-to-video diffusion model that tackles the challenge of synthesizing realistic, diverse, and coherent motion in videos. Using a Space-Time U-Net architecture, it generates the entire temporal duration of the video in a single pass, unlike existing models that first synthesize distant keyframes and then fill them in with temporal super-resolution, a cascade that makes global temporal consistency difficult to achieve. By applying both spatial and temporal down- and up-sampling and building on a pre-trained text-to-image diffusion model, Lumiere learns to generate full-frame-rate, low-resolution videos by processing them at multiple space-time scales. The model supports a wide range of content creation tasks, including text-to-video and image-to-video generation, video stylization, inpainting, and more.
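To make the single-pass, space-time design concrete, the following is a minimal PyTorch sketch of a U-Net-style stage that downsamples and upsamples a clip in both space and time so that the whole temporal extent is denoised at once. The module names, layer choices, and dimensions are illustrative assumptions for exposition, not Lumiere's actual architecture.

```python
# Illustrative sketch only: a toy space-time U-Net stage in PyTorch.
# SpaceTimeDown, SpaceTimeUp, and STUNetSketch are hypothetical names.
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    """Downsample a video tensor (B, C, T, H, W) in both space and time."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # Factor-2 temporal and spatial striding in a single 3D convolution.
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, stride=(2, 2, 2), padding=1)

    def forward(self, x):
        return self.conv(x)

class SpaceTimeUp(nn.Module):
    """Upsample back to the original space-time resolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=(2, 2, 2), padding=1)

    def forward(self, x):
        return self.conv(x)

class STUNetSketch(nn.Module):
    """Toy encoder-decoder over a full clip: the whole temporal extent is
    processed in one pass, so no separate temporal super-resolution stage."""
    def __init__(self, channels=64):
        super().__init__()
        self.inp = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.down = SpaceTimeDown(channels, channels * 2)
        self.mid = nn.Conv3d(channels * 2, channels * 2, kernel_size=3, padding=1)
        self.up = SpaceTimeUp(channels * 2, channels)
        self.out = nn.Conv3d(channels, 3, kernel_size=3, padding=1)

    def forward(self, video):            # video: (B, 3, T, H, W)
        h0 = self.inp(video)
        h1 = self.down(h0)               # coarser in T, H, and W
        h1 = self.mid(h1)
        h2 = self.up(h1) + h0            # skip connection, back to full T, H, W
        return self.out(h2)              # e.g. predicted noise for all frames at once

# Example: one denoising call over an entire 16-frame, 64x64 clip.
noise_pred = STUNetSketch()(torch.randn(1, 3, 16, 64, 64))
```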
Lumiere is a text-to-video diffusion model designed primarily to synthesize videos that portray realistic, diverse, and coherent motion.
Lumiere uses a Space-Time U-Net architecture to generate videos.
Lumiere facilitates text-to-video generation, image-to-video generation, video stylization, inpainting, and other video editing applications.
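As a rough illustration of how such conditional applications are often wired into a video diffusion model, the sketch below concatenates a masked conditioning clip and a binary mask to the noisy input along the channel axis, so that image-to-video and inpainting become special cases of which pixels are given. This is a common conditioning scheme rather than Lumiere's published code, and all names and shapes here are assumptions.

```python
# Illustrative sketch, not Lumiere's implementation: condition the denoiser
# on known pixels by channel-concatenating a masked clip and its mask.
import torch

def build_conditional_input(noisy_video, cond_video, mask):
    """noisy_video, cond_video: (B, 3, T, H, W); mask: (B, 1, T, H, W),
    1 where pixels are given (the first frame for image-to-video,
    everything outside the hole for inpainting)."""
    masked_cond = cond_video * mask          # hide the regions to be generated
    return torch.cat([noisy_video, masked_cond, mask], dim=1)  # (B, 7, T, H, W)

# Example: image-to-video, where only frame 0 is known.
B, T, H, W = 1, 16, 64, 64
mask = torch.zeros(B, 1, T, H, W)
mask[:, :, 0] = 1.0
model_input = build_conditional_input(torch.randn(B, 3, T, H, W),
                                       torch.randn(B, 3, T, H, W), mask)
```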
Limitations include the potential for misuse of generated content, dependence on a pre-trained text-to-image model, and constraints on directly producing high-resolution videos, which still require a separate spatial super-resolution stage.
The authors include Omer Bar-Tal, Hila Chefer, and several others from institutions like Google Research and the Weizmann Institute.