VidSketch: Hand-drawn Sketch-Driven Video Generation with Diffusion Control


Video animations generated by our VidSketch. Our method generates video animations from a hand-drawn sketch sequence (the corresponding sketches are placed in the top-left corner of the respective frames; the examples from top to bottom are guided by 1, 2, 4, and 6 sketches) and simple text prompts. This enables the creation of high-quality, spatiotemporally consistent video animations, lowering the barrier between ordinary users and professional artists.
Our VidSketch method empowers users of all skill levels to effortlessly create high-quality video animations from concise text prompts and intuitive hand-drawn sketches.


Abstract

Creating high-quality aesthetic images and video animations typically demands drawing skills beyond those of ordinary users. While advances in AIGC have enabled automated image generation from sketches, existing methods are limited to static images and cannot use hand-drawn sketches to control video animation generation. To solve this problem, our method, VidSketch, is the first to generate high-quality video animations solely from any number of hand-drawn sketches and simple text prompts, bridging the gap between ordinary users and artists. Moreover, to accommodate the wide variation in users' drawing skills, we propose an Abstraction-Level Sketch Control Strategy that automatically adjusts the guidance strength of the sketches during generation. Finally, to tackle inter-frame inconsistency, we propose an Enhanced SparseCausal-Attention mechanism that significantly improves the spatiotemporal consistency of the generated video animations.

Hand-drawn Sketches for different categories

[Gallery of hand-drawn sketch examples across categories, including cups and bells]

🎞   VidSketch in different styles  🎞

[Video animation examples in four styles: Surrealistic, Magical, Fantasy, and Realistic]

How does it work?

Hand-drawn Sketch-Driven Video Generation

Pipeline of our VidSketch. During training, we use a high-quality, small-scale video dataset for each category to train the Enhanced SparseCausal-Attention (SC-Attention) and Temporal Attention blocks, improving the spatiotemporal consistency of the generated video animations. During inference, users simply input a prompt and a sketch sequence to generate a tailored high-quality animation. Specifically, the first frame is generated with a T2I-Adapter, while the entire sketch sequence is processed by an Inflated T2I-Adapter to extract control features, which are injected into the VDM's upsampling layers to guide video generation.

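To make the inference flow above concrete, here is a minimal sketch in PyTorch. The handles `t2i_adapter`, `inflated_t2i_adapter`, and `vdm`, along with their methods and argument names, are illustrative assumptions rather than the released API:

```python
import torch

@torch.no_grad()
def generate_animation(prompt, sketches, t2i_adapter, inflated_t2i_adapter, vdm,
                       control_strength=1.0, num_steps=50):
    """sketches: (F, 1, H, W) binary sketch frames; returns an (F, 3, H, W) video."""
    # 1. Generate the first frame from the first sketch with the T2I-Adapter.
    first_frame = t2i_adapter(prompt, sketches[0:1])

    # 2. Extract per-frame control features from the whole sketch sequence.
    control_features = inflated_t2i_adapter(sketches)  # one tensor per up-block

    # 3. Denoise the latent video; the sketch features are injected into the
    #    VDM's upsampling layers, scaled by the abstraction-aware strength.
    scaled = [f * control_strength for f in control_features]
    video = vdm.sample(prompt, first_frame=first_frame,
                       control=scaled, num_inference_steps=num_steps)
    return video
```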

Our training approach follows the traditional VDM framework. First, we conducted an extensive search across the internet to collect 8–12 high-quality training videos for each action category. We then trained the SparseCausal-Attention and Temporal Attention modules separately for each category. This strategy effectively mitigates the scarcity of high-quality video data while enhancing the spatiotemporal consistency and quality of the generated videos.
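A plausible way to implement this per-category fine-tuning is to freeze the VDM backbone and train only the two attention blocks. The module-name keywords and the `diffusion_loss` helper below are assumptions for illustration, not the paper's exact code:

```python
import torch

def set_trainable_attention(vdm, trainable_keywords=("sc_attn", "temp_attn")):
    """Freeze the VDM backbone and unfreeze only the SC-Attention and Temporal
    Attention parameters. Keyword strings must match your module names."""
    for name, param in vdm.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)

def train_category(vdm, dataloader, epochs=100, lr=1e-4):
    """Fine-tune on the 8-12 videos collected for one action category."""
    set_trainable_attention(vdm)
    params = [p for p in vdm.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            # Standard denoising-diffusion objective (hypothetical helper).
            loss = vdm.diffusion_loss(batch["video"], batch["prompt"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```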

Abstraction-Level Sketch Control Strategy

To accommodate the significant variation in users' drawing skills, we conduct a detailed quantitative analysis of the continuity, connectivity, and texture detail of sketch sequences to comprehensively evaluate their abstraction level. This enables us to dynamically adjust the control strength during video generation. The implementation details of the Abstraction-Level Sketch Control Strategy are illustrated in the figure below.


We perform a quantitative analysis of the connectivity, continuity, and texture detail of sketches to automatically evaluate the abstraction level of hand-drawn sketch sequences. Sketches at different abstraction levels receive different generation control strengths, ensuring that VidSketch adapts to users of all drawing skill levels and thereby improving the method's generalization.
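As a rough illustration, the snippet below computes heuristic proxies for the three cues and maps the resulting abstraction score to a control strength. The specific metrics, weights, and strength range are our assumptions, not the paper's exact formulas:

```python
import cv2
import numpy as np

def abstraction_level(sketch):
    """Heuristic abstraction score in [0, 1] for a grayscale sketch
    (dark strokes on a white background); higher means more abstract."""
    binary = (sketch < 128).astype(np.uint8)  # 1 = stroke pixel

    # Connectivity: many small disconnected stroke fragments -> more abstract.
    num_components, _ = cv2.connectedComponents(binary)

    # Texture detail: fraction of the canvas covered by strokes.
    stroke_density = binary.mean()

    # Continuity: a morphological closing fills gaps in strokes; a large
    # difference between the closed and raw masks indicates broken lines.
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    continuity = binary.sum() / max(closed.sum(), 1)

    return (0.4 * min(num_components / 50, 1.0)
            + 0.3 * (1.0 - min(stroke_density * 20, 1.0))
            + 0.3 * (1.0 - continuity))

def control_strength(score, low=0.6, high=1.2):
    """One plausible mapping: more abstract sketches get weaker guidance so
    the model can fill in missing structure; detailed sketches get stronger
    guidance to preserve the user's intent."""
    return high - score * (high - low)
```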

Enhanced SparseCausal-Attention mechanism

The primary distinction between video animation generation and image generation lies in the requirement to maintain spatiotemporal consistency across video frames. To address this inherent challenge, we propose an Enhanced SparseCausal-Attention mechanism. For each frame i in the video sequence, key/value (K/V) representations are extracted from both the initial frame and the preceding frame (i-1), while the query (Q) representation comes from the current frame i; attention is then computed between them.
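The core of the mechanism can be sketched as follows, assuming per-frame token features and standard linear projections; tensor shapes and layer handles are illustrative:

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(frame_features, to_q, to_k, to_v):
    """Minimal sketch of the Enhanced SparseCausal-Attention described above.
    frame_features: (F, N, C) tokens per frame; to_q/to_k/to_v: linear layers.
    For frame i, queries come from frame i, while keys/values come from the
    first frame and frame i-1 (frame 0 attends to itself)."""
    num_frames = frame_features.shape[0]
    outputs = []
    for i in range(num_frames):
        q = to_q(frame_features[i])                        # (N, C)
        prev = max(i - 1, 0)
        kv_source = torch.cat([frame_features[0], frame_features[prev]], dim=0)
        k = to_k(kv_source)                                # (2N, C)
        v = to_v(kv_source)
        attn = F.scaled_dot_product_attention(
            q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0))
        outputs.append(attn.squeeze(0))
    return torch.stack(outputs)                            # (F, N, C)
```

Anchoring every frame's keys and values to the first frame keeps global appearance stable, while the preceding frame supplies local motion context.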


This mechanism effectively maintains inter-frame consistency, significantly enhancing the quality of the generated video animations and better meeting the demands of high-quality video animation production.