Enhancing Sketch Animation: Text-to-Video Diffusion Models with Temporal Consistency and Rigidity Constraints

This paper proposes a novel text-to-sketch-animation method that leverages a pre-trained text-to-video diffusion model guided by Score Distillation Sampling (SDS) loss, while introducing length-area regularization for temporal consistency and an As-Rigid-As-Possible loss to preserve sketch topology, thereby outperforming state-of-the-art approaches in both quantitative and qualitative evaluations.

Gaurav Rai, Ojaswa Sharma

Published 2026-02-27

Imagine you have a simple, hand-drawn sketch of a horse on a piece of paper. Now, imagine you want that horse to gallop, but you don't want to spend hours drawing every single frame of the movement like a traditional animator. You just want to type "A galloping horse" into a computer, and poof—the sketch comes to life.

That is the goal of this paper, but the researchers found that previous attempts at doing this were a bit like trying to make a clay puppet dance: the movements were jerky, the horse's legs would stretch like rubber bands, or the whole shape would melt into a blob.

Here is how the authors fixed it, explained through simple analogies:

The Problem: The "Melting Puppet"

Previous AI methods tried to animate sketches by guessing how the lines should move. But they had two big issues:

  1. The "Jittery" Effect: The animation would look like a strobe light, where the horse's legs would jump from one spot to another without a smooth path.
  2. The "Rubber Band" Effect: As the horse moved, its body would stretch, squash, and twist until it looked nothing like the original drawing. The topology (the way the lines connect) would break.

The Solution: A Smart Puppeteer with Rules

The authors built a new system that acts like a very strict, highly skilled puppeteer. They didn't just let the AI guess; they gave it two specific "rules of the road" to follow.

1. The "Ruler and Spool" Rule (Length-Area Regularization)

The Analogy: Imagine your sketch is drawn with a piece of string. If you pull the string to make the horse move, the string shouldn't suddenly get longer or shorter, and it shouldn't leave a giant, messy trail of string behind it as it moves.

How it works:

  • Length: The AI checks that the "string" (the stroke) stays the same length from frame to frame. If a leg was 5 inches long in the first frame, it must be 5 inches long in the next. This stops the "rubber band" stretching.
  • Area: The AI also checks the "swept area." Imagine the leg moving from point A to point B. It shouldn't sweep out a huge, weird shape. It should move cleanly. This ensures the motion is smooth and not jerky.
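The two checks above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the function names, the `(frames, points, 2)` array layout, and the quadrilateral approximation of the swept area are all assumptions made for clarity.

```python
import numpy as np

def length_loss(frames):
    # frames: (T, N, 2) array of stroke control points per frame
    # (hypothetical layout). Compute per-segment lengths in every frame.
    seg = np.linalg.norm(np.diff(frames, axis=1), axis=-1)  # (T, N-1)
    # Penalize any deviation from the first frame's segment lengths,
    # which is what stops the "rubber band" stretching.
    return np.mean((seg - seg[0]) ** 2)

def swept_area_loss(frames):
    # Approximate the area each segment sweeps between consecutive
    # frames by the quadrilateral (p_i, p_{i+1}, q_{i+1}, q_i).
    total = 0.0
    for t in range(len(frames) - 1):
        p, q = frames[t], frames[t + 1]
        for i in range(p.shape[0] - 1):
            quad = np.array([p[i], p[i + 1], q[i + 1], q[i]])
            x, y = quad[:, 0], quad[:, 1]
            # Shoelace formula for the quadrilateral's area.
            total += 0.5 * abs(np.dot(x, np.roll(y, -1))
                               - np.dot(y, np.roll(x, -1)))
    return total / (len(frames) - 1)
```

A stroke that does not move at all scores zero on both terms; a stroke that jumps far between frames sweeps a large area and is penalized, which pushes the optimizer toward small, smooth per-frame motion.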

The Result: The animation flows like water, not like a glitching video game.

2. The "Stiff Skeleton" Rule (ARAP Loss)

The Analogy: Think of your sketch as a character made of stiff cardboard cutouts connected by hinges. When the character runs, the cardboard pieces (the body parts) can rotate and slide, but they cannot bend, warp, or turn into jelly.

How it works:

  • The system treats the sketch like a mesh (a net) of triangles.
  • It uses a mathematical concept called "As-Rigid-As-Possible" (ARAP). This tells the AI: "Move the character, but keep every little triangle in the net as close to its original shape as possible — it may rotate and translate, but not stretch or shear."
  • This prevents the horse's head from turning into a blob or its tail from twisting into a spiral.
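The ARAP idea can be sketched as a small energy function: for each triangle, find the rotation that best maps its rest shape onto its deformed shape (a Procrustes fit via SVD), and charge whatever distortion the rotation cannot explain. This is a minimal 2D sketch under those assumptions, not the paper's exact formulation.

```python
import numpy as np

def arap_energy(rest, deformed, triangles):
    # rest, deformed: (N, 2) vertex positions; triangles: index triples.
    energy = 0.0
    for tri in triangles:
        P = rest[tri] - rest[tri].mean(axis=0)         # rest shape, centred
        Q = deformed[tri] - deformed[tri].mean(axis=0) # deformed, centred
        U, _, Vt = np.linalg.svd(Q.T @ P)              # covariance SVD
        R = U @ Vt                                     # best-fit rotation
        if np.linalg.det(R) < 0:                       # disallow reflections
            U[:, -1] *= -1
            R = U @ Vt
        # Residual after removing the rigid part: pure distortion.
        energy += np.sum((Q - P @ R.T) ** 2)
    return energy
```

A triangle that is merely rotated or translated contributes zero energy; one that is stretched or sheared contributes a positive penalty, so minimizing this term keeps the mesh — and hence the sketch — close to its original shape.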

The Result: The sketch keeps its original identity. It moves, but it still looks exactly like the drawing you started with.

The Magic Ingredient: The "Dream Guide"

To make the horse actually run (and not just wiggle), the system uses a pre-trained "Dream Guide" (a Text-to-Video Diffusion Model).

  • You tell the AI: "Run!"
  • The Dream Guide says, "Okay, here is what running looks like in a video."
  • The AI then tries to make your sketch match that video, but it uses the Ruler and Stiff Skeleton rules to make sure the sketch doesn't break while trying to match the video.
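The guidance loop can be sketched as a toy Score Distillation Sampling (SDS) update. Everything here is a stand-in: `toy_denoiser` replaces the real text-to-video diffusion model (which would condition on a prompt embedding, not a target array), and the single fixed noise level is a simplification of the real noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(noisy, sigma, target):
    # Stand-in for the diffusion prior: predicts the noise that was added,
    # as if the "text prompt" were encoded by a target video. A real model
    # would take a prompt embedding and a timestep instead.
    return (noisy - target) / sigma

def sds_gradient(frames, target, sigma=0.5):
    # SDS in a nutshell: noise the current rendering, ask the prior what
    # noise it sees, and use (predicted - true) noise as the gradient.
    eps = rng.normal(size=frames.shape)
    noisy = frames + sigma * eps
    eps_hat = toy_denoiser(noisy, sigma, target)
    return eps_hat - eps

def optimize(frames, target, steps=200, lr=0.1):
    # In the paper's setting, the length-area and ARAP losses would be
    # added to this update so the sketch cannot break while it learns.
    for _ in range(steps):
        frames = frames - lr * sds_gradient(frames, target)
    return frames
```

With this toy prior the noise term cancels and the update simply pulls the frames toward the target, which is the intuition: the diffusion model "knows" what running looks like, and its predicted-noise error tells the sketch which way to move.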

The Outcome

The paper shows that this new method is the best at:

  • Keeping the sketch looking like a sketch (no melting blobs).
  • Making the movement smooth (no jittery jumps).
  • Listening to your text (if you say "dolphin jumping," it jumps; if you say "horse running," it runs).

The One Catch

Like any new invention, it's not perfect yet.

  • The "One-Actor" Limit: It works great for one object (a single horse or a single dancer). But if you draw a horse and a rider, the AI sometimes gets confused and separates them, making the rider float away from the horse. It struggles to understand how two objects interact with each other.

In a Nutshell

This paper teaches an AI how to animate a drawing without ruining the drawing. It does this by forcing the AI to respect the length of the lines (so they don't stretch) and the stiffness of the shape (so it doesn't melt), all while following a text description to create a smooth, realistic motion. It's like giving a clay puppet a rigid skeleton and a ruler, so it can dance without falling apart.
