Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance

This paper introduces Adaptive Low-Pass Guidance (ALG), a training-free method that enhances motion dynamics in image-to-video generation by filtering high-frequency details from the conditioning image during early denoising stages, thereby preventing the model from overfitting to static appearances while preserving image quality and text alignment.

June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee

Published 2026-02-25

The Problem: The "Frozen Frame" Effect

Imagine you have a magical photo frame (an Image-to-Video AI). You put a picture of a cat in it, and you ask the frame to "make the cat run."

Ideally, the cat should sprint across the screen. But in reality, these AI models often get stuck. The cat barely twitches. It looks like a high-quality photo that is just slightly vibrating, rather than a lively video.

The researchers figured out why this happens: the AI is, in effect, too obsessed with the details of the original photo.

  • The Analogy: Imagine a painter who is given a photo of a cat and told to paint a video of it running. If the painter looks too closely at the photo's tiny details (the exact whisker shape, the specific texture of the fur) right at the very first second of painting, they get "locked in." They spend so much time trying to match those tiny details perfectly that they forget to paint the movement. The result is a beautiful, static painting that never moves.

In technical terms, the AI gets "over-conditioned" by the high-frequency details (sharp edges, textures) of the input image. It takes a "shortcut" to copy the look of the image immediately, sacrificing the motion.
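
"High-frequency" here is meant literally, in the signal-processing sense. The snippet below is an illustration, not the paper's code (`cat.png` is a placeholder file name): it checks that a Gaussian blur acts as a low-pass filter by measuring how much of an image's spectral energy sits above a frequency cutoff.

```python
import numpy as np
from PIL import Image, ImageFilter

img = Image.open("cat.png").convert("L")
blurred = img.filter(ImageFilter.GaussianBlur(radius=8))

def high_freq_energy(pil_img, cutoff=0.25):
    """Share of spectral energy above `cutoff` of the maximum frequency."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(np.asarray(pil_img, float))))
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)  # distance from the DC component
    mask = radius > cutoff * min(h, w) / 2     # keep only high frequencies
    return spectrum[mask].sum() / spectrum.sum()

# The blurred image retains far less high-frequency energy than the original.
print(high_freq_energy(img), high_freq_energy(blurred))
```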

The Simple Fix (But with a Catch)

The researchers first tried a simple trick: blur the photo before showing it to the AI (sketched in code after the list below).

  • The Analogy: If you give the painter a blurry, low-resolution photo of the cat, they can't get stuck on the tiny whiskers. They have to focus on the big picture: "Okay, the cat is here, and it needs to run." Because the details are fuzzy, the painter is forced to create a dynamic, flowing motion.
  • The Catch: While the motion is great, the final video looks blurry and low-quality because the AI started with a blurry image. You get a dynamic video, but it doesn't look like the original cat anymore.
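
As a rough sketch of this blur-first trick (the `pipe` call stands in for whatever image-to-video pipeline you use; it is not the paper's code):

```python
from PIL import Image, ImageFilter

image = Image.open("cat.png")

# A Gaussian blur is a low-pass filter: it strips the fine whiskers and
# fur texture but keeps the coarse layout of the scene.
blurred = image.filter(ImageFilter.GaussianBlur(radius=8))

# Conditioning on the blurred image frees the model to generate motion,
# but every frame inherits the blur -- this is "the catch" above.
video = pipe(image=blurred, prompt="a cat running across the yard")
```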

The Solution: "Adaptive Low-Pass Guidance" (ALG)

The team came up with a clever, two-step strategy called ALG. Think of it as a Director who knows when to be strict and when to be loose.

Here is how ALG works, step-by-step (a code sketch follows the list):

  1. The "Blurred Start" (Early Steps):
    At the very beginning of the video generation (the first few seconds of "painting"), the AI is shown a blurred version of the input image.

    • Why? This stops the AI from getting obsessed with tiny details. It forces the AI to focus on the big motion: "The cat is running!" It builds a dynamic, fluid skeleton for the video.
  2. The "Sharp Finish" (Later Steps):
    Once the motion is established and the video is flowing, the AI is suddenly shown the original, sharp, high-quality photo.

    • Why? Now that the "running" action is already happening, the AI can safely add back all the sharp details (the fur, the whiskers, the eyes) without getting stuck. It refines the blurry motion into a crisp, high-definition video.
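
For readers who want to see the mechanics, here is a minimal sketch of that two-phase schedule, assuming a hypothetical I2V model interface: `encode_condition`, `denoise_step`, `decode`, and `latent_shape` are stand-ins, not the paper's actual code.

```python
import torch
from PIL import Image, ImageFilter

def alg_sample(model, image: Image.Image, prompt: str,
               num_steps: int = 50, blur_steps: int = 10,
               blur_radius: float = 8.0):
    """Adaptive Low-Pass Guidance as a plain sampling loop (illustrative)."""
    # Encode both a blurred and a sharp version of the conditioning image.
    blurred = image.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    blurred_cond = model.encode_condition(blurred)  # the "blurred start"
    sharp_cond = model.encode_condition(image)      # the "sharp finish"

    latents = torch.randn(model.latent_shape)
    for step in range(num_steps):
        # Early steps: low-pass-filtered condition, so the model commits
        # to large-scale motion instead of copying fine textures.
        # Later steps: the original condition, so fine detail is restored.
        cond = blurred_cond if step < blur_steps else sharp_cond
        latents = model.denoise_step(latents, cond, prompt, step)
    return model.decode(latents)
```

The only moving part is the switch from `blurred_cond` to `sharp_cond` partway through sampling, which is why the method needs no retraining.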

The Metaphor:
Imagine building a house.

  • Old Way (Standard AI): You try to lay every single brick perfectly while the foundation is still wet. The house ends up standing still, but the walls are crooked because you were too focused on the bricks.
  • The "Blur" Way: You build the whole house out of mud. It moves and flows great, but it's a muddy mess.
  • The ALG Way: You first build a rough, fast-moving mud structure to get the shape and flow right (the motion). Then, once the shape is solid, you swap the mud for perfect, sharp bricks (the details). You get a house that is both dynamic and beautiful.

The Results

The researchers tested this on several popular AI video models (like Wan 2.1, Wan 2.2, and LTX-Video).

  • Motion Boost: The videos became 33% more dynamic. Animals ran faster, cars drove more naturally, and scenes felt alive.
  • Quality Preserved: Unlike the "blurry start" method, the final videos were just as sharp and high-quality as the originals.
  • No Training Needed: The best part? They didn't have to retrain the AI at all. They just changed the "rules" of how the AI looks at the photo during the generation process. It's a free upgrade!

Summary

The paper solves the problem of AI videos being too static by teaching the AI to ignore the tiny details at the start (to encourage movement) and add them back at the end (to ensure quality). It's like telling a dancer: "First, just get the rhythm and the big moves right. Don't worry about your shoes yet. Once you're moving, we'll polish the shoes."
