Imagine you are a film director trying to fill in the missing scenes of a movie. You have the first frame (a car starting at a traffic light) and the last frame (the same car parked at a destination). Your job is to generate all the smooth, logical frames in between so the car drives naturally from A to B.
This is the challenge of Generative Inbetweening.
The Problem: Two Directors, One Script
In the past, AI models tried to solve this by asking two different "directors" to work on the movie simultaneously:
- Director A looks at the Start Frame and imagines, "Okay, the car is moving forward."
- Director B looks at the End Frame and tries to imagine, "Okay, how did the car get here?"
The Conflict:
Here is the catch: AI video models are trained to predict the future. They are great at saying, "If the car is here, it will go there." But they are terrible at looking backward. When Director B tries to work backward from the End Frame, the AI gets confused. Instead of thinking, "The car came from the left," it often thinks, "The car is going to the left."
This creates a Motion Prior Conflict.
- Director A says: "Drive forward!"
- Director B says: "Drive backward!"
When you try to blend their ideas, the result is a glitchy mess: the car might flicker, ghost, or suddenly reverse direction mid-scene. It's like trying to walk forward while someone else pulls you backward; you end up stumbling in place.
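The "stumbling in place" intuition can be shown with a toy numeric sketch. The numbers and the simple averaging below are illustrative assumptions, not the paper's actual fusion rule:

```python
# Toy illustration (made-up displacements): fusing a forward pass with
# a backward pass whose motion prior points the wrong way.

# Director A, conditioned on the start frame, predicts the car moves
# +1 unit per frame (drive forward).
forward_motion = [+1.0] * 8

# Director B is conditioned on the end frame, but the model was only
# trained to predict *future* motion. Its guess, mapped back onto the
# timeline, effectively says "drive backward": -1 unit per frame.
backward_motion = [-1.0] * 8

# Naively blending (here, averaging) the two trajectories cancels
# the motion entirely:
blended = [(f + b) / 2 for f, b in zip(forward_motion, backward_motion)]
print(blended)  # every frame gets 0.0 displacement: stumbling in place
```

In a real diffusion model the blend happens in latent space rather than on raw displacements, but the cancellation effect is the same.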
The Solution: Motion Prior Distillation (MPD)
The authors of this paper propose a clever fix called Motion Prior Distillation.
Think of it like this: instead of letting the two directors argue, you take away Director B's imagination entirely and hand them Director A's script, played in reverse.
Here is how it works in simple steps:
- Watch the Forward Path: First, the AI generates a rough draft of the video moving from Start to End. It captures the "motion energy" or "residual" of that draft: the frame-to-frame difference between where the car is now and where it was a split second ago.
- Distill the Motion: The AI takes this "motion energy" from the forward path and distills it (like extracting essential oil) into the backward path.
- The Magic Trick: When the AI tries to generate the backward path (from End to Start), it doesn't ask the End Frame, "Where did you come from?" Instead, it says, "I know exactly how the car moved forward, so I will just reverse that specific movement to get back to the start."
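The three steps above can be sketched in code. Everything here is a hedged illustration: the function names are invented, and the "model" is replaced by simple interpolation, since the real system is a video diffusion model operating on latents:

```python
import numpy as np

def forward_draft(start, end, num_frames):
    """Stand-in for the model's forward generation pass. Here we just
    interpolate; in practice a video diffusion model produces this."""
    return np.linspace(start, end, num_frames)

def motion_residuals(frames):
    """The 'motion energy': frame-to-frame differences along the clip."""
    return np.diff(frames, axis=0)

def backward_with_distilled_prior(end, residuals):
    """Generate End -> Start by replaying the forward residuals in
    reverse, instead of asking the model to imagine backward motion."""
    frames = [end]
    for step in residuals[::-1]:          # walk the forward path backward
        frames.append(frames[-1] - step)  # undo each forward step
    return np.stack(frames)

start, end, n = np.array([0.0]), np.array([7.0]), 8
fwd = forward_draft(start, end, n)
bwd = backward_with_distilled_prior(end, motion_residuals(fwd))

# The backward clip retraces the forward clip exactly, so blending the
# two passes cannot introduce conflicting motion.
print(np.allclose(bwd[::-1], fwd))  # True
```

The design point is that the backward pass never invents its own motion; it only consumes the residuals distilled from the forward pass, reversed step by step.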
By doing this, the AI stops guessing the backward motion and simply reverses the forward motion. This ensures that the car doesn't suddenly decide to drive in the opposite direction. It creates a single, coherent story where the car drives smoothly from A to B without any ghosting or confusion.
Why is this better?
- No More Ghosting: In previous methods, the car might look like two cars overlapping because the two directors couldn't agree. With MPD, there is only one director's logic, so the car looks solid and real.
- No Extra Training: Usually, to fix these AI glitches, you have to retrain the whole model, which takes weeks and massive computers. This method is a "plug-and-play" trick. You don't need to teach the AI anything new; you just change how it thinks during the generation process.
- Smoother Movies: The result is a video where the motion feels natural, continuous, and physically plausible, even if the start and end frames are far apart in time.
The Analogy: The Hiker and the Map
Imagine you are hiking from a trailhead (Start) to a mountain peak (End).
- Old Method: You ask a guide at the trailhead, "How do I get to the peak?" and a guide at the peak, "How would I have gotten here?" The peak guide, having never hiked up, might accidentally point you back down the mountain or in a circle. You get lost.
- New Method (MPD): You ask the trailhead guide for the path. Then, you take that path and simply walk it backward. You know exactly where every step goes because you are just retracing the steps you already planned. You arrive at the peak perfectly, and if you turn around, you know exactly how to get back down without getting lost.
Summary
The paper introduces a smart, training-free trick to fix glitchy AI videos. By realizing that "looking backward" confuses the AI, they simply take the "forward" motion, distill its essence, and force the backward generation to follow that exact path in reverse. The result? Smooth, realistic videos that don't look like a broken VCR tape.