Velocity Disambiguation for Video Frame Interpolation

The Problem: The "Blurry Middle" Mystery

Imagine you are watching a video of a baseball being thrown from a pitcher to a catcher.

Frame A: The ball is in the pitcher's hand.
Frame B: The ball is in the catcher's mitt.

Now, imagine you want to create a "slow-motion" video by inserting a new frame right in the middle (Frame C).

The Old Way (Time Indexing):
The computer is told: "Hey, create a frame that happens exactly halfway in time between the start and the end."

The problem? The computer doesn't know how the ball moved.

Did the ball fly at a constant speed? (It would be right in the middle).
Did the ball start slow and speed up? (It would be closer to the pitcher).
Did the ball start fast and slow down? (It would be closer to the catcher).
Did it curve? (It could be anywhere).

Because the computer doesn't know the speed or direction, it tries to be safe. It guesses the ball is somewhere in the middle, but since it's unsure, it averages all the possibilities. The result? A blurry, ghost-like ball that looks like a smear. It's like trying to draw a picture of a car moving by painting every possible spot it might have been in, resulting in a fuzzy mess.

The Solution: "Distance Indexing" (The Ruler Approach)

The authors propose a smarter way to talk to the computer. Instead of asking, "Where is the ball at 50% of the time?", they ask: "Where is the ball at 50% of the distance?"

The Analogy:
Think of the ball's path as a road trip from New York to Los Angeles.

Time Indexing is like saying, "Stop the car exactly halfway through the total driving time." (But we don't know if the car was stuck in traffic or speeding on the highway, so we don't know where it is).
Distance Indexing is like saying, "Stop the car exactly halfway across the country."

By giving the computer a "distance map" (a ruler measuring how far the object has traveled), the computer no longer has to guess the speed. It knows exactly where the object should be based on how far it has gone. This removes the guesswork and results in a crisp, sharp image of the ball.

The Second Problem: The "Which Way?" Confusion

Even with the distance ruler, there's still a tiny problem. If the ball is halfway across the country, did it go straight there, or did it take a detour through the mountains?

If the computer guesses the wrong path, the image is still a little blurry.

The Fix: The "Step-by-Step" Strategy
Instead of trying to jump from New York to LA in one giant leap, the computer breaks the trip into small, manageable steps.

First, it figures out where the ball is at 25% of the distance.
Then, it uses that new, clear image as a reference to figure out where the ball is at 50%.
It keeps doing this, taking small, confident steps rather than one giant, confused leap.

This is called Iterative Reference-Based Estimation. It's like walking across a dark room by feeling the wall step-by-step, rather than trying to guess the whole path in the dark.

The Superpower: Editing Reality

Because the computer now understands "distance" instead of just "time," we can do something magical: We can control individual objects.

Imagine a video of a person walking a dog.

Old Way: You can only slow down the whole video. Both the person and the dog slow down together.
New Way: You can tell the computer, "Keep the person moving at normal speed, but make the dog walk backward in time!"

You can draw a mask around the dog and tell it to travel a different "distance curve" than the person. This allows for incredible video editing tricks, like making a car drive backward while the background moves forward, or making a falling apple hover in mid-air.

The "Multi-Frame" Upgrade (The Detective)

Sometimes, just looking at the start and end frames isn't enough to know the exact path. The authors added a feature where the computer can peek at frames before the start and after the end.

The Analogy:
If you are trying to guess the path of a car, looking at just the start and end points is hard. But if you can also see the car 1 second before it started and 1 second after it finished, you can see its acceleration and direction much better. This "Multi-Frame Refiner" acts like a detective gathering more clues to draw a perfect, sharp picture.

Summary

The Problem: Computers make blurry videos because they guess the speed of moving objects.
The Fix: Instead of telling the computer a "time," we tell it a "distance." This makes the images sharp.
The Boost: We break big jumps into small steps to fix any remaining confusion about direction.
The Magic: This lets us edit videos by moving individual objects (like a dog or a car) independently of the rest of the scene.

The result is slow-motion video that looks sharper and more realistic and can be edited object by object, without needing a high-speed camera or a redesigned interpolation model.

1. Problem Statement: Velocity Ambiguity

The core problem addressed is velocity ambiguity in Video Frame Interpolation (VFI), particularly in arbitrary-time interpolation (generating frames at any $t \in [0, 1]$).

The Limitation of Time Indexing: Existing methods typically use "time indexing," where the model takes two frames ($I_0, I_1$) and a scalar time variable $t$ as input to predict an intermediate frame $I_t$.
The One-to-Many Mapping: Given only start and end frames, there are infinitely many trajectories an object could follow between them, so its location at time $t$ is not determined. An object could be accelerating, decelerating, or moving along a curved path.
The Consequence: During training, the model receives identical inputs ($I_0, I_1, t$) but must predict different outputs depending on the hidden ground-truth motion. This forces the network to learn a "one-to-many" mapping. Consequently, the model tends to converge to a weighted average of all possible outcomes, resulting in blurry, imprecise frames (mode averaging) rather than sharp, distinct motion.
Directional Ambiguity: Even if speed is known, the direction of motion for long-range interpolation (e.g., $t=0.5$ ) remains ambiguous without additional constraints.
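
The averaging effect described above can be reproduced with a toy example. The sketch below is not from the paper: it assumes a one-dimensional "frame" with a single bright pixel, three made-up candidate positions for the middle frame, and a squared-error loss. Under those assumptions, the best single prediction for all three targets is their pixel-wise mean, which is a smear rather than a ball.

```python
import numpy as np

def render_ball(position, width=32):
    """Render a 1-D 'frame' with a single bright pixel at `position`."""
    frame = np.zeros(width)
    frame[position] = 1.0
    return frame

# Same inputs (I_0, I_1, t = 0.5), but three different hidden motions put the
# ball at three different places in the ground-truth middle frame.
# The positions are illustrative values, not measurements.
possible_positions = [8, 16, 24]  # speeding up, constant speed, slowing down
targets = np.stack([render_ball(p) for p in possible_positions])

# For a squared-error loss, the single prediction that minimizes the average
# loss over all targets is their pixel-wise mean.
best_l2_prediction = targets.mean(axis=0)

def mean_squared_error(prediction):
    """Average squared error of one prediction against every possible target."""
    return np.mean((targets - prediction) ** 2)

blurry_loss = mean_squared_error(best_l2_prediction)
sharp_guess_loss = mean_squared_error(render_ball(16))
print(f"blurry mean: {blurry_loss:.4f}, sharp guess: {sharp_guess_loss:.4f}")
```

The blurry mean scores a lower loss than committing to any one sharp position, which is why a network trained this way has no incentive to produce a crisp ball.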

2. Methodology

The authors propose a plug-and-play framework to resolve this ambiguity without requiring architectural changes to existing VFI models. The solution consists of three main components:

A. Distance Indexing (Disambiguating Speed)

Instead of providing a scalar time $t$, the authors introduce a Distance Indexing Map ($D_t$).

Concept: $D_t(x, y)$ represents the ratio of the distance an object at pixel $(x, y)$ has traveled relative to the total distance between $I_0$ and $I_1$.
Training: $D_t$ is derived from ground-truth optical flow ratios: $D_t = \frac{V_{0 \to t} \cdot V_{0 \to 1}}{\|V_{0 \to 1}\|^2}$, where $V_{0 \to t}$ and $V_{0 \to 1}$ are the optical flows from $I_0$ to $I_t$ and from $I_0$ to $I_1$. This transforms the problem from a one-to-many time-to-location mapping into a deterministic one-to-one distance-to-location mapping.
Inference: Since ground-truth flow is unavailable at inference, a uniform map ($D_t(x, y) = t$) is used. This assumes constant velocity, which is a valid approximation for many real-world scenarios and significantly reduces uncertainty compared to raw time indexing.
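
The projection formula above can be written in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function names, the (H, W, 2) flow layout, and the `eps` guard for static pixels are all assumptions made for illustration.

```python
import numpy as np

def distance_index_map(flow_0_to_t, flow_0_to_1, eps=1e-8):
    """Per-pixel D_t = (V_0t . V_01) / ||V_01||^2.

    Both flows are assumed to have shape (H, W, 2). The eps guard (an
    assumption of this sketch) maps static pixels to 0 instead of dividing
    by zero.
    """
    dot = np.sum(flow_0_to_t * flow_0_to_1, axis=-1)
    norm_sq = np.sum(flow_0_to_1 ** 2, axis=-1)
    return dot / np.maximum(norm_sq, eps)

def uniform_distance_map(t, height, width):
    """Inference-time fallback: D_t(x, y) = t for every pixel."""
    return np.full((height, width), t, dtype=np.float64)

# Toy flows: every pixel moves 8 px to the right between I_0 and I_1, but has
# only covered 2 px by the intermediate frame (the object is speeding up).
flow_0_to_1 = np.zeros((4, 4, 2))
flow_0_to_1[..., 0] = 8.0
flow_0_to_t = np.zeros((4, 4, 2))
flow_0_to_t[..., 0] = 2.0

training_map = distance_index_map(flow_0_to_t, flow_0_to_1)
inference_map = uniform_distance_map(0.5, 4, 4)
```

In this toy case the training map is 0.25 everywhere, even if the frame was captured at $t = 0.5$, which is exactly the information a scalar time index cannot carry.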

B. Iterative Reference-Based Estimation (Disambiguating Direction)

To address directional ambiguity (especially for large time gaps), the authors propose breaking long-range predictions into short-range steps.

Strategy: Instead of predicting $I_t$ directly from $I_0$ and $I_1$, the model predicts intermediate frames iteratively.
Mechanism: The model takes the start/end frames, the target distance map, and a reference frame ($I_{ref}$) with its corresponding distance map ($D_{ref}$) as input.
- Example: To predict $I_t$, first predict $I_{t/2}$ using $I_0, I_1, D_{t/2}$. Then predict $I_t$ using $I_0, I_1, D_t$ and the newly generated $I_{t/2}$ as a reference.
Benefit: This "divide-and-conquer" approach constrains the search space at each step, minimizing directional uncertainty and preventing error accumulation.
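
The loop described above might look like the following sketch. The `model` callable and its argument order are hypothetical, uniform distance maps are used for simplicity, and starting from $I_0$ as the first reference (with an all-zero distance map) is an assumption of this sketch. The `toy_model` is a plain cross-fade that stands in for a trained network so the loop can run.

```python
import numpy as np

def interpolate_iteratively(model, frame_0, frame_1, t, num_steps=2):
    """Reach the target index t through num_steps shorter predictions.

    `model` is a hypothetical callable with the signature
        model(frame_0, frame_1, target_map, ref_frame, ref_map) -> frame
    """
    height, width = frame_0.shape[:2]
    # Assumption: the first step uses I_0 as its reference (distance 0).
    ref_frame = frame_0
    ref_map = np.zeros((height, width))
    for step in range(1, num_steps + 1):
        # Uniform target map for this step: t/num_steps, 2t/num_steps, ..., t.
        target_map = np.full((height, width), t * step / num_steps)
        ref_frame = model(frame_0, frame_1, target_map, ref_frame, ref_map)
        ref_map = target_map  # the new frame becomes the next reference
    return ref_frame

# Toy stand-in for a trained network: a cross-fade that ignores the reference
# and records which distance index it was asked for.
requested_indices = []

def toy_model(frame_0, frame_1, target_map, ref_frame, ref_map):
    requested_indices.append(float(target_map[0, 0]))
    return (1.0 - target_map) * frame_0 + target_map * frame_1

frame_0 = np.zeros((4, 4))
frame_1 = np.ones((4, 4))
result = interpolate_iteratively(toy_model, frame_0, frame_1, t=0.5, num_steps=2)
```

With `num_steps=2` the model is called for index 0.25 and then 0.5, matching the $I_{t/2}$ then $I_t$ example. Each extra step costs one more forward pass.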

C. Multi-Frame Fusion & Continuous Map Estimation

To further enhance performance and enable pixel-wise accurate interpolation:

Continuous Map Estimator: Using four input frames ($I_{-1}, I_0, I_1, I_2$), the authors employ a Cubic B-spline and Neural ODE (inspired by CPFlow) to estimate a dense, pixel-wise continuous distance map. This allows for non-uniform motion modeling.
Multi-Frame Refiner: A trainable copy of the original VFI network is used to refine the initial two-frame interpolation by fusing information from the additional neighboring frames ($I_{-1}, I_2$).
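
To see why two extra frames help, consider a much simpler stand-in for the estimator: fitting an ordinary cubic polynomial through one point's position in the four frames. This is not the Cubic B-spline and Neural ODE estimator described above. It is a one-dimensional illustration, and the function name and sample positions are made up.

```python
import numpy as np

def distance_ratio_from_four_positions(positions, t):
    """Estimate the distance ratio at time t from positions at times -1, 0, 1, 2.

    Fits the unique cubic through the four samples and measures how far the
    point has moved from its t=0 position, relative to the full move from
    t=0 to t=1. Assumes the point actually moves between t=0 and t=1.
    """
    times = np.array([-1.0, 0.0, 1.0, 2.0])
    coeffs = np.polyfit(times, positions, deg=3)
    p0, p1, pt = np.polyval(coeffs, [0.0, 1.0, t])
    return (pt - p0) / (p1 - p0)

# A point under constant acceleration, x(tau) = (tau + 1)^2, sampled at the
# four frame times. Halfway in time it has covered less than half the distance.
accelerating = np.array([0.0, 1.0, 4.0, 9.0])
ratio_accelerating = distance_ratio_from_four_positions(accelerating, t=0.5)

# A point moving at constant velocity recovers the uniform map value.
constant = np.array([0.0, 1.0, 2.0, 3.0])
ratio_constant = distance_ratio_from_four_positions(constant, t=0.5)
```

The accelerating point yields a ratio of about 0.417 instead of 0.5, which is the kind of non-uniform value a uniform map cannot express.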

D. Manipulated Interpolation of Anything

By combining the distance indexing with segmentation models (like SAM), users can manually specify different distance curves for different objects. This enables unique video editing tasks, such as making specific objects move backward in time or change speed independently of the background.
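
A per-object distance map can be assembled from a mask and two distance curves. The sketch below uses a hand-made boolean mask in place of a segmentation model's output, and the function names and the reversed curve are illustrative assumptions rather than the paper's interface.

```python
import numpy as np

def compose_distance_map(object_mask, object_distance, background_distance):
    """Use one distance index inside the mask and another everywhere else."""
    return np.where(object_mask, object_distance, background_distance)

def forward_curve(t):
    """Background: ordinary playback, distance index equals time."""
    return t

def reversed_curve(t):
    """Masked object: travels its path in reverse as time advances."""
    return 1.0 - t

# Hand-made 4x4 mask standing in for a segmentation result (e.g. the dog).
dog_mask = np.zeros((4, 4), dtype=bool)
dog_mask[1:3, 1:3] = True

t = 0.25
edited_map = compose_distance_map(dog_mask, reversed_curve(t), forward_curve(t))
```

At `t = 0.25` the background is a quarter of the way along its motion while the masked region is three quarters of the way along, so playing `t` from 0 to 1 moves the two in opposite directions.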

3. Key Contributions

Distance Indexing Paradigm: A novel input representation that replaces scalar time with a distance ratio map, effectively resolving speed ambiguity and transforming the learning task from one-to-many to one-to-one.
Iterative Reference-Based Estimation: A strategy to resolve directional ambiguity by decomposing long-range motion into short, deterministic steps using intermediate references.
Plug-and-Play Compatibility: The methods require only input channel modifications, allowing seamless integration into state-of-the-art models (e.g., RIFE, IFRNet, AMT, EMA-VFI) without redesigning the core architecture.
Advanced Multi-Frame Extension: A continuous distance map estimator and a multi-frame refiner architecture that leverage additional frames for pixel-aligned, high-fidelity interpolation.
Novel Editing Capability: The ability to manipulate individual object trajectories via custom distance maps, enabling "Manipulated Interpolation of Anything."
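
The "input channel modification" in the list above can be pictured as swapping a constant time plane for a distance map when the network input is assembled. How each real model consumes its time input differs, so the channel-first layout and both helper functions below are assumptions made purely for illustration.

```python
import numpy as np

def build_time_indexed_input(frame_0, frame_1, t):
    """Baseline: two RGB frames plus a constant plane filled with t."""
    _, height, width = frame_0.shape
    t_plane = np.full((1, height, width), t)
    return np.concatenate([frame_0, frame_1, t_plane], axis=0)

def build_distance_indexed_input(frame_0, frame_1, distance_map):
    """Distance indexing: the constant plane is replaced by a per-pixel map."""
    return np.concatenate([frame_0, frame_1, distance_map[None]], axis=0)

# Frames are assumed to be channel-first (3, H, W) arrays.
frame_0 = np.zeros((3, 8, 8))
frame_1 = np.ones((3, 8, 8))
distance_map = np.full((8, 8), 0.3)

time_input = build_time_indexed_input(frame_0, frame_1, 0.5)
distance_input = build_distance_indexed_input(frame_0, frame_1, distance_map)
```

Both inputs have the same shape, which is why the rest of the network can stay as it is in this simplified picture.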

4. Experimental Results

The authors evaluated their approach on standard benchmarks (Vimeo90K, X4K1000FPS, Adobe240) across multiple SOTA models.

Qualitative Improvements: Visual results show significantly sharper frames with reduced blur and artifacts compared to time-indexed baselines.
Perceptual Metrics: The proposed methods ([D] and [D, R]) consistently outperform baselines on perceptual metrics like LPIPS and NIQE, indicating higher visual quality.
Pixel-Centric Metrics: While using uniform distance maps at inference leads to slight misalignment with ground truth (lowering PSNR/SSIM slightly in some cases), the perceptual quality is superior. When using estimated dense maps with multi-frame inputs, both perceptual and pixel metrics improve significantly.
User Study: In a study with 30 participants, the combined method (Distance Indexing + Iterative Reference) was ranked as the best in terms of perceived quality.
Generalization: The strategy improves performance across diverse motion patterns (acceleration, deceleration, constant speed) and works effectively on diffusion-based models (LDMVFI) and transformer-based models (VFI-Transformer).
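
The gap between pixel-centric and perceptual quality noted above can be shown with a toy signal. The values below are invented for illustration and are not results from the paper: a sharp prediction that is two pixels off is compared against a blurry prediction spread over five pixels.

```python
import numpy as np

def psnr(prediction, target, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = np.mean((prediction - target) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

width = 32

# Ground truth: one bright pixel at index 10.
ground_truth = np.zeros(width)
ground_truth[10] = 1.0

# Sharp prediction, misaligned by two pixels.
sharp_but_shifted = np.zeros(width)
sharp_but_shifted[12] = 1.0

# Blurry prediction: the same brightness smeared over indices 8..12.
blurry_average = np.zeros(width)
blurry_average[8:13] = 0.2

psnr_sharp = psnr(sharp_but_shifted, ground_truth)
psnr_blurry = psnr(blurry_average, ground_truth)
print(f"sharp but shifted: {psnr_sharp:.2f} dB, blurry: {psnr_blurry:.2f} dB")
```

The smear earns the higher PSNR even though the shifted spike looks like the real object, which is consistent with reporting perceptual metrics alongside PSNR and SSIM.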

5. Significance

This paper proposes shifting Video Frame Interpolation from time-based indexing to distance-based indexing.

Theoretical Impact: It identifies velocity ambiguity as a primary bottleneck in learning-based VFI and provides a mathematical and practical solution to resolve it.
Practical Impact: The "plug-and-play" nature allows immediate adoption by existing models, offering a low-cost, high-reward upgrade for video generation, slow-motion creation, and compression.
Creative Impact: The ability to decouple object motion from global time indexing opens new avenues for video editing, allowing for granular control over individual object dynamics within a scene.

In summary, the paper demonstrates that by explicitly telling the network how far each object has traveled, rather than only the time at which the frame occurs, VFI models can overcome inherent ambiguities to produce sharper, more accurate, and controllable video interpolations.