Imagine you are watching a movie, but the director only gave you the first frame (a character standing still) and the last frame (the character sitting down). Your job is to fill in the missing scenes in between so the movement looks smooth.
For a long time, video AI tools could only do this in a very rigid way: "Give me exactly 5 frames to fill the gap." If you wanted 10 frames, or 3, or 100, the tools would break or give you a choppy result. It was like trying to fit a specific number of bricks into a gap; if the gap size changed, you had to build a whole new wall.
This paper introduces ArbInterp, a new system that changes the rules. Think of it as a time-machine painter that can fill in any number of in-between frames, at any moments in time, and keep the result smooth.
Here is how it works, broken down into simple concepts:
1. The "Time-Map" Problem (TaRoPE)
The Old Way: Imagine a train where every car is numbered 1, 2, 3, 4. The AI knows "Car 3" is always in the middle. If you ask for a car between 1 and 2, the AI gets confused because there is no "Car 1.5." It only understands whole numbers.
The New Way (ArbInterp): The authors gave the AI a continuous time-map. Instead of car numbers, they use a clock.
- The start frame is at time 0.0.
- The end frame is at time 1.0.
- Now, the AI can be asked to paint a frame for 0.23, 0.55, or 0.99.
They achieved this using a clever trick called TaRoPE (Timestamp-aware Rotary Position Embedding). Think of this as giving every frame a GPS coordinate on a timeline rather than a seat number. This allows the AI to understand that "0.5" is exactly halfway, regardless of how many total frames you want to generate. It's like telling a chef, "Cook the soup for exactly 4 minutes and 12 seconds," instead of "Cook it for 4 minutes."
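The continuous-clock idea can be sketched in code. This is a minimal illustration of rotary position embedding driven by a fractional timestamp rather than an integer index; the paper's exact TaRoPE formulation may differ, and `rope_rotation` is a name chosen here for illustration.

```python
import numpy as np

def rope_rotation(x, t, base=10000.0):
    """Rotate a feature vector x by angles proportional to a continuous
    timestamp t in [0, 1] -- the core idea behind a timestamp-aware RoPE.
    Illustrative sketch only; not the paper's exact formulation."""
    d = x.shape[-1]
    # One frequency per feature pair, as in standard RoPE.
    freqs = 1.0 / (base ** (np.arange(0, d, 2) / d))
    angles = t * freqs                    # continuous t, not an integer seat number
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# t = 0.5 means "exactly halfway," no matter how many frames are requested.
frame_feature = np.ones(8)
halfway = rope_rotation(frame_feature, t=0.5)
```

Because the rotation depends only on the real-valued timestamp, asking for 3 frames or 100 frames just means evaluating it at different points on the same clock.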
2. The "Long Movie" Problem (Segmenting)
The Challenge: What if you want to interpolate a whole hour-long video? You can't ask the AI to paint 3,600 frames in one go; it would get overwhelmed and the end of the video would look nothing like the beginning (the character's shirt might change color, or the background might shift).
The Solution: ArbInterp breaks the long video into small chapters.
- It paints the first 10 seconds.
- Then it paints the next 10 seconds.
- And so on.
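The chapter-splitting above can be sketched as a simple planning step. This is an assumption-laden sketch: the segment length and the rule of reusing one boundary frame per chapter are illustrative choices, not the paper's reported settings.

```python
def plan_segments(n_frames, seg_len):
    """Split a long interpolation request into chapters, where each chapter
    starts on the previous chapter's final frame. Illustrative sketch;
    segment length is an assumption, not the paper's setting."""
    segments = []
    start = 0
    while start < n_frames - 1:
        end = min(start + seg_len, n_frames - 1)
        segments.append((start, end))
        start = end  # next chapter begins on the frame that ended this one
    return segments

# 3,600 frames painted 240 at a time, each chapter anchored to the last.
chapters = plan_segments(3600, 240)
```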
But here's the tricky part: How do you make sure the end of Chapter 1 matches the start of Chapter 2 perfectly? If you just stitch them together, it might look like a jump cut.
3. The "Appearance vs. Motion" Trick
To solve the stitching problem, the authors invented a Decoupling Strategy. Imagine you are directing a play with two different actors playing the same character in two different acts.
- Appearance (The Costume): To make sure the character looks the same, the AI takes the very last frame of the previous chapter and uses it as a "ghost guide" for the next chapter. It says, "Hey, make sure the shirt and face look exactly like this."
- Motion (The Dance): To make sure the movement is smooth, the AI doesn't just look at the picture; it extracts the "dance moves" (motion tokens) from the previous scene. It tells the next scene, "Keep dancing the same way you were just dancing."
By separating the look (appearance) from the movement (motion), the AI can stitch long videos together seamlessly, like a perfect relay race where the baton is passed without dropping it.
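The relay-race loop can be sketched as follows. Here `generate_segment` and `extract_motion` are hypothetical stand-ins for the model's components; the sketch only shows how the two batons (last frame for appearance, motion tokens for movement) are passed between chapters.

```python
def interpolate_long(first_frame, last_frame, segments,
                     generate_segment, extract_motion):
    """Stitch per-segment generations by passing two batons between chapters:
    the previous chapter's final frame (appearance guide) and its motion
    tokens (movement guide). `generate_segment` and `extract_motion` are
    hypothetical interfaces, not the paper's actual API."""
    video = [first_frame]
    anchor = first_frame   # appearance baton: "look exactly like this"
    motion = None          # motion baton: "keep moving the same way"
    for i, (t0, t1) in enumerate(segments):
        # Only the final chapter is pinned to the user-supplied end frame.
        target = last_frame if i == len(segments) - 1 else None
        clip = generate_segment(anchor, target, t0, t1, motion)
        video.extend(clip)
        anchor = clip[-1]               # pass the appearance baton
        motion = extract_motion(clip)   # pass the motion baton
    return video
```

Keeping the two batons separate is the point of the decoupling: the appearance guide stops the shirt from changing color between chapters, while the motion guide stops the movement from stuttering at the seam.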
Why This Matters
Before this, if you wanted to slow down a video or speed it up, you were stuck with the options the software gave you. You couldn't just say, "I need a frame right here."
ArbInterp is like giving you a slider instead of a set of buttons.
- Want to turn a 1-second clip into a 2-second slow-motion? Done.
- Want to turn it into a 10-second dreamy slow-mo? Done.
- Want to insert a frame at a weird, specific moment in time? Done.
It makes video creation much more flexible, allowing creators to control the "flow of time" in their videos with the precision of a surgeon, rather than the guesswork of a gambler.
In a Nutshell
The paper presents a system that treats video time as a continuous river rather than a series of stepping stones. By giving the AI a precise clock (TaRoPE) and a smart way to pass the baton between scenes (Appearance-Motion Decoupling), it can generate smooth, high-quality video frames for any duration and any speed, solving a problem that has limited video AI for years.