RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

Imagine you have a home video of your family at the beach, and you want to change the style. Maybe you want it to look like a Chinese ink painting, or perhaps you want to remove your uncle who accidentally walked into the frame, or even turn the ocean into a Van Gogh masterpiece.

Doing this frame-by-frame is easy for a computer, but doing it for a whole video without the result looking like a jittery, glitchy mess is incredibly hard. This is the problem RFDM (Residual Flow Diffusion Model) solves.

Here is the story of how RFDM works, explained without the jargon.

The Problem: The "Photocopier" vs. The "Storyteller"

Imagine you have a magic photocopier (an Image-to-Image model) that can turn a photo of a beach into a painting.

The Naive Approach: If you want to edit a 10-second video, you could just run every single frame through this photocopier one by one.
- The Result: The first frame looks great. The second frame looks great. But because the photocopier is a bit "random" (like a dice roll), the third frame might look slightly different in style than the second. When you play them back, the video shakes and jitters like a nervous dancer. The style doesn't flow; it flickers.
The Old "Smart" Approach: Some previous methods tried to fix this by looking at the entire video at once to make sure it's smooth.
- The Result: It looks smooth, but it's like trying to carry a 500-pound piano up a flight of stairs. It requires massive computing power and memory (RAM) and takes forever. You can't do this on a phone or for a live stream.

The Solution: The "Residual Flow" Detective

RFDM is like a clever detective who solves the video editing mystery by looking at what changed, not the whole picture every time.

Here is the analogy:

1. The "Residual" Trick (The "What's New?" Game)

Imagine you are painting a mural.

The Old Way: For every new frame, you repaint the entire wall from scratch, even the parts that didn't change. This is slow and prone to mistakes because you might paint the sky slightly blue in one frame and slightly purple in the next.
The RFDM Way: RFDM looks at the previous frame and asks, "What is different here?"
- If the background (the sky) stayed the same, it says, "Okay, I'll just copy that part exactly."
- If the prompt says "Make the water Van Gogh style," it only focuses on painting the water.
- It treats the video as a series of small changes (residuals) rather than whole new pictures. This is why it's called Residual Flow. It flows from one frame to the next, only updating the parts that need changing.

2. The "Causal" Chain (The Storyteller)

RFDM is causal, meaning it tells the story in order, from start to finish.

It edits Frame 1.
Then, it takes the result of Frame 1 and uses it as a guide to edit Frame 2.
Then it takes Frame 2 to edit Frame 3.
It's like a game of Telephone, but instead of the message getting garbled, the AI uses the previous frame to keep the style consistent. Because it only looks at the past (not the future), it can edit a video of any length as the frames arrive, without needing to load the whole movie into memory first.

Why is this a Big Deal?

The paper compares RFDM to two other methods:

The Jittery Photocopier (I2I): Fast, but the video shakes.
The Heavy Hauler (3D Models): Smooth, but requires a supercomputer and can't handle long videos.
RFDM: It gets the smoothness of the Heavy Hauler but runs with the speed and lightness of the Photocopier.

The Analogy:

Fairy (a previous method) is like a heavy truck trying to drive on a dirt road. It gets the job done smoothly, but it burns a lot of fuel (RAM) and moves slowly.
RFDM is like a motorcycle. It weaves through traffic (frames) effortlessly, uses very little fuel, and arrives at the destination just as smoothly as the truck, but much faster.

The New "Report Card" (The Benchmark)

The authors also realized that the old ways of grading these AI videos were flawed.

Old Grading: "Does each frame match the text prompt?" (e.g., Does the result look like something you would describe as "Van Gogh"?)
New Grading (Señorita Benchmark): "Did the AI actually do what you asked without messing up the rest of the video?"
- They introduced a "Judge" (an AI model) that looks at the video and says, "Hey, you changed the person's shirt, but you also accidentally turned the sand pink. That's a bad edit."
- RFDM scored the highest on this new, stricter test because it is very good at only changing what you asked for.

Summary

RFDM is a new video editing tool that:

Edits frame-by-frame (like a storyteller) so it can handle videos of any length.
Only paints the changes (Residual Flow) so the video stays smooth and doesn't jitter.
Runs on regular computers (efficient) instead of needing a supercomputer.
Follows instructions closely, changing the style or removing objects without breaking the rest of the scene.

It's the difference between trying to edit a video by hand with a shaky camera versus having a smooth, intelligent assistant that knows exactly what to change and what to leave alone.

1. Problem Statement

Instructional video editing aims to modify input videos using natural language prompts (e.g., "remove the object" or "change style to anime") without requiring additional conditioning signals like masks or optical flow. Current state-of-the-art methods face three primary limitations:

Non-Causal & Fixed-Length: Most existing models rely on non-causal temporal mechanisms (processing all frames simultaneously), requiring fixed-length inputs and high computational resources. This makes them unsuitable for real-time applications, video streaming, or resource-constrained devices.
Inconsistency vs. Efficiency Trade-off: Naive approaches that apply Image-to-Image (I2I) models frame-by-frame are computationally efficient but suffer from temporal inconsistency (jittering). Conversely, methods that enforce consistency (like Fairy) often do so by smoothing results at the cost of massive computational overhead and increased memory usage.
Evaluation Gaps: Existing benchmarks often rely on text-image similarity metrics (CLIP), which fail to accurately measure faithfulness (preserving unedited regions) and temporal consistency (smooth motion) in video editing.

2. Methodology: RFDM

The authors propose RFDM (Residual Flow Diffusion Model), a causal, autoregressive video editing framework built upon efficient 2D Image-to-Image (I2I) diffusion backbones (specifically Stable Diffusion 1.5 and 3.5).

Core Architecture

Causal Autoregressive Generation: Unlike non-causal models, RFDM edits videos frame-by-frame. The prediction for the current frame $t$ is conditioned on the model's own prediction for the previous frame, $\hat{y}_{t-1}$. This allows for variable-length video synthesis with no fixed input constraints.
Residual Flow Prediction (Key Innovation): Instead of generating the full target frame $y^0_t$ from pure noise, RFDM reformulates the diffusion forward process to predict the temporal residual between the target frame and the previous prediction.
- Standard Forward Process: $y_s = \alpha_s y^0 + \sigma_s \epsilon$
- RFDM Forward Process: The noise mean is shifted toward the previous prediction $\hat{y}_{t-1}$. The model effectively learns to predict the difference (residual) $m^0_t = \hat{y}_{t-1} - y^0_t$.
- Benefit: This forces the model to focus denoising efforts only on regions that have changed (motion or editing instructions), leveraging temporal redundancy to improve consistency and efficiency.
Handling Exposure Bias: To prevent error accumulation during inference (where the model conditions on its own imperfect predictions rather than ground truth), the authors employ Diffusion Forcing. During training, they sample different noise levels for different frames and use the model's own denoised prediction for frame $t-1$ as the condition, rather than the ground-truth frame as in Teacher Forcing. This bridges the gap between training and inference distributions.
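
To make the shifted forward process more concrete, here is a minimal NumPy sketch. The exact parameterization is not spelled out above, so the form used here (noise centred on the previous prediction, $y_s = \alpha_s y^0_t + \sigma_s(\hat{y}_{t-1} + \epsilon)$) is an assumption, as are the function names and the schedule $\alpha_s = 1 - \sigma_s$. Under those assumptions the noisy sample equals the target plus $\sigma_s$ times (residual + noise), so wherever the previous prediction already matches the target, only plain noise is left to remove.

```python
import numpy as np


def standard_forward(y0, alpha_s, sigma_s, eps):
    """Standard forward process from the text: y_s = alpha_s * y0 + sigma_s * eps."""
    return alpha_s * y0 + sigma_s * eps


def shifted_forward(y0, y_prev, alpha_s, sigma_s, eps):
    """ASSUMED shifted form: the noise is centred on y_prev instead of on zero."""
    return alpha_s * y0 + sigma_s * (y_prev + eps)


def residual(y_prev, y0):
    """Temporal residual as defined in the text: m = y_prev - y0."""
    return y_prev - y0


rng = np.random.default_rng(0)
y0 = rng.normal(size=(4, 4))      # toy "target frame"
y_prev = y0.copy()
y_prev[0, 0] += 1.0               # previous prediction differs in one pixel only
eps = rng.normal(size=y0.shape)

sigma_s = 0.7
alpha_s = 1.0 - sigma_s           # assumed schedule, chosen for illustration

y_s = shifted_forward(y0, y_prev, alpha_s, sigma_s, eps)
m = residual(y_prev, y0)

# With alpha_s = 1 - sigma_s, the sample is the target plus scaled (residual + noise).
assert np.allclose(y_s, y0 + sigma_s * (m + eps))
# The residual is non-zero only where the two frames actually differ.
assert np.count_nonzero(m) == 1
```

In this toy setup the residual is zero everywhere except the single changed pixel, which is the temporal redundancy the residual formulation is meant to exploit.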

Training & Inference

Training: Trained on the Señorita dataset (2M paired video clips). The model uses a standard I2I backbone (UNet) but is conditioned on the previous frame's clean output and the current input frame.
Inference: The first frame is edited from pure noise (standard I2I). Subsequent frames start from a noise distribution shifted by the previous frame's prediction. The process uses Classifier-Free Guidance (CFG) to adhere to text prompts.
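
The inference procedure described above can be sketched as a simple causal loop. This is only an illustration of the control flow: `edit_video_causal`, `toy_edit_frame`, and the blending rule are hypothetical stand-ins, not the paper's model, and the real per-frame step would be a diffusion sampler with Classifier-Free Guidance.

```python
from typing import Callable, Iterable, Iterator, Optional

import numpy as np

Frame = np.ndarray
# Hypothetical per-frame editor: (input frame, previous output or None, prompt) -> output
EditFn = Callable[[Frame, Optional[Frame], str], Frame]


def edit_video_causal(
    frames: Iterable[Frame], prompt: str, edit_frame: EditFn
) -> Iterator[Frame]:
    """Edit frames in order, conditioning each one on the previous edited frame."""
    y_prev: Optional[Frame] = None      # the first frame has no previous prediction
    for x_t in frames:
        y_t = edit_frame(x_t, y_prev, prompt)
        yield y_t                       # emit immediately; nothing else is buffered
        y_prev = y_t                    # only one past frame is kept in memory


def toy_edit_frame(x_t: Frame, y_prev: Optional[Frame], prompt: str) -> Frame:
    """Toy stand-in for the diffusion model: invert the frame, then blend with the past."""
    edited = 1.0 - x_t
    if y_prev is None:
        return edited                   # first frame: plain image-to-image edit
    return 0.5 * edited + 0.5 * y_prev  # later frames: pulled toward the previous output


# A lazily generated 5-frame "video"; the loop never needs all frames at once.
video = (np.full((2, 2), 0.1 * i) for i in range(5))
outputs = list(edit_video_causal(video, "invert the colors", toy_edit_frame))
```

Because the loop is a generator that holds a single previous frame, its memory use in this sketch does not depend on how many frames the input contains.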

3. Key Contributions

Efficient Causal Video Editing: RFDM is the first method to adapt 2D I2I diffusion models for autoregressive video editing without adding computational overhead compared to frame-by-frame I2I baselines. Its per-frame cost does not grow with video length.
Residual Flow Formulation: A novel diffusion forward process that shifts the sampling noise mean to the previous prediction, encouraging the model to learn inter-frame residuals. This significantly improves temporal consistency and reduces jitter.
New Benchmark & Metrics: The authors introduce the Señorita Benchmark and new metrics to better evaluate video editing:
- ViDreamSim: Measures faithfulness to ground truth.
- Error Accumulation: Quantifies drift in autoregressive models over time.
- MLLM-as-a-Judge: Uses GPT-4o to rate instruction adherence and visual coherence, providing a holistic score.
Comprehensive Evaluation: Extensive ablation studies on autoregressive frame counts, input conditioning, and forcing strategies to optimize the trade-off between consistency and faithfulness.

4. Experimental Results

The model was evaluated on three benchmarks (TGVE, TGVE+, and the new Señorita benchmark) across three tasks: Global Style Transfer, Local Style Transfer, and Object Removal.

Performance vs. 3D Models: RFDM (specifically the SD3.5 variant) competes with heavy 3D Spatio-Temporal models (like EVE) in terms of editing quality and prompt alignment, despite using a much smaller 2D backbone.
Performance vs. 2D Baselines: RFDM significantly outperforms other I2I-based methods (like Fairy and VidToMe) in faithfulness and temporal consistency.
- Fairy often leaves artifacts or alters irrelevant regions.
- VidToMe produces smooth videos but deviates significantly from the ground truth.
- RFDM achieves the highest CLIPFrame scores (indicating consistency) and best MLLM-Judge scores.
Efficiency:
- Latency: RFDM matches the latency of Fairy but uses ~13x less RAM.
- Scalability: It is ~4x faster than other baselines and scales linearly with video length, unlike fixed-length 3D models.
Ablation Insights:
- Predicting residuals (vs. full frames) reduces error accumulation.
- Using Diffusion Forcing (vs. Teacher Forcing) improves faithfulness to the ground truth.
- Updating the previous frame condition every 3 frames (instead of every frame) optimally balances error accumulation and temporal consistency.
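
For readers unfamiliar with frame-consistency scores such as the CLIPFrame metric reported above, the following sketch shows one common way to compute such a score. The exact definition used in the paper is not stated here, so this assumes the mean cosine similarity between embeddings of consecutive frames (some benchmarks average over all frame pairs instead). The function name `frame_consistency` is hypothetical, and the toy vectors stand in for real CLIP image embeddings.

```python
import numpy as np


def frame_consistency(embeddings: np.ndarray) -> float:
    """Mean cosine similarity between consecutive rows of a (T, D) embedding array."""
    if embeddings.ndim != 2 or embeddings.shape[0] < 2:
        raise ValueError("expected a (T, D) array with at least two frames")
    # Normalize each frame embedding to unit length.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Dot product of each frame with its successor gives the cosine similarity.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    return float(sims.mean())


# Toy embeddings (NOT real CLIP features): a perfectly steady video...
steady = np.tile(np.array([1.0, 2.0, 3.0]), (4, 1))
# ...and one whose embedding flips between orthogonal directions every frame.
flicker = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

steady_score = frame_consistency(steady)
flicker_score = frame_consistency(flicker)
```

Higher values indicate that neighboring frames look more alike in embedding space, which is why such a score serves as a proxy for reduced jitter. It says nothing about whether the edit itself is faithful.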

5. Significance

RFDM demonstrates that causal, autoregressive generation is not only feasible but also efficient for video editing. By decoupling video editing from heavy 3D architectures, it enables:

Real-time applications: Suitable for video streaming and interactive editing.
Edge Deployment: The low RAM footprint allows deployment on resource-constrained devices (e.g., smartphones).
Scalability: The ability to edit videos of arbitrary length without retraining or fixed-window constraints.

The paper also highlights the critical need for better evaluation metrics in video generation, moving beyond simple text-image similarity to metrics that rigorously test temporal coherence and instruction faithfulness.