PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

PropFly introduces a training pipeline for propagation-based video editing that eliminates the need for costly paired datasets. Instead, it draws on-the-fly supervision from a pre-trained video diffusion model, synthesizing diverse source-edited latent pairs by varying CFG scales, so that a lightweight adapter can learn high-quality, temporally consistent edits through Guidance-Modulated Flow Matching.

Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim

Published 2026-02-25

Imagine you have a home video of your dog running through a park. You want to edit it so the dog looks like a dragon, but you want the dragon to run exactly the same way the dog did, with the same background and the same wind blowing through the trees.

Doing this manually is a nightmare. Doing it with current AI is like trying to describe the dragon to a painter who keeps forgetting what you said, or who paints the dragon but forgets to make it run.

PropFly is a new method that solves this problem. Here is how it works, explained with simple analogies:

1. The Problem: The "Pair" Dilemma

To teach an AI to edit videos, you usually need a massive library of "Before and After" pairs. You need thousands of videos showing a "normal car" and the exact same video showing a "cyberpunk car."

  • The Issue: Making these pairs is incredibly expensive and slow. It's like hiring an army of artists to redraw every single frame of every movie to create a training dataset.

2. The Solution: The "Instant Tutor" (On-the-Fly Supervision)

Instead of hiring an army of artists to pre-make the training data, PropFly uses a pre-trained AI (a "Video Diffusion Model") as a live tutor.

Think of this pre-trained AI as a master chef who knows how to cook any dish perfectly.

  • Old Way: You ask the chef to cook 10,000 meals, save them, and then try to learn from them.
  • PropFly Way: You ask the chef to cook a meal right now, but you ask for two versions of it instantly:
    1. Version A (The Source): A plain, standard dish (e.g., a plain burger).
    2. Version B (The Target): A fancy, spicy version of the exact same dish (e.g., a spicy burger).

The magic is that the chef creates both versions from the same ingredients at the exact same moment. Because they come from the same "source," the bun, the patty, and the plating are identical. The only difference is the spice level (the style or object change).

3. The Secret Sauce: The "Volume Knob" (CFG)

How does the chef make two different dishes instantly?
PropFly uses a setting called Classifier-Free Guidance (CFG), which acts like a volume knob for creativity.

  • Low Volume (Low CFG): The AI pays little attention to the text instruction and stays very close to a plain rendering of the scene. This is the "Source."
  • High Volume (High CFG): The AI follows the instruction much more strongly. It takes the same video and applies a heavy "style filter" (e.g., turning it into a painting, changing the weather, or swapping the object). This is the "Target."

Because the AI generates both from the same "noisy" starting point, the motion (the dog running) stays perfectly synchronized, but the look (dog vs. dragon) changes.
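To make the "same noise, two volume settings" idea concrete, here is a toy sketch of classifier-free guidance. Everything in it is hypothetical for illustration: the denoiser is a fake linear function standing in for a real video diffusion model, and the guidance scales, step size, and latent shape are made up. Only the CFG blending formula itself is the standard one.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, cond):
    # Stand-in for a pre-trained model's noise predictor (hypothetical).
    # A text condition nudges the prediction toward the requested edit.
    drift = 0.1 if cond is not None else 0.0
    return 0.5 * x + drift

def cfg_step(x, scale, cond="a dragon running"):
    # Classifier-free guidance: blend unconditional and conditional predictions.
    eps_uncond = toy_denoiser(x, None)
    eps_cond = toy_denoiser(x, cond)
    eps = eps_uncond + scale * (eps_cond - eps_uncond)
    return x - 0.2 * eps  # one Euler-style denoising step

# Both trajectories start from the SAME noise, so structure and motion
# stay aligned between the two resulting latents.
x0 = rng.standard_normal((4, 8))  # toy latent: 4 frames x 8 channels

src, tgt = x0.copy(), x0.copy()
for _ in range(10):
    src = cfg_step(src, scale=1.5)  # low guidance  -> "source" latent
    tgt = cfg_step(tgt, scale=7.5)  # high guidance -> "target" latent
```

Because `src` and `tgt` share the starting noise and differ only in the guidance scale, they form exactly the kind of aligned before/after pair the adapter trains on.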

4. The Student: The "Adapter"

PropFly doesn't retrain the whole master chef (the big AI model). Instead, it attaches a small, trainable adapter (a "student") to the chef.

  • The student watches the Source (the plain video) and the Target (the spicy video).
  • The student learns: "Oh, when I see a plain burger and the user wants a spicy one, I need to add these specific spices to the whole video, not just the first frame."
  • The student learns to propagate the change. It learns how to take the "spicy" look from the first frame and apply it to every single frame that follows, keeping the motion smooth.
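The student's lesson can be sketched as a miniature flow-matching loop. This is a deliberately simplified illustration, not PropFly's actual Guidance-Modulated Flow Matching: the source/target pair is faked with a constant offset, and the "adapter" is a tiny affine map trained to predict the velocity from source to target.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for one on-the-fly pair from the frozen model:
# a low-CFG "source" latent and a high-CFG "target" latent (toy offset edit).
source = rng.standard_normal((4, 8))   # 4 frames x 8 channels
target = source + 0.5

# Toy adapter: an affine map. In PropFly this is a small trainable module
# attached to the frozen video diffusion model.
W = np.zeros((8, 8))
b = np.zeros(8)

def adapter(x):
    return x @ W + b

lr = 0.1
for _ in range(300):
    # Flow matching: sample a point on the straight path between source and
    # target, and regress the adapter's output onto the path's velocity.
    t = rng.uniform()
    x_t = (1 - t) * source + t * target
    v_true = target - source           # constant velocity of the linear path
    err = adapter(x_t) - v_true
    W -= lr * (x_t.T @ err) / x_t.shape[0]
    b -= lr * err.mean(axis=0)

# After training, the adapter predicts the edit direction from any point
# along the path, i.e., it has learned how to "move" source toward target.
mid = 0.5 * (source + target)
loss = np.mean((adapter(mid) - (target - source)) ** 2)
```

The key property mirrored here is that supervision comes from pairs generated on the fly, so the adapter learns the edit as a velocity field over the whole clip rather than a per-frame repaint.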

5. The Result: Robust Editing

Because the student learned from these "instantly generated" pairs, it becomes a master at editing.

  • Text-Guided Methods (The Old Way): You tell the AI "Make it a dragon," and it might make a dragon that walks like a human or forgets the background.
  • PropFly: You show it one frame of a dragon, and it says, "Got it! I know exactly how to turn the whole video into a dragon while keeping the dog's running motion perfectly intact."

Summary Analogy

Imagine you are learning to paint a moving car.

  • Old Method: You are given a photo of a red car and a photo of a blue car, but they are slightly different angles. You have to guess how to paint the blue car moving.
  • PropFly: You have a magic camera. You take a picture of a red car. You turn a dial, and instantly, the camera shows you the same car, in the exact same pose, but painted blue. You practice this thousands of times in a row. Now, you can look at any red car video and instantly know how to paint it blue while keeping the wheels spinning perfectly.

In short: PropFly teaches an AI to edit videos by letting it practice on "instantly generated" examples, so it doesn't need a massive, expensive library of pre-made examples to learn how to be creative and consistent.
