PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models

PropFly introduces a training pipeline for propagation-based video editing that eliminates the need for costly paired datasets. Instead, it draws on-the-fly supervision from a pre-trained video diffusion model, synthesizing diverse source-edited latent pairs by varying CFG scales, so that a lightweight adapter can learn high-quality, temporally consistent edits through Guidance-Modulated Flow Matching.

Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim

Published 2026-02-25

Imagine you have a home video of your dog running through a park. You want to edit it so the dog looks like a dragon, but you want the dragon to run exactly the same way the dog did, with the same background and the same wind blowing through the trees.

Doing this manually is a nightmare. Doing it with current AI is like trying to describe the dragon to a painter who keeps forgetting what you said, or who paints the dragon but forgets to make it run.

PropFly is a new method that solves this problem. Here is how it works, explained with simple analogies:

1. The Problem: The "Pair" Dilemma

To teach an AI to edit videos, you usually need a massive library of "Before and After" pairs. You need thousands of videos showing a "normal car" and the exact same video showing a "cyberpunk car."

  • The Issue: Making these pairs is incredibly expensive and slow. It's like hiring an army of artists to redraw every single frame of every movie to create a training dataset.

2. The Solution: The "Instant Tutor" (On-the-Fly Supervision)

Instead of hiring an army of artists to pre-make the training data, PropFly uses a pre-trained AI (a "Video Diffusion Model") as a live tutor.

Think of this pre-trained AI as a master chef who knows how to cook any dish perfectly.

  • Old Way: You ask the chef to cook 10,000 meals, save them, and then try to learn from them.
  • PropFly Way: You ask the chef to cook a meal right now, but you ask for two versions of it instantly:
    1. Version A (The Source): A plain, standard dish (e.g., a plain burger).
    2. Version B (The Target): A fancy, spicy version of the exact same dish (e.g., a spicy burger).

The magic is that the chef creates both versions from the same ingredients at the exact same moment. Because they come from the same "source," the bun, the patty, and the plating are identical. The only difference is the spice level (the style or object change).

3. The Secret Sauce: The "Volume Knob" (CFG)

How does the chef make two different dishes instantly?
PropFly uses a setting called Classifier-Free Guidance (CFG), which acts like a volume knob for creativity.

  • Low Volume (Low CFG): The AI pays little attention to the text instruction and stays very close to a plain rendering of the scene. This is the "Source."
  • High Volume (High CFG): The AI follows the instruction much more strongly. It takes the same video and applies a heavy "style filter" (e.g., turning it into a painting, changing the weather, or swapping the object). This is the "Target."

Because the AI generates both from the same "noisy" starting point, the motion (the dog running) stays perfectly synchronized, but the look (dog vs. dragon) changes.
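To make the "same noise, two volume settings" idea concrete, here is a toy sketch of classifier-free guidance. Everything in it is hypothetical for illustration: the denoiser is a fake linear function standing in for a real video diffusion model, and the guidance scales, step size, and latent shape are made up. Only the CFG blending formula itself is the standard one.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, cond):
    # Stand-in for a pre-trained model's noise predictor (hypothetical).
    # A text condition nudges the prediction toward the requested edit.
    drift = 0.1 if cond is not None else 0.0
    return 0.5 * x + drift

def cfg_step(x, scale, cond="a dragon running"):
    # Classifier-free guidance: blend unconditional and conditional predictions.
    eps_uncond = toy_denoiser(x, None)
    eps_cond = toy_denoiser(x, cond)
    eps = eps_uncond + scale * (eps_cond - eps_uncond)
    return x - 0.2 * eps  # one Euler-style denoising step

# Both trajectories start from the SAME noise, so structure and motion
# stay aligned between the two resulting latents.
x0 = rng.standard_normal((4, 8))  # toy latent: 4 frames x 8 channels

src, tgt = x0.copy(), x0.copy()
for _ in range(10):
    src = cfg_step(src, scale=1.5)  # low guidance  -> "source" latent
    tgt = cfg_step(tgt, scale=7.5)  # high guidance -> "target" latent
```

Because `src` and `tgt` share the starting noise and differ only in the guidance scale, they form exactly the kind of aligned before/after pair the adapter trains on.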

4. The Student: The "Adapter"

PropFly doesn't retrain the whole master chef (the big AI model). Instead, it attaches a small, trainable adapter (a "student") to the chef.

  • The student watches the Source (the plain video) and the Target (the spicy video).
  • The student learns: "Oh, when I see a plain burger and the user wants a spicy one, I need to add these specific spices to the whole video, not just the first frame."
  • The student learns to propagate the change. It learns how to take the "spicy" look from the first frame and apply it to every single frame that follows, keeping the motion smooth.
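The student's lesson can be sketched as a miniature flow-matching loop. This is a deliberately simplified illustration, not PropFly's actual Guidance-Modulated Flow Matching: the source/target pair is faked with a constant offset, and the "adapter" is a tiny affine map trained to predict the velocity from source to target.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for one on-the-fly pair from the frozen model:
# a low-CFG "source" latent and a high-CFG "target" latent (toy offset edit).
source = rng.standard_normal((4, 8))   # 4 frames x 8 channels
target = source + 0.5

# Toy adapter: an affine map. In PropFly this is a small trainable module
# attached to the frozen video diffusion model.
W = np.zeros((8, 8))
b = np.zeros(8)

def adapter(x):
    return x @ W + b

lr = 0.1
for _ in range(300):
    # Flow matching: sample a point on the straight path between source and
    # target, and regress the adapter's output onto the path's velocity.
    t = rng.uniform()
    x_t = (1 - t) * source + t * target
    v_true = target - source           # constant velocity of the linear path
    err = adapter(x_t) - v_true
    W -= lr * (x_t.T @ err) / x_t.shape[0]
    b -= lr * err.mean(axis=0)

# After training, the adapter predicts the edit direction from any point
# along the path, i.e., it has learned how to "move" source toward target.
mid = 0.5 * (source + target)
loss = np.mean((adapter(mid) - (target - source)) ** 2)
```

The key property mirrored here is that supervision comes from pairs generated on the fly, so the adapter learns the edit as a velocity field over the whole clip rather than a per-frame repaint.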

5. The Result: Robust Editing

Because the student learned from these "instantly generated" pairs, it becomes a master at editing.

  • Text-Guided Methods (The Old Way): You tell the AI "Make it a dragon," and it might make a dragon that walks like a human or forgets the background.
  • PropFly: You show it one frame of a dragon, and it says, "Got it! I know exactly how to turn the whole video into a dragon while keeping the dog's running motion perfectly intact."

Summary Analogy

Imagine you are learning to paint a moving car.

  • Old Method: You are given a photo of a red car and a photo of a blue car, but they are slightly different angles. You have to guess how to paint the blue car moving.
  • PropFly: You have a magic camera. You take a picture of a red car. You turn a dial, and instantly, the camera shows you the same car, in the exact same pose, but painted blue. You practice this thousands of times in a row. Now, you can look at any red car video and instantly know how to paint it blue while keeping the wheels spinning perfectly.

In short: PropFly teaches an AI to edit videos by letting it practice on "instantly generated" examples, so it doesn't need a massive, expensive library of pre-made examples to learn how to be creative and consistent.
