Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Kiwi-Edit addresses two bottlenecks at once: the limitations of instruction-only video editing and the scarcity of reference-guided training data. It introduces a scalable data generation pipeline that produces the RefVIE dataset, plus a unified architecture that combines learnable queries with latent visual features to achieve state-of-the-art controllable video editing.

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

Published 2026-03-06

Imagine you want to edit a home video. Maybe you want to change the background from a boring living room to a bustling Paris café, or swap your friend's t-shirt for a cool superhero costume.

In the past, doing this with AI was like trying to describe a painting to a blind artist using only words. You'd say, "Make the shirt blue," but the AI might make it a weird shade of teal, or change the whole person's face. It struggled with the details.

Then, some smart researchers tried a new trick: Show, don't just tell. They said, "Here is a picture of the exact blue shirt I want; now put it on the person." This worked much better, but there was a huge problem: Nobody had enough examples to teach the AI how to do this. It's like trying to teach a student to drive without a library of driving manuals.

Enter Kiwi-Edit, a new project from researchers at the National University of Singapore. They solved the "no manuals" problem and built a super-smart video editor. Here's how it works, broken down into simple analogies:

1. The Problem: The "Missing Manual"

To teach an AI to edit videos using a reference picture (like a photo of a hat you want to add), you need a massive library of training examples. Each example needs four things:

  1. The Original Video.
  2. The Instruction ("Add a hat").
  3. The Reference Picture (A photo of the hat).
  4. The Final Edited Video.

Existing datasets had 1, 2, or 3 of these, but never all 4 together. The AI was stuck because it had never seen the "Reference Picture" paired with the result before.
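The four-part training example above can be sketched as a simple record. This is purely illustrative: the field names and file layout are hypothetical, not the actual RefVIE schema.

```python
from dataclasses import dataclass

# Illustrative sketch of one reference-guided training example.
# Field names and paths are made up; the real RefVIE schema may differ.
@dataclass
class EditExample:
    source_video: str     # 1. the original video
    instruction: str      # 2. the edit instruction
    reference_image: str  # 3. the reference picture
    edited_video: str     # 4. the final edited video

example = EditExample(
    source_video="clips/living_room.mp4",
    instruction="Add a hat",
    reference_image="refs/hat.png",
    edited_video="clips/living_room_hat.mp4",
)
print(example.instruction)  # Add a hat
```

The key point is that all four fields exist for every example; earlier datasets were always missing at least one.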

2. The Solution: The "Magic Photocopier" Pipeline

The team didn't hire thousands of people to manually create these examples (which would take forever and cost a fortune). Instead, they built an automated "Magic Photocopier" pipeline.

  • Step 1: They took millions of existing video editing examples (where someone just used text instructions).
  • Step 2: They used a super-smart AI to look at the "Before" and "After" videos and figure out exactly what changed.
  • Step 3: They used another AI to "reverse engineer" the change. If the video changed a red car to a blue truck, the AI generated a clean, high-quality picture of just that blue truck.
  • Step 4: They filtered out the bad copies and kept the best 477,000 examples.

The Analogy: Imagine you have a million photos of people changing clothes. You don't have a photo of the new shirt. But, you use a magic machine that looks at the person in the new shirt, cuts them out, and prints a perfect photo of just the shirt. Now you have a "Reference Photo" for every single example. Boom! You have a massive library to teach the AI.
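The four pipeline steps can be sketched as a simple loop. The helper functions below are stand-ins for the paper's actual models (an MLLM for change analysis, an image generator for the reference, and a filter); their names and the scoring logic are assumptions for illustration only.

```python
# Hypothetical sketch of the "Magic Photocopier" pipeline.
# extract_change, render_reference, and quality_score stand in for the
# MLLM-based analysis, image generation, and filtering models.

def extract_change(before, after):
    """Step 2: describe what changed between the two videos."""
    return {"before": before, "after": after}

def render_reference(change):
    """Step 3: 'reverse engineer' a clean image of the new object."""
    return f"reference_image_of({change['after']})"

def quality_score(reference):
    """Step 4: placeholder quality check used to filter bad copies."""
    return 1.0 if reference else 0.0

def build_dataset(pairs, threshold=0.5):
    kept = []
    for before, after in pairs:  # Step 1: existing text-instruction edit pairs
        change = extract_change(before, after)
        ref = render_reference(change)
        if quality_score(ref) >= threshold:
            kept.append((before, after, ref))
    return kept

data = build_dataset([("red car", "blue truck")])
print(len(data))  # 1
```

Running this over millions of existing examples, then keeping only the high-scoring ones, is how the team ended up with roughly 477,000 complete four-part examples.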

3. The Brain: The "Conductor and the Orchestra"

Once they had the data, they built Kiwi-Edit, the actual video editing model. Think of it like a symphony orchestra:

  • The Conductor (The MLLM): This is a giant AI brain that understands language and images. It listens to your command ("Put a fedora on the woman") and looks at your reference photo. It doesn't paint the video itself; it just tells the orchestra what to play.
  • The Orchestra (The Diffusion Transformer): This is the engine that actually generates the video frames. It's very good at making videos, but it needs clear directions.
  • The Score (The Connectors): The Conductor uses two special tools to talk to the Orchestra:
    1. The Query Token: A shorthand note saying "Do the action!" (e.g., "Add hat").
    2. The Latent Token: A direct copy of the reference photo's "DNA" (texture, color, shape) so the Orchestra knows exactly what the hat looks like.
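The two "score" signals can be pictured as one combined conditioning sequence handed from the Conductor to the Orchestra. This toy sketch uses made-up dimensions and random arrays; it only shows the shape of the idea (compact query tokens plus many appearance-carrying latent tokens), not the model's real tensors.

```python
import numpy as np

# Toy sketch: two conditioning signals for the diffusion transformer.
# Dimensions are invented for illustration.
d_model = 64
query_tokens = np.random.randn(8, d_model)     # "do the action" summary from the MLLM
latent_tokens = np.random.randn(256, d_model)  # reference photo's "DNA" (texture, color, shape)

# The Orchestra receives both, concatenated into one sequence.
conditioning = np.concatenate([query_tokens, latent_tokens], axis=0)
print(conditioning.shape)  # (264, 64)
```

Note the asymmetry: a handful of query tokens suffice for "what to do," while many latent tokens are needed to pin down exactly "what it looks like."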

The Secret Sauce: To make sure the video doesn't look like a glitchy mess, they use a "Hybrid Injection" strategy.

  • They keep the original video's structure (the movement, the timing) locked in tight, like a skeleton.
  • They then "paint" the new details (the hat, the background) over the top using the reference photo.
  • Analogy: It's like taking a video of a person walking and using a projector to shine a new image of a hat onto their head. The person keeps walking naturally (the skeleton), but the hat looks exactly like the one in your photo (the projection).
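The projector analogy can be sketched as a masked blend in latent space. This is not the paper's exact math; the binary edit mask and the simple linear mix are assumptions used to show the idea of keeping the video's structure while repainting appearance.

```python
import numpy as np

# Toy sketch of a hybrid-injection idea: the source video's latents are
# the "skeleton", and the reference image's latents repaint appearance
# only inside a hypothetical edit region.
frames, h, w, c = 4, 8, 8, 16
video_latents = np.random.randn(frames, h, w, c)  # motion and structure, locked in
ref_latents = np.random.randn(h, w, c)            # reference appearance

edit_mask = np.zeros((h, w, 1))
edit_mask[2:6, 2:6] = 1.0  # region to repaint (e.g. where the hat goes)

# Outside the mask the original video passes through untouched;
# inside, the reference appearance is "projected" on.
hybrid = video_latents * (1 - edit_mask) + ref_latents * edit_mask
print(hybrid.shape)  # (4, 8, 8, 16)
```

Because the mask is zero everywhere except the edit region, the walking stays natural while only the hat changes.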

4. The Training: The "School Curriculum"

You can't just throw a baby into a swimming pool and expect them to swim. The team trained Kiwi-Edit in three stages, like a school curriculum:

  1. Kindergarten (Alignment): Teaching the Conductor and the Orchestra how to speak the same language using simple image editing tasks.
  2. Elementary (Instruction Tuning): Teaching the model to follow text commands for videos (e.g., "Change the sky to sunset").
  3. Graduate School (Reference Fine-Tuning): Finally, introducing the massive new dataset with reference photos. This is where the model learns to say, "Oh, you want this specific hat, not just any hat."
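The curriculum amounts to running the same model through three datasets in a fixed order, easy to hard. The stage names and data labels below are paraphrases of the description above, not the project's actual configuration keys.

```python
# Hypothetical sketch of the three-stage curriculum as an ordered config.
stages = [
    {"name": "alignment",
     "data": "image_editing",
     "goal": "teach the MLLM and diffusion model a shared language"},
    {"name": "instruction_tuning",
     "data": "text_only_video_edits",
     "goal": "follow text commands on videos"},
    {"name": "reference_finetuning",
     "data": "RefVIE",
     "goal": "copy the specific object shown in a reference photo"},
]

def run_curriculum(stages):
    completed = []
    for stage in stages:  # stages must run in order: each builds on the last
        completed.append(stage["name"])
    return completed

print(run_curriculum(stages))
```

The ordering matters: reference fine-tuning only works once the model already speaks the shared language from stage one and follows instructions from stage two.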

5. The Result: A New Standard

They tested Kiwi-Edit against other top models (including some from big tech companies).

  • Instruction Only: It was better at following text commands than almost anyone else.
  • Reference Guided: When given a photo to copy, it was the best open-source model available, rivaling expensive, closed-source commercial tools.

In a Nutshell:
Kiwi-Edit is like giving a video editor a "Show and Tell" superpower. Instead of guessing what you mean by "cool hat," you can show it a picture of the hat, and it will perfectly paste that hat onto the video while keeping the person's movement natural. They did this by inventing a way to automatically create millions of "Show and Tell" examples and building a model smart enough to learn from them.

Where to find it:
The researchers made everything free. You can download the dataset, the code, and the model from their GitHub page to try it out yourself!