Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Kiwi-Edit addresses two bottlenecks at once: the limitations of instruction-only video editing and the scarcity of reference-guided training data. It introduces a scalable data generation pipeline that produces the RefVIE dataset, plus a unified architecture that combines learnable queries with latent visual features to achieve state-of-the-art controllable video editing.

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou

Published 2026-03-06

Imagine you want to edit a home video. Maybe you want to change the background from a boring living room to a bustling Paris café, or swap your friend's t-shirt for a cool superhero costume.

In the past, doing this with AI was like trying to describe a painting to a blind artist using only words. You'd say, "Make the shirt blue," but the AI might make it a weird shade of teal, or change the whole person's face. It struggled with the details.

Then, some smart researchers tried a new trick: Show, don't just tell. They said, "Here is a picture of the exact blue shirt I want; now put it on the person." This worked much better, but there was a huge problem: Nobody had enough examples to teach the AI how to do this. It's like trying to teach a student to drive without a library of driving manuals.

Enter Kiwi-Edit, a new project from researchers at the National University of Singapore. They solved the "no manuals" problem and built a super-smart video editor. Here's how it works, broken down into simple analogies:

1. The Problem: The "Missing Manual"

To teach an AI to edit videos using a reference picture (like a photo of a hat you want to add), you need a massive library of training examples. Each example needs four things:

  1. The Original Video.
  2. The Instruction ("Add a hat").
  3. The Reference Picture (A photo of the hat).
  4. The Final Edited Video.

Existing datasets had 1, 2, or 3 of these, but never all 4 together. The AI was stuck because it had never seen the "Reference Picture" paired with the result before.
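The four-part training example above can be sketched as a simple record. This is purely illustrative: the field names and file layout are hypothetical, not the actual RefVIE schema.

```python
from dataclasses import dataclass

# Illustrative sketch of one reference-guided training example.
# Field names and paths are made up; the real RefVIE schema may differ.
@dataclass
class EditExample:
    source_video: str     # 1. the original video
    instruction: str      # 2. the edit instruction
    reference_image: str  # 3. the reference picture
    edited_video: str     # 4. the final edited video

example = EditExample(
    source_video="clips/living_room.mp4",
    instruction="Add a hat",
    reference_image="refs/hat.png",
    edited_video="clips/living_room_hat.mp4",
)
print(example.instruction)  # Add a hat
```

The key point is that all four fields exist for every example; earlier datasets were always missing at least one.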

2. The Solution: The "Magic Photocopier" Pipeline

The team didn't hire thousands of people to manually create these examples (which would take forever and cost a fortune). Instead, they built an automated "Magic Photocopier" pipeline.

  • Step 1: They took millions of existing video editing examples (where someone just used text instructions).
  • Step 2: They used a super-smart AI to look at the "Before" and "After" videos and figure out exactly what changed.
  • Step 3: They used another AI to "reverse engineer" the change. If the video changed a red car to a blue truck, the AI generated a clean, high-quality picture of just that blue truck.
  • Step 4: They filtered out the bad copies and kept the best 477,000 examples.

The Analogy: Imagine you have a million photos of people changing clothes. You don't have a photo of the new shirt. But, you use a magic machine that looks at the person in the new shirt, cuts them out, and prints a perfect photo of just the shirt. Now you have a "Reference Photo" for every single example. Boom! You have a massive library to teach the AI.
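The four pipeline steps can be sketched as a simple loop. The helper functions below are stand-ins for the paper's actual models (an MLLM for change analysis, an image generator for the reference, and a filter); their names and the scoring logic are assumptions for illustration only.

```python
# Hypothetical sketch of the "Magic Photocopier" pipeline.
# extract_change, render_reference, and quality_score stand in for the
# MLLM-based analysis, image generation, and filtering models.

def extract_change(before, after):
    """Step 2: describe what changed between the two videos."""
    return {"before": before, "after": after}

def render_reference(change):
    """Step 3: 'reverse engineer' a clean image of the new object."""
    return f"reference_image_of({change['after']})"

def quality_score(reference):
    """Step 4: placeholder quality check used to filter bad copies."""
    return 1.0 if reference else 0.0

def build_dataset(pairs, threshold=0.5):
    kept = []
    for before, after in pairs:  # Step 1: existing text-instruction edit pairs
        change = extract_change(before, after)
        ref = render_reference(change)
        if quality_score(ref) >= threshold:
            kept.append((before, after, ref))
    return kept

data = build_dataset([("red car", "blue truck")])
print(len(data))  # 1
```

Running this over millions of existing examples, then keeping only the high-scoring ones, is how the team ended up with roughly 477,000 complete four-part examples.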

3. The Brain: The "Conductor and the Orchestra"

Once they had the data, they built Kiwi-Edit, the actual video editing model. Think of it like a symphony orchestra:

  • The Conductor (The MLLM): This is a giant AI brain that understands language and images. It listens to your command ("Put a fedora on the woman") and looks at your reference photo. It doesn't paint the video itself; it just tells the orchestra what to play.
  • The Orchestra (The Diffusion Transformer): This is the engine that actually generates the video frames. It's very good at making videos, but it needs clear directions.
  • The Score (The Connectors): The Conductor uses two special tools to talk to the Orchestra:
    1. The Query Token: A shorthand note saying "Do the action!" (e.g., "Add hat").
    2. The Latent Token: A direct copy of the reference photo's "DNA" (texture, color, shape) so the Orchestra knows exactly what the hat looks like.
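The two "score" signals can be pictured as one combined conditioning sequence handed from the Conductor to the Orchestra. This toy sketch uses made-up dimensions and random arrays; it only shows the shape of the idea (compact query tokens plus many appearance-carrying latent tokens), not the model's real tensors.

```python
import numpy as np

# Toy sketch: two conditioning signals for the diffusion transformer.
# Dimensions are invented for illustration.
d_model = 64
query_tokens = np.random.randn(8, d_model)     # "do the action" summary from the MLLM
latent_tokens = np.random.randn(256, d_model)  # reference photo's "DNA" (texture, color, shape)

# The Orchestra receives both, concatenated into one sequence.
conditioning = np.concatenate([query_tokens, latent_tokens], axis=0)
print(conditioning.shape)  # (264, 64)
```

Note the asymmetry: a handful of query tokens suffice for "what to do," while many latent tokens are needed to pin down exactly "what it looks like."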

The Secret Sauce: To make sure the video doesn't look like a glitchy mess, they use a "Hybrid Injection" strategy.

  • They keep the original video's structure (the movement, the timing) locked in tight, like a skeleton.
  • They then "paint" the new details (the hat, the background) over the top using the reference photo.
  • Analogy: It's like taking a video of a person walking and using a projector to shine a new image of a hat onto their head. The person keeps walking naturally (the skeleton), but the hat looks exactly like the one in your photo (the projection).
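The projector analogy can be sketched as a masked blend in latent space. This is not the paper's exact math; the binary edit mask and the simple linear mix are assumptions used to show the idea of keeping the video's structure while repainting appearance.

```python
import numpy as np

# Toy sketch of a hybrid-injection idea: the source video's latents are
# the "skeleton", and the reference image's latents repaint appearance
# only inside a hypothetical edit region.
frames, h, w, c = 4, 8, 8, 16
video_latents = np.random.randn(frames, h, w, c)  # motion and structure, locked in
ref_latents = np.random.randn(h, w, c)            # reference appearance

edit_mask = np.zeros((h, w, 1))
edit_mask[2:6, 2:6] = 1.0  # region to repaint (e.g. where the hat goes)

# Outside the mask the original video passes through untouched;
# inside, the reference appearance is "projected" on.
hybrid = video_latents * (1 - edit_mask) + ref_latents * edit_mask
print(hybrid.shape)  # (4, 8, 8, 16)
```

Because the mask is zero everywhere except the edit region, the walking stays natural while only the hat changes.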

4. The Training: The "School Curriculum"

You can't just throw a baby into a swimming pool and expect them to swim. The team trained Kiwi-Edit in three stages, like a school curriculum:

  1. Kindergarten (Alignment): Teaching the Conductor and the Orchestra how to speak the same language using simple image editing tasks.
  2. Elementary (Instruction Tuning): Teaching the model to follow text commands for videos (e.g., "Change the sky to sunset").
  3. Graduate School (Reference Fine-Tuning): Finally, introducing the massive new dataset with reference photos. This is where the model learns to say, "Oh, you want this specific hat, not just any hat."
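The curriculum amounts to running the same model through three datasets in a fixed order, easy to hard. The stage names and data labels below are paraphrases of the description above, not the project's actual configuration keys.

```python
# Hypothetical sketch of the three-stage curriculum as an ordered config.
stages = [
    {"name": "alignment",
     "data": "image_editing",
     "goal": "teach the MLLM and diffusion model a shared language"},
    {"name": "instruction_tuning",
     "data": "text_only_video_edits",
     "goal": "follow text commands on videos"},
    {"name": "reference_finetuning",
     "data": "RefVIE",
     "goal": "copy the specific object shown in a reference photo"},
]

def run_curriculum(stages):
    completed = []
    for stage in stages:  # stages must run in order: each builds on the last
        completed.append(stage["name"])
    return completed

print(run_curriculum(stages))
```

The ordering matters: reference fine-tuning only works once the model already speaks the shared language from stage one and follows instructions from stage two.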

5. The Result: A New Standard

They tested Kiwi-Edit against other top models (including some from big tech companies).

  • Instruction Only: It was better at following text commands than almost anyone else.
  • Reference Guided: When given a photo to copy, it was the best open-source model available, rivaling expensive, closed-source commercial tools.

In a Nutshell:
Kiwi-Edit is like giving a video editor a "Show and Tell" superpower. Instead of guessing what you mean by "cool hat," you can show it a picture of the hat, and it will perfectly paste that hat onto the video while keeping the person's movement natural. They did this by inventing a way to automatically create millions of "Show and Tell" examples and building a model smart enough to learn from them.

Where to find it:
The researchers made everything free. You can download the dataset, the code, and the model from their GitHub page to try it out yourself!