Imagine you want to edit a photo, but instead of just giving one command like "remove the cat," you want to have a conversation with the image. You say, "Add a cat." Then, "Make the cat wear a hat." Then, "Put the cat on a skateboard." Then, "Change the background to a jungle."
This is called In-Context Image Editing. The problem is, teaching a computer to do this is like trying to teach a dog to play chess by showing it only one move at a time. Most AI models today are trained on static "Before and After" pairs (like a photo of a messy room and a photo of a clean room). They learn to fix one thing, but they get confused when you ask them to do five things in a row, often forgetting the first instruction or messing up the picture.
Enter VINCIE (Video-driven IN-Context Image Editing). Here is the simple story of how they solved it.
The Big Idea: Stop Looking at Photos, Start Watching Movies
The researchers asked a simple question: "Why are we teaching image editing using static photos when the real world moves?"
Think of a photo as a single frozen frame. It tells you what something looks like.
Think of a video as a story. It tells you what happens next.
In a video, if a person walks out of a room, the camera sees the "Before" (person in room) and the "After" (empty room) naturally. If a car drives by, the video shows the transition. The video is a sequence of edits happening in real-time.
The Analogy:
- Old Method: Trying to learn how to cook a 5-course meal by looking at a single photo of a burnt steak and a photo of a perfect steak. You don't know the steps in between.
- VINCIE Method: Watching a cooking show (video). You see the chef chop, fry, season, and plate. You learn the flow of the changes, not just the start and end.
How They Built It (The "Magic Translator")
Since videos don't come with instructions like "Now I am removing the tree," the team had to teach the AI how to read the story of the video.
- The Scriptwriter (VLM): They used a smart AI (a Vision-Language Model) to watch the video clips and write a "script." It looks at two frames and says, "Okay, in this second, the sun moved, and the dog jumped."
- The Highlighter (Segmentation): They also taught the AI to draw a "highlighter" around exactly what changed. If the dog jumped, the AI draws a mask around the dog.
- The Sequence: They turned the video into a long, interleaved chain:
- Image 1 -> Instruction: "Dog jumps" -> Mask of Dog -> Image 2 -> Instruction: "Sun sets" -> Mask of Sky -> Image 3.
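The interleaved chain above can be sketched as a tiny data-structure example. Everything here is a dummy placeholder: in the real pipeline, a vision-language model writes the captions and a segmentation model produces the change masks.

```python
# A minimal sketch of turning a video into an interleaved training
# sequence: image -> instruction -> mask -> image -> ...
# Frames, captions, and masks are placeholder strings, not real data.

def build_interleaved_sequence(frames, captions, masks):
    """Interleave each frame with the instruction and change mask
    that lead to the next frame."""
    # One caption and one mask per transition between consecutive frames.
    assert len(captions) == len(masks) == len(frames) - 1
    sequence = []
    for i, frame in enumerate(frames[:-1]):
        sequence.append(("image", frame))
        sequence.append(("instruction", captions[i]))
        sequence.append(("mask", masks[i]))
    sequence.append(("image", frames[-1]))
    return sequence

# Three frames and the two transitions between them.
frames = ["frame_1", "frame_2", "frame_3"]
captions = ["Dog jumps", "Sun sets"]
masks = ["mask_of_dog", "mask_of_sky"]

seq = build_interleaved_sequence(frames, captions, masks)
print([kind for kind, _ in seq])
# -> ['image', 'instruction', 'mask', 'image', 'instruction', 'mask', 'image']
```

The payoff of this layout is that the model sees editing as one long story rather than isolated before/after pairs.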
The Three Training Games
To make the AI really good at this, they didn't just ask it to predict the next picture. They made it play three games simultaneously:
- The "What's Next?" Game (Next Image Prediction): "Here is the scene and the instruction. What does the next frame look like?" (This is the main editing task).
- The "Spot the Change" Game (Current Segmentation): "Here is the new picture. Can you circle exactly what changed?" (This helps the AI understand where to edit).
- The "Crystal Ball" Game (Next Segmentation): "Here is the current scene. Where do you think the next change will happen?" (This helps the AI plan ahead, like a chess player).
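The three games amount to training on a weighted sum of three losses. The sketch below is purely illustrative: the loss functions, weights, and "model" are stand-ins, not the paper's actual implementation.

```python
# A toy sketch of combining the three training objectives into one loss.
# Each loss function is a placeholder that just reads a dummy error value.

def next_image_loss(model, context, target_image):
    # "What's Next?": predict the next frame from context + instruction.
    return model["image_error"]

def current_segmentation_loss(model, context, target_mask):
    # "Spot the Change": circle what changed in the current frame.
    return model["seg_error"]

def next_segmentation_loss(model, context, target_mask):
    # "Crystal Ball": predict where the next change will land.
    return model["next_seg_error"]

def total_loss(model, batch, weights=(1.0, 0.5, 0.5)):
    # Weighted sum of the three games; the weights here are made up.
    w_img, w_seg, w_next = weights
    return (w_img * next_image_loss(model, batch["context"], batch["next_image"])
            + w_seg * current_segmentation_loss(model, batch["context"], batch["mask"])
            + w_next * next_segmentation_loss(model, batch["context"], batch["next_mask"]))

# Dummy error values just to show the weighted sum: 0.8 + 0.2 + 0.3
model = {"image_error": 0.8, "seg_error": 0.4, "next_seg_error": 0.6}
batch = {"context": None, "next_image": None, "mask": None, "next_mask": None}
print(total_loss(model, batch))
```

Playing all three games on the same batch forces the model to connect *what* changes with *where* it changes.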
Why This is a Game Changer
The results are impressive because the AI learned context: it remembers earlier turns and keeps the rest of the scene stable while it edits.
- No More "Drifting": In older models, if you edited a photo 5 times, the person's face might start to look like a potato by the 4th turn. VINCIE keeps the face looking like the same person because it learned from videos where people stay consistent even as they move.
- Chain of Thought: The AI starts "thinking" before it acts. It predicts the "mask" (the area to change) before it generates the new pixels. It's like an artist sketching the outline before painting.
- Emergent Skills: Because it learned from the "flow" of videos, it accidentally learned cool things it wasn't explicitly taught, like:
- Storytelling: It can generate a sequence of images that tell a coherent story (e.g., a character walking from a house to a mountain).
- Multi-Concept Mixing: It can combine a "cat," a "spaceship," and "jungle" in one go, even if it never saw those exact three things together in a video.
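The "sketch the outline before painting" idea can be shown as a two-step toy loop: predict the mask first, then repaint only inside it. Both functions here are hypothetical placeholders operating on a 1-D "image" of labeled pixels, not the real model.

```python
# A toy sketch of mask-first editing: decide WHERE to change,
# then generate new content ONLY inside that region.

def predict_mask(image, instruction):
    # Step 1: the "sketch" -- which pixels does this edit touch?
    return {i for i, pixel in enumerate(image) if pixel == instruction["target"]}

def apply_edit(image, mask, new_value):
    # Step 2: the "painting" -- repaint only the masked pixels,
    # so everything else stays identical across many turns (no drift).
    return [new_value if i in mask else p for i, p in enumerate(image)]

# A 1-D "image" of labeled pixels, edited in one conversational turn.
image = ["sky", "sky", "dog", "dog", "grass"]
turn = {"target": "dog"}            # e.g. "Turn the dog into a cat"
mask = predict_mask(image, turn)
image = apply_edit(image, mask, "cat")
print(image)  # -> ['sky', 'sky', 'cat', 'cat', 'grass']
```

Because untouched pixels pass through unchanged, the same scene can survive many turns without degrading.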
The Bottom Line
VINCIE is like teaching a child to edit photos by letting them watch a thousand hours of movies instead of showing them a stack of "Before and After" flashcards. By learning from the natural motion and changes in video, the AI understands the logic of editing, not just the result.
It's scalable (the internet offers a practically unlimited supply of videos) and it creates a model that can handle long, complex editing conversations without getting confused or losing the plot.