Imagine you are editing a home video of a family picnic. Suddenly, a stranger walks right through the middle of your shot, blocking the view of the kids playing. You want to delete them.

In the old days, if you tried to erase that person, you'd just paint over them with a blurry patch of grass. It would look fake. Worse, if that person was standing in the sun, their shadow would still be there on the grass, or if they were near a shiny car, their reflection would still be visible in the window. The video would look like a bad Photoshop job.

This paper introduces a new magic trick called EffectErase that solves this problem. Here is how it works, explained simply:

1. The Problem: The "Ghost" Effects

Previous video editors were like clumsy painters. They could remove the main object (the person), but they were blind to the "side effects" the object left behind.

The Shadow: Even after removing the person, their dark shadow remained on the ground.
The Reflection: If the person was near a window, their reflection stayed in the glass.
The Lighting: If the person was holding a flashlight, the beam of light stayed on the wall.

It's like trying to erase a stain from a shirt but leaving the shadow of the stain on the fabric underneath.

2. The Solution: A "Two-Way Street" (Removal & Insertion)

The researchers realized that to be good at erasing something, you first need to be good at adding it.

Think of it like a magic trick:

The Removal Task: "Take this person out of the video."
The Insertion Task: "Take this empty background and put a person back in."

The new system, EffectErase, learns both tasks at the same time. It's like a student who learns to unbake a cake by also practicing how to bake one. By practicing putting things in, the AI learns how shadows, reflections, and lighting work, which tells it exactly what to take out during removal.

3. The New Training Ground: The "VOR" Dataset

To teach this AI, the researchers needed a massive library of examples. They couldn't just find these videos on YouTube because they needed to know exactly what each scene looked like both with and without the object.

So, they built VOR (Video Object Removal), a giant dataset with 60,000 video pairs:

Real Life: They set up cameras on tripods and filmed real scenes, first with an object, then without it.
Virtual World: They used 3D computer graphics to create fake worlds where they could perfectly control the shadows and reflections.

This is like a driving school that has both real roads and a perfect simulator, so the AI learns to handle rain, shadows, and weird angles.

4. How It Works: The "Spotlight"

The AI has a special module called Task-Aware Region Guidance. Imagine the AI has a flashlight.

When you ask it to remove a person, the flashlight doesn't just shine on the person. It shines on the person AND their shadow, their reflection, and the area where their body blocked the light.
It understands that the shadow is "connected" to the person, even though the shadow lies flat on the ground while the person stands upright.

5. The Result

When you use EffectErase:

You draw a mask (a circle) around the object you want gone.
The AI doesn't just delete the circle. It deletes the person, the shadow, the reflection, and fixes the lighting.
The background looks like the person was never there at all. It's seamless, smooth, and realistic.

In short: Previous methods were like using a stamp to cover a stain. EffectErase is like rewinding time to before the stain happened, but only for that specific spot, fixing every ripple, shadow, and reflection perfectly.

Technical Summary: EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

1. Problem Definition

Video object removal aims to eliminate dynamic target objects from a video while restoring a seamless, realistic background. However, existing methods face two critical limitations:

Failure to Erase Secondary Effects: Current state-of-the-art (SOTA) methods often successfully remove the main object body but fail to erase associated visual effects such as shadows, reflections, lighting changes, occlusions, and deformations. These "side effects" are often left behind, resulting in unnatural or artifact-ridden outputs.
Lack of Comprehensive Data: There is a significant scarcity of large-scale, high-quality datasets that systematically capture the relationship between objects and their induced effects across diverse environments. Existing datasets are either image-based (lacking temporal consistency) or synthetic but limited in motion dynamics and effect diversity.

2. Methodology

A. The VOR Dataset (Video Object Removal)

To address the data gap, the authors introduce VOR, a large-scale hybrid dataset comprising 60,000 video pairs (145+ hours).

Composition: It combines real-world captured data (293 scenes using tripod-mounted cameras with Ken Burns effects) and 3D synthesized data (150+ diverse 3D scenes with rigged object motions and multi-camera setups).
Coverage: It covers five distinct effect types: Occlusion, Shadow, Lighting, Reflection, and Deformation.
Structure: Each entry consists of a triplet: (1) Video with object + effects, (2) Video without object (clean background), and (3) Corresponding object masks.
Benchmarks: Two evaluation sets are provided: VOR-Eval (with ground truth) and VOR-Wild (in-the-wild, no ground truth).
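As a rough illustration of the triplet structure, the sketch below holds one entry in a small container and counts the pixels that differ between the paired videos but fall outside the object mask, which is the part a mask-only removal would miss. The class name `VORTriplet`, the helper `effect_pixels_outside_mask`, the array layouts, and the threshold are all assumptions made for this example, not the dataset's actual format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VORTriplet:
    """One dataset entry (hypothetical layout, for illustration only).

    Videos are (frames, height, width, 3) floats in [0, 1];
    masks are (frames, height, width) booleans.
    """
    with_object: np.ndarray  # video containing the object and its effects
    background: np.ndarray   # same scene without the object
    masks: np.ndarray        # per-frame object masks

    def __post_init__(self):
        if self.with_object.shape != self.background.shape:
            raise ValueError("paired videos must have identical shapes")
        if self.masks.shape != self.with_object.shape[:3]:
            raise ValueError("masks must match the videos' frame/height/width")


def effect_pixels_outside_mask(triplet, threshold=0.05):
    """Count pixels that change between the paired videos but lie outside the mask."""
    diff = np.abs(triplet.with_object - triplet.background).mean(axis=-1)
    changed = diff > threshold  # assumed threshold, chosen arbitrarily
    return int(np.count_nonzero(changed & ~triplet.masks))


# Toy clip: 2 frames of 4x4 RGB with a flat grey background.
background = np.full((2, 4, 4, 3), 0.5)
with_object = background.copy()
with_object[:, 1, 1] = 1.0  # the object itself
with_object[:, 2, 1] = 0.2  # its shadow, outside the mask
masks = np.zeros((2, 4, 4), dtype=bool)
masks[:, 1, 1] = True

triplet = VORTriplet(with_object, background, masks)
leftover = effect_pixels_outside_mask(triplet)
print(leftover)  # one shadow pixel per frame
```

Keeping the clean background next to the object video is what makes this kind of check possible: the effect region can be read off from the pair rather than annotated by hand.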

B. The EffectErase Framework

EffectErase is a joint learning framework built on a diffusion transformer backbone (based on Wan 2.1) that treats video object removal and insertion as reciprocal, inverse tasks.

Joint Removal-Insertion Learning:
- The model shares a common denoising backbone for both tasks.
- Removal: Input is the video with the object ( $V_o$ ) and mask ( $M$ ); the target is the clean background ( $V_b$ ).
- Insertion: Input is the clean background ( $V_b$ ) and object ( $V_f$ ); the target is the video with the object ( $V_o$ ).
- This dual-task setup forces the model to learn consistent structural cues and effect regions.
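The reciprocal setup can be sketched as two views of the same data: what is the input for one task is the target for the other. The function names and dictionary layout below are hypothetical, and deriving $V_f$ by cutting the object out of $V_o$ with the mask is an assumption; the summary does not say how the object input is produced.

```python
import numpy as np


def removal_example(v_o, v_b, mask):
    """Removal: the model sees (V_o, M) and should produce V_b."""
    return {"task": "remove", "inputs": {"video": v_o, "mask": mask}, "target": v_b}


def insertion_example(v_o, v_b, mask):
    """Insertion: the model sees (V_b, V_f) and should produce V_o.

    Assumption: V_f is obtained by cutting the object out of V_o with the mask.
    """
    v_f = v_o * mask[..., None]  # zero everywhere except the object pixels
    return {"task": "insert", "inputs": {"video": v_b, "object": v_f}, "target": v_o}


# Toy clip: 2 frames of 4x4 RGB, one object pixel per frame.
v_b = np.full((2, 4, 4, 3), 0.5)
v_o = v_b.copy()
v_o[:, 1, 1] = 1.0
mask = np.zeros((2, 4, 4), dtype=bool)
mask[:, 1, 1] = True

remove = removal_example(v_o, v_b, mask)
insert = insertion_example(v_o, v_b, mask)
```

Note that the insertion target $V_o$ contains the object's shadows and reflections even though the object input does not, so the model has to synthesize those effects itself.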
Task-Aware Region Guidance (TARG):
- A cross-attention mechanism that fuses task tokens (indicating "remove" or "insert") with foreground visual tokens (extracted via CLIP).
- This module explicitly models the spatiotemporal correlations between the target object and its induced effects, guiding the model to focus on the specific affected regions rather than just the object mask.
- It enables flexible switching between removal and insertion modes via task tokens.
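For readers unfamiliar with cross-attention, here is a minimal single-head sketch of fusing a task token with foreground tokens. It is a generic scaled dot-product attention, not the paper's TARG module: treating the task token as the query and the foreground tokens as keys and values is an assumption, and the embeddings, projection matrices, and token counts are random placeholders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, dim), context: (n_c, dim); returns fused tokens and weights.
    """
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_q, n_c)
    return weights @ v, weights


rng = np.random.default_rng(0)
dim = 8

# Hypothetical learned task embeddings, one token per mode.
task_tokens = {"remove": rng.normal(size=(1, dim)),
               "insert": rng.normal(size=(1, dim))}
# Random stand-in for the foreground visual tokens (CLIP features in the summary).
foreground_tokens = rng.normal(size=(5, dim))
# Random stand-ins for learned projection matrices.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

fused_remove, attn_remove = cross_attention(
    task_tokens["remove"], foreground_tokens, w_q, w_k, w_v)
fused_insert, attn_insert = cross_attention(
    task_tokens["insert"], foreground_tokens, w_q, w_k, w_v)
```

Swapping the task token changes the fused output while the foreground tokens stay the same, which is the sense in which a task token can switch modes.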
Effect Consistency (EC) Loss:
- Since removal and insertion are inverse operations, they should identify the same affected regions (object + effects).
- The EC loss aligns the cross-attention maps of both branches. It minimizes the Kullback-Leibler (KL) divergence between the predicted effect regions and a difference map prior (derived from the pixel-wise difference between $V_o$ and $V_b$ ).
- This ensures that the model learns to localize and erase (or synthesize) effects consistently across both tasks.
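A minimal sketch of such a loss follows, assuming the attention maps have already been resized to the video's frame resolution. The KL direction (prior against prediction), the simple sum over the two branches, and the epsilon smoothing are assumptions; the summary only states that a KL divergence against a difference-map prior is minimized.

```python
import numpy as np

EPS = 1e-8  # smoothing so that zero entries do not break the logarithm


def to_distribution(x):
    """Flatten a non-negative map and normalise it to sum to 1."""
    flat = np.asarray(x, dtype=np.float64).ravel() + EPS
    return flat / flat.sum()


def difference_prior(v_o, v_b):
    """Pixel-wise |V_o - V_b|, averaged over colour channels, as a distribution."""
    return to_distribution(np.abs(v_o - v_b).mean(axis=-1))


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return float(np.sum(p * np.log(p / q)))


def effect_consistency_loss(attn_remove, attn_insert, v_o, v_b):
    """Pull both branches' attention maps towards the same difference prior."""
    prior = difference_prior(v_o, v_b)
    return (kl_divergence(prior, to_distribution(attn_remove))
            + kl_divergence(prior, to_distribution(attn_insert)))


# Toy clip: an object pixel plus a shadow pixel that lies outside the mask.
v_b = np.full((2, 4, 4, 3), 0.5)
v_o = v_b.copy()
v_o[:, 1, 1] = 1.0  # object
v_o[:, 2, 1] = 0.2  # shadow
mask = np.zeros((2, 4, 4))
mask[:, 1, 1] = 1.0

effect_aware = np.abs(v_o - v_b).mean(axis=-1)  # attends to object and shadow
mask_only = mask                                # attends to the object alone

loss_good = effect_consistency_loss(effect_aware, effect_aware, v_o, v_b)
loss_bad = effect_consistency_loss(mask_only, mask_only, v_o, v_b)
```

In this toy case the attention map that ignores the shadow is penalised, while the one that covers both object and shadow incurs essentially no loss.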

3. Key Contributions

VOR Dataset: The first large-scale, high-quality hybrid dataset specifically designed for effect-aware video object removal, covering diverse real-world and synthetic scenarios with five specific effect categories.
EffectErase Architecture: A novel reciprocal learning framework that jointly optimizes removal and insertion. It introduces TARG to model object-effect correlations and EC Loss to enforce consistency in effect localization.
State-of-the-Art Performance: The method achieves superior results in both quantitative metrics (PSNR, SSIM, FVD) and qualitative visual quality, particularly in erasing complex secondary effects like shadows and reflections.
Dual Capability: The framework naturally extends to high-quality video object insertion, generating realistic effects (shadows, reflections) for inserted objects without additional training.
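Of the metrics named above, PSNR is the simplest to state, so a short sketch may help readers interpret the numbers. It uses the standard definition, 10·log10(MAX² / MSE), and assumes pixel values in [0, 1]; the toy frames are invented for illustration and are not results from the paper.

```python
import numpy as np


def psnr(prediction, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    prediction = np.asarray(prediction, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((prediction - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy frame: the ground-truth background is flat grey.
reference = np.full((1, 4, 4, 3), 0.5)

# A result that removed the object but left one shadow pixel behind.
shadow_left = reference.copy()
shadow_left[:, 2, 1] = 0.2

# A result that also erased the shadow.
shadow_erased = reference.copy()

psnr_shadow_left = psnr(shadow_left, reference)
psnr_shadow_erased = psnr(shadow_erased, reference)
```

A single leftover shadow pixel already pulls PSNR down from infinity to a finite value, which is why effect-aware removal shows up in pixel-level metrics when ground-truth backgrounds are available.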

4. Experimental Results

Quantitative: On the ROSE-Benchmark and VOR-Eval, EffectErase outperforms SOTA methods (including ROSE, MinMax-Remover, ProPainter, and VACE) across all metrics. Notably, it achieves the lowest FVD (Fréchet Video Distance), a distribution-level metric that is sensitive to both per-frame quality and temporal coherence.
Qualitative: Visual comparisons show that while other methods leave behind shadows, reflections, or lighting artifacts, EffectErase successfully removes the target object and its associated environmental effects, resulting in clean, coherent backgrounds.
Ablation Studies:
- Removing the EC Loss raises FVD from 342.8 to 354.5 (about 3.4% worse), which supports the value of the consistency constraint.
- Removing TARG degrades SSIM, consistent with its role in localizing effect regions.
- Training with Synthetic Data significantly improves generalization and background restoration quality.

5. Significance

EffectErase shifts the goal of video editing from simple "object removal" to "effect erasing." By explicitly modeling the visual interactions between objects and their environments (shadows, reflections, deformations), it addresses a long-standing limitation in video inpainting. The VOR dataset provides a benchmark for future research, and the joint learning strategy offers a promising approach for handling complex, dynamic scenes in real-world applications such as film post-production and content creation.
