Object-WIPER : Training-Free Object and Associated Effect Removal in Videos

The paper introduces Object-WIPER, a training-free framework that leverages pre-trained text-to-video diffusion transformers to remove dynamic objects and their associated visual effects from videos while ensuring semantically consistent and temporally coherent inpainting.

Saksham Singh Kushwaha, Sayan Nag, Yapeng Tian, Kuldeep Kulkarni

Published 2026-02-24

Imagine you are watching a home video of a family picnic. Suddenly, you notice a boom microphone (the long pole with a fuzzy windscreen) dipping into the frame, or perhaps a crew member's shadow is stretching across the happy family. You want to edit the video to make it look like the picnic was completely private and untouched.

This is the problem Object-WIPER solves. It's a new "magic eraser" for videos that doesn't just delete the unwanted person or object; it also deletes their shadows, reflections, and the weird ripples they cause, all without needing to be retrained on new data.

Here is how it works, broken down into simple concepts and analogies:

1. The Problem: The "Ghost" Effect

Most video editing tools today are like a clumsy painter. If you tell them to paint over a person, they might fill in the person's body with background grass. But they often forget to paint over the person's shadow or their reflection in a puddle.

The result? You end up with a floating shadow or a reflection of nothing, which looks like a ghost haunting your video. Previous AI tools either needed massive amounts of training data (like a student studying for years) or they just couldn't find these "ghosts" (shadows/reflections) to remove them.

2. The Solution: Object-WIPER (The Smart Detective)

Object-WIPER is a "training-free" tool. Think of it as a detective who already knows how the world works because they studied a massive library of movies and videos before they started working. They don't need to go to school again; they just use their existing knowledge to solve the case.

It works in three main steps:

Step A: Finding the "Hidden" Parts (The Detective's Magnifying Glass)

You give the AI a mask (a rough outline) of the object you want to remove, like a duck. You also tell it, "Remove the duck and its reflection."

Instead of just looking at the duck, the AI uses a special "attention" mechanism. Imagine the AI is reading a script (the text prompt) while looking at the video. It asks the video: "Hey, which pixels are talking about the 'duck' and the 'reflection'?"

  • Cross-Attention: It finds the pixels that match the word "duck."
  • Self-Attention: It then looks at those pixels and asks, "Who are you friends with?" It realizes the pixels representing the "reflection" are hanging out with the "duck" pixels.
  • Result: It creates a perfect, expanded mask that includes the duck and its reflection, filling in the gaps that a human might miss.
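The two-stage attention lookup above can be sketched in a few lines. This is a toy NumPy illustration with made-up numbers, not the paper's actual implementation: a cross-attention map seeds the mask, and self-attention propagates it to correlated pixels such as the reflection.

```python
import numpy as np

def expand_mask(cross_attn, self_attn, tau=0.5):
    """Toy sketch of attention-guided mask expansion (illustrative only).

    cross_attn: (N,) text-to-pixel scores for the target word ("duck").
    self_attn:  (N, N) pixel-to-pixel affinities from the video model.
    """
    # Seed mask: tokens that respond strongly to the object word.
    seed = (cross_attn >= tau * cross_attn.max()).astype(float)
    # Propagate through self-attention: tokens that "hang out with" the
    # seed tokens (e.g. the reflection) inherit a high score.
    propagated = self_attn @ seed
    grown = (propagated >= tau * propagated.max()).astype(float)
    # Final mask: union of the seed and the propagated region.
    return np.maximum(seed, grown)

# Toy frame with 6 tokens: 0-1 are the duck, 2 is its reflection
# (weak cross-attention to "duck", strong self-attention to the duck).
cross = np.array([0.9, 0.8, 0.1, 0.05, 0.05, 0.05])
self_a = np.eye(6) * 0.1
self_a[2, 0] = self_a[2, 1] = 0.9
mask = expand_mask(cross, self_a)   # picks up tokens 0, 1 AND 2
```

Cross-attention alone would miss token 2 (the reflection scores only 0.1 against the word "duck"); the self-attention pass is what pulls it into the final mask.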

Step B: The "Rewind and Reset" (The Time Traveler)

Once it knows exactly what to remove, it performs a magic trick called Inversion.

  • Imagine the video is a complex puzzle. The AI "rewinds" the video back to a state of pure static noise (like TV snow), but it carefully saves the parts of the puzzle that represent the background (the grass, the sky).
  • It then takes the "duck" part of the puzzle and throws it away, replacing it with fresh, random static noise.
  • Crucial Move: As it starts rebuilding the video (Denoising), it forces the new "duck" area to look at the saved "background" pieces for inspiration. It tells the new pixels: "Don't look at the old duck; look at the grass behind it and copy that."
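At its core, the "rewind and reset" amounts to keeping the background's inverted noise and swapping in fresh noise for the object region. Here is a minimal NumPy sketch; the function name and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def reset_masked_latents(inverted_latent, mask, rng):
    """Toy sketch (not the paper's code): after inversion, keep the
    background's saved noise but replace the object region with fresh
    Gaussian noise so no trace of the object survives."""
    fresh = rng.standard_normal(inverted_latent.shape)
    return np.where(mask == 1, fresh, inverted_latent)

rng = np.random.default_rng(0)
latent = np.arange(8.0).reshape(2, 4)   # stand-in for inverted latents
mask = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])         # right half = duck region
new_latent = reset_masked_latents(latent, mask, rng)
# Background columns are untouched; the duck columns are pure fresh noise.
```

Because the duck region starts denoising from noise that carries no memory of the duck, the only signal it can reconstruct from is the preserved background.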

Step C: The "Smart Paint" (The Art Restorer)

As the video rebuilds itself from the noise, the AI uses a technique called Attention Scaling.

  • Think of this as a volume knob. During the early stages of rebuilding, it turns the volume down on the "duck" area so it doesn't accidentally bring the duck back. It turns the volume up on the "background" area so the grass and sky flow naturally into the empty space.
  • By the time the video is finished, the duck and its reflection are gone, replaced by a seamless, realistic background.
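One common way to implement such a "volume knob" is an additive log-scale bias on the attention logits, which multiplies the softmax weight of object keys by a factor below 1 and of background keys by a factor above 1. The sketch below is a hypothetical illustration of that idea; the factors alpha and beta and the function itself are assumptions, not the paper's exact scheme.

```python
import numpy as np

def scaled_attention(q, k, v, key_mask, alpha=0.1, beta=1.5):
    """Toy sketch of attention scaling: an additive log-bias on the
    logits multiplies the softmax weight of object keys by alpha (< 1,
    volume down) and background keys by beta (> 1, volume up)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + np.log(np.where(key_mask == 1, alpha, beta))
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

q = np.ones((1, 2))
k = np.ones((3, 2))             # identical keys -> equal base logits
v = np.arange(6.0).reshape(3, 2)
key_mask = np.array([1, 0, 0])  # key 0 sits in the duck region
out, w = scaled_attention(q, k, v, key_mask)
# The query now leans on the two background keys instead of the duck key.
```

Without the bias, the three identical keys would each get weight 1/3; with it, the duck key's share drops well below that, so the background dominates the reconstruction.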

3. The New Scorecard: TokSim

The paper also points out a funny problem with how we grade these tools. Usually, we use metrics like PSNR (Peak Signal-to-Noise Ratio), which measures how close the pixels are to the original.

  • The Flaw: If an AI just copies the original video and doesn't remove the duck at all, the score is perfect! It's like a student copying the test answers and getting an A, even though they didn't solve the problem.
  • The Fix: The authors created a new score called TokSim. Instead of just counting pixels, it checks:
    1. Does the background stay consistent over time? (No flickering ghosts.)
    2. Does the new background look like it belongs there?
    3. Is the duck actually gone?
    If the duck is still there, the score crashes. If the duck is gone and the grass looks real, the score goes up.
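To make the idea concrete, here is a toy score built in the same spirit. It is NOT the paper's TokSim formula, just an illustration of the principle: reward preserved background tokens, penalize similarity to the original object tokens, so a lazy "copy the input" edit scores near zero.

```python
import numpy as np

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def toy_removal_score(orig, edited, mask):
    """Toy composite in the spirit of TokSim (not the paper's formula):
    reward preserved background AND dissimilarity to the original object."""
    obj_sim = cosine(orig[mask == 1], edited[mask == 1])  # should be LOW
    bg_sim = cosine(orig[mask == 0], edited[mask == 0])   # should be HIGH
    return bg_sim * (1.0 - obj_sim)

# Token features for a 4-token frame: tokens 0-1 grass, tokens 2-3 duck.
orig = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
mask = np.array([0, 0, 1, 1])
lazy = orig.copy()                                   # "copy the input"
good = np.array([[1., 0.], [1., 0.], [1., 0.], [1., 0.]])  # duck -> grass
```

The cheating student fails here: copying the input gives an object similarity near 1, which drives the score to zero, while genuinely replacing the duck with grass-like tokens scores near 1.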

Why This Matters

  • No Training Needed: You don't need a supercomputer to train this model. It works out of the box.
  • Real-World Ready: It handles tricky stuff like translucent objects (glass), mirrors, and moving shadows, which previous tools failed at.
  • Clean Results: It doesn't just cut out the object; it heals the wound so the video looks like the object was never there.

In summary: Object-WIPER is like a highly skilled film editor who can look at a messy scene, identify not just the actor but their shadow and reflection, and then seamlessly "paint over" the entire mess with the background, making it look like the scene was always perfect. And it does all this without needing to go back to film school.
