Object-WIPER : Training-Free Object and Associated Effect Removal in Videos

The paper introduces Object-WIPER, a training-free framework that leverages pre-trained text-to-video diffusion transformers to remove dynamic objects and their associated visual effects from videos while ensuring semantically consistent and temporally coherent inpainting.

Saksham Singh Kushwaha, Sayan Nag, Yapeng Tian, Kuldeep Kulkarni

Published 2026-02-24

Imagine you are watching a home video of a family picnic. Suddenly, you notice a boom microphone (the long pole with a fuzzy windscreen) dipping into the frame, or perhaps a crew member's shadow is stretching across the happy family. You want to edit the video to make it look like the picnic was completely private and untouched.

This is the problem Object-WIPER solves. It's a new "magic eraser" for videos that doesn't just delete the unwanted person or object; it also deletes their shadows, reflections, and the weird ripples they cause, all without needing to be retrained on new data.

Here is how it works, broken down into simple concepts and analogies:

1. The Problem: The "Ghost" Effect

Most video editing tools today are like a clumsy painter. If you tell them to paint over a person, they might fill in the person's body with background grass. But they often forget to paint over the person's shadow or their reflection in a puddle.

The result? You end up with a floating shadow or a reflection of nothing, which looks like a ghost haunting your video. Previous AI tools either needed massive amounts of training data (like a student studying for years) or they just couldn't find these "ghosts" (shadows/reflections) to remove them.

2. The Solution: Object-WIPER (The Smart Detective)

Object-WIPER is a "training-free" tool. Think of it as a detective who already knows how the world works because they studied a massive library of movies and videos before they started working. They don't need to go to school again; they just use their existing knowledge to solve the case.

It works in three main steps:

Step A: Finding the "Hidden" Parts (The Detective's Magnifying Glass)

You give the AI a mask (a rough outline) of the object you want to remove, like a duck. You also tell it, "Remove the duck and its reflection."

Instead of just looking at the duck, the AI uses a special "attention" mechanism. Imagine the AI is reading a script (the text prompt) while looking at the video. It asks the video: "Hey, which pixels are talking about the 'duck' and the 'reflection'?"

  • Cross-Attention: It finds the pixels that match the word "duck."
  • Self-Attention: It then looks at those pixels and asks, "Who are you friends with?" It realizes the pixels representing the "reflection" are hanging out with the "duck" pixels.
  • Result: It creates a perfect, expanded mask that includes the duck and its reflection, filling in the gaps that a human might miss.
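The two-stage attention lookup above can be sketched in a few lines. This is a toy NumPy illustration with made-up numbers, not the paper's actual implementation: a cross-attention map seeds the mask, and self-attention propagates it to correlated pixels such as the reflection.

```python
import numpy as np

def expand_mask(cross_attn, self_attn, tau=0.5):
    """Toy sketch of attention-guided mask expansion (illustrative only).

    cross_attn: (N,) text-to-pixel scores for the target word ("duck").
    self_attn:  (N, N) pixel-to-pixel affinities from the video model.
    """
    # Seed mask: tokens that respond strongly to the object word.
    seed = (cross_attn >= tau * cross_attn.max()).astype(float)
    # Propagate through self-attention: tokens that "hang out with" the
    # seed tokens (e.g. the reflection) inherit a high score.
    propagated = self_attn @ seed
    grown = (propagated >= tau * propagated.max()).astype(float)
    # Final mask: union of the seed and the propagated region.
    return np.maximum(seed, grown)

# Toy frame with 6 tokens: 0-1 are the duck, 2 is its reflection
# (weak cross-attention to "duck", strong self-attention to the duck).
cross = np.array([0.9, 0.8, 0.1, 0.05, 0.05, 0.05])
self_a = np.eye(6) * 0.1
self_a[2, 0] = self_a[2, 1] = 0.9
mask = expand_mask(cross, self_a)   # picks up tokens 0, 1 AND 2
```

Cross-attention alone would miss token 2 (the reflection scores only 0.1 against the word "duck"); the self-attention pass is what pulls it into the final mask.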

Step B: The "Rewind and Reset" (The Time Traveler)

Once it knows exactly what to remove, it performs a magic trick called Inversion.

  • Imagine the video is a complex puzzle. The AI "rewinds" the video back to a state of pure static noise (like TV snow), but it carefully saves the parts of the puzzle that represent the background (the grass, the sky).
  • It then takes the "duck" part of the puzzle and throws it away, replacing it with fresh, random static noise.
  • Crucial Move: As it starts rebuilding the video (Denoising), it forces the new "duck" area to look at the saved "background" pieces for inspiration. It tells the new pixels: "Don't look at the old duck; look at the grass behind it and copy that."
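At its core, the "rewind and reset" amounts to keeping the background's inverted noise and swapping in fresh noise for the object region. Here is a minimal NumPy sketch; the function name and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def reset_masked_latents(inverted_latent, mask, rng):
    """Toy sketch (not the paper's code): after inversion, keep the
    background's saved noise but replace the object region with fresh
    Gaussian noise so no trace of the object survives."""
    fresh = rng.standard_normal(inverted_latent.shape)
    return np.where(mask == 1, fresh, inverted_latent)

rng = np.random.default_rng(0)
latent = np.arange(8.0).reshape(2, 4)   # stand-in for inverted latents
mask = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])         # right half = duck region
new_latent = reset_masked_latents(latent, mask, rng)
# Background columns are untouched; the duck columns are pure fresh noise.
```

Because the duck region starts denoising from noise that carries no memory of the duck, the only signal it can reconstruct from is the preserved background.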

Step C: The "Smart Paint" (The Art Restorer)

As the video rebuilds itself from the noise, the AI uses a technique called Attention Scaling.

  • Think of this as a volume knob. During the early stages of rebuilding, it turns the volume down on the "duck" area so it doesn't accidentally bring the duck back. It turns the volume up on the "background" area so the grass and sky flow naturally into the empty space.
  • By the time the video is finished, the duck and its reflection are gone, replaced by a seamless, realistic background.
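One common way to implement such a "volume knob" is an additive log-scale bias on the attention logits, which multiplies the softmax weight of object keys by a factor below 1 and of background keys by a factor above 1. The sketch below is a hypothetical illustration of that idea; the factors alpha and beta and the function itself are assumptions, not the paper's exact scheme.

```python
import numpy as np

def scaled_attention(q, k, v, key_mask, alpha=0.1, beta=1.5):
    """Toy sketch of attention scaling: an additive log-bias on the
    logits multiplies the softmax weight of object keys by alpha (< 1,
    volume down) and background keys by beta (> 1, volume up)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + np.log(np.where(key_mask == 1, alpha, beta))
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

q = np.ones((1, 2))
k = np.ones((3, 2))             # identical keys -> equal base logits
v = np.arange(6.0).reshape(3, 2)
key_mask = np.array([1, 0, 0])  # key 0 sits in the duck region
out, w = scaled_attention(q, k, v, key_mask)
# The query now leans on the two background keys instead of the duck key.
```

Without the bias, the three identical keys would each get weight 1/3; with it, the duck key's share drops well below that, so the background dominates the reconstruction.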

3. The New Scorecard: TokSim

The paper also points out a funny problem with how we grade these tools. Usually, we use metrics like PSNR (Peak Signal-to-Noise Ratio), which measures how close the pixels are to the original.

  • The Flaw: If an AI just copies the original video and doesn't remove the duck at all, the score is perfect! It's like a student copying the test answers and getting an A, even though they didn't solve the problem.
  • The Fix: The authors created a new score called TokSim. Instead of just counting pixels, it checks:
    1. Does the background stay consistent over time? (No flickering ghosts.)
    2. Does the new background look like it belongs there?
    3. Is the duck actually gone?
    If the duck is still there, the score crashes. If the duck is gone and the grass looks real, the score goes up.
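To make the idea concrete, here is a toy score built in the same spirit. It is NOT the paper's TokSim formula, just an illustration of the principle: reward preserved background tokens, penalize similarity to the original object tokens, so a lazy "copy the input" edit scores near zero.

```python
import numpy as np

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def toy_removal_score(orig, edited, mask):
    """Toy composite in the spirit of TokSim (not the paper's formula):
    reward preserved background AND dissimilarity to the original object."""
    obj_sim = cosine(orig[mask == 1], edited[mask == 1])  # should be LOW
    bg_sim = cosine(orig[mask == 0], edited[mask == 0])   # should be HIGH
    return bg_sim * (1.0 - obj_sim)

# Token features for a 4-token frame: tokens 0-1 grass, tokens 2-3 duck.
orig = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
mask = np.array([0, 0, 1, 1])
lazy = orig.copy()                                   # "copy the input"
good = np.array([[1., 0.], [1., 0.], [1., 0.], [1., 0.]])  # duck -> grass
```

The cheating student fails here: copying the input gives an object similarity near 1, which drives the score to zero, while genuinely replacing the duck with grass-like tokens scores near 1.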

Why This Matters

  • No Training Needed: You don't need a supercomputer to train this model. It works out of the box.
  • Real-World Ready: It handles tricky stuff like translucent objects (glass), mirrors, and moving shadows, which previous tools failed at.
  • Clean Results: It doesn't just cut out the object; it heals the wound so the video looks like the object was never there.

In summary: Object-WIPER is like a highly skilled film editor who can look at a messy scene, identify not just the actor but their shadow and reflection, and then seamlessly "paint over" the entire mess with the background, making it look like the scene was always perfect. And it does all this without needing to go back to film school.
