Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

The paper introduces Track Anything Behind Everything (TABE), a zero-shot pipeline for amodal video object segmentation: given only a single visible query mask, it fine-tunes a pretrained video diffusion model at test time, with no need for class-specific pretraining or retraining.

Finlay G. C. Hudson, William A. P. Smith

Published 2026-03-06

Imagine you are watching a magic show. A magician places a ball on a table and then covers it with a large, opaque cup. To your eyes, the ball has vanished. But your brain knows the ball is still there, hidden underneath. You can even guess its shape and position, even though you can't see it. This mental trick is called object permanence, and the specific act of "filling in the missing parts" of an object is called amodal completion.

For a long time, computers have been terrible at this. If a car drives behind a tree in a video, a standard computer vision system just stops tracking it or guesses wildly. It lacks the human intuition to say, "Ah, the car is still there, just hidden."

This paper introduces a new system called TABE (Track Anything Behind Everything) that teaches computers to do exactly what humans do: imagine the whole object, even when parts of it are invisible.

Here is how TABE works, explained through simple analogies:

1. The Problem: The "Blind Spot" in Computer Vision

Most current AI tools are like a security camera that only records what is directly in front of the lens. If a person walks behind a pillar, the camera loses them. To fix this, older AI methods tried to memorize specific types of objects (like "only cars" or "only dogs"). But this is like having a librarian who only knows how to find books about cats; if you ask for a book about dogs, they can't help.

TABE is different. It is Zero-Shot, meaning it doesn't need to be taught what a "dog" or a "car" is beforehand. You just point at an object in the first frame of a video, and it says, "Got it. I'll track that thing, even if it disappears behind a wall."

2. The Solution: The "Generative Artist"

Instead of just guessing where the object might be, TABE uses a Video Diffusion Model. Think of this model as a highly skilled, imaginative artist who has seen millions of videos.

  • The Input: You give the artist a video and a "query mask" (a simple outline of the object you want to track in the first frame).
  • The Process: The artist looks at the visible parts of the object. Then, using their vast knowledge of how things move and deform, they paint (or "outpaint") the missing parts.
  • The Magic: If a ball rolls behind a box, the artist doesn't just leave a blank spot. They "hallucinate" (in a good way) the rest of the ball, drawing it as if it were sitting right behind the box, maintaining its shape and motion.
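The steps above can be sketched with a toy stand-in. In the sketch below, a simple motion prior (linearly interpolating the centroid of a known circular shape) plays the role of the diffusion model's learned "imagination": where the object is fully hidden, we re-stamp its shape at the interpolated position. Every name here, and the circle template itself, is illustrative; this is not the paper's actual implementation.

```python
import numpy as np

def make_circle_mask(h, w, cy, cx, r):
    """Binary mask of a circle: our stand-in for the object's shape."""
    yy, xx = np.mgrid[0:h, 0:w]
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2

def complete_occluded_frames(visible_masks, radius):
    """Toy amodal completion: in frames where the object is fully hidden,
    interpolate its centroid from the visible frames and re-stamp its shape.
    A real system would let a video diffusion model outpaint this region."""
    h, w = visible_masks[0].shape
    centroids = {}
    for t, m in enumerate(visible_masks):
        if m.any():
            ys, xs = np.nonzero(m)
            centroids[t] = (ys.mean(), xs.mean())
    ts = sorted(centroids)
    out = []
    for t, m in enumerate(visible_masks):
        if m.any():
            out.append(m.copy())  # visible frames pass through untouched
        else:
            # Linear interpolation of the centroid across the occlusion gap.
            cy = np.interp(t, ts, [centroids[k][0] for k in ts])
            cx = np.interp(t, ts, [centroids[k][1] for k in ts])
            out.append(make_circle_mask(h, w, cy, cx, radius))
    return out
```

The difference in the real pipeline is what fills the gap: instead of a hand-coded linear motion model, the diffusion model's prior over how objects move and deform supplies the completion.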

3. The Secret Sauce: "Test-Time Fine-Tuning"

Here is the clever part. Usually, these "artist" models are generic. They know how to paint a generic ball, but they might not know the specific scratches or unique color of your ball.

TABE does something special called Test-Time Fine-Tuning.

  • Imagine you hire a painter to copy a specific painting. Instead of giving them a generic instruction, you show them a few photos of your specific ball first.
  • The painter quickly learns the unique texture and shape of your ball.
  • Then, they go back to the video and use that specific knowledge to fill in the hidden parts.

This happens at inference time, once per video (hence "test time"), so the model becomes a specialist for that specific object in that specific video.

4. Keeping the Artist Honest: The "Target Region"

There is a risk with these creative artists: they might get too imaginative. If you ask them to fill in a hidden person, they might accidentally draw a second person or some random background noise.

To stop this, TABE uses Target Region Masks.

  • Think of this as putting a stencil over the canvas. The artist is only allowed to paint inside the area where the object could logically be.
  • They use depth estimates (a per-pixel 3D map of the scene) to figure out: "The tree is in front, so the object must be behind it, but not floating in the sky."
  • This keeps the "hallucination" focused and accurate, ensuring the computer only fills in the missing object, not random junk.
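A minimal version of the stencil idea, assuming we have a per-pixel depth map and a rough depth for the object (the exact construction in the paper is more involved): the completion may only be painted where the object is either visible or hidden behind something nearer to the camera.

```python
import numpy as np

def target_region_mask(visible_mask, depth, object_depth):
    """Toy target region: the object may appear where it is visible,
    or where a nearer surface could be occluding it.
    (Simplified sketch of TABE's depth-based stencil.)"""
    occluder = depth < object_depth  # pixels covered by nearer surfaces
    return visible_mask | occluder

def constrain(hallucinated_mask, target_region):
    """Clip the generative model's output to the allowed stencil."""
    return hallucinated_mask & target_region
```

Even if the "artist" paints an entire extra object, `constrain` throws away everything outside the physically plausible region, which is what keeps the hallucination honest.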

5. The Result: Seeing the Invisible

The final output is a video where the computer has drawn the complete object for every single frame, even the ones where the object was 100% hidden.

Why does this matter?

  • Better Self-Driving Cars: If a car is hidden behind a truck, the AI can keep estimating where it is and anticipate when it will reappear, making driving safer.
  • Robotics: A robot can pick up an object even if it's partially blocked by another item, because it "sees" the whole shape.
  • Video Editing: You could remove an object from a video and have the AI fill in the gap convincingly, because it understands the full shape of what was there.

Summary

In short, TABE is a system that combines a "smart tracker" with a "creative painter." It watches a video, identifies an object, and then uses its imagination (guided by strict rules) to draw the invisible parts of that object as it moves behind obstacles. It allows computers to finally understand that just because you can't see something, it doesn't mean it's gone.