Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation

The paper introduces Track Anything Behind Everything (TABE), a zero-shot pipeline for amodal video object segmentation: given only a single visible query mask, it fine-tunes a pretrained video diffusion model at test time, with no need for class-specific pretraining or retraining.

Finlay G. C. Hudson, William A. P. Smith

Published 2026-03-06

Imagine you are watching a magic show. A magician places a ball on a table and then covers it with a large, opaque cup. To your eyes, the ball has vanished. But your brain knows the ball is still there, hidden underneath. You can even guess its shape and position, even though you can't see it. This mental trick is called object permanence, and the specific act of "filling in the missing parts" of an object is called amodal completion.

For a long time, computers have been terrible at this. If a car drives behind a tree in a video, a standard computer vision system just stops tracking it or guesses wildly. It lacks the human intuition to say, "Ah, the car is still there, just hidden."

This paper introduces a new system called TABE (Track Anything Behind Everything) that teaches computers to do exactly what humans do: imagine the whole object, even when parts of it are invisible.

Here is how TABE works, explained through simple analogies:

1. The Problem: The "Blind Spot" in Computer Vision

Most current AI tools are like a security camera that only records what is directly in front of the lens. If a person walks behind a pillar, the camera loses them. To fix this, older AI methods tried to memorize specific types of objects (like "only cars" or "only dogs"). But this is like having a librarian who only knows how to find books about cats; if you ask for a book about dogs, they can't help.

TABE is different. It is Zero-Shot, meaning it doesn't need to be taught what a "dog" or a "car" is beforehand. You just point at an object in the first frame of a video, and it says, "Got it. I'll track that thing, even if it disappears behind a wall."

2. The Solution: The "Generative Artist"

Instead of just guessing where the object might be, TABE uses a Video Diffusion Model. Think of this model as a highly skilled, imaginative artist who has seen millions of videos.

  • The Input: You give the artist a video and a "query mask" (a simple outline of the object you want to track in the first frame).
  • The Process: The artist looks at the visible parts of the object. Then, using their vast knowledge of how things move and deform, they paint (or "outpaint") the missing parts.
  • The Magic: If a ball rolls behind a box, the artist doesn't just leave a blank spot. They "hallucinate" (in a good way) the rest of the ball, drawing it as if it were sitting right behind the box, maintaining its shape and motion.
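The steps above can be sketched with a toy stand-in. In the sketch below, a simple motion prior (linearly interpolating the centroid of a known circular shape) plays the role of the diffusion model's learned "imagination": where the object is fully hidden, we re-stamp its shape at the interpolated position. Every name here, and the circle template itself, is illustrative; this is not the paper's actual implementation.

```python
import numpy as np

def make_circle_mask(h, w, cy, cx, r):
    """Binary mask of a circle: our stand-in for the object's shape."""
    yy, xx = np.mgrid[0:h, 0:w]
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2

def complete_occluded_frames(visible_masks, radius):
    """Toy amodal completion: in frames where the object is fully hidden,
    interpolate its centroid from the visible frames and re-stamp its shape.
    A real system would let a video diffusion model outpaint this region."""
    h, w = visible_masks[0].shape
    centroids = {}
    for t, m in enumerate(visible_masks):
        if m.any():
            ys, xs = np.nonzero(m)
            centroids[t] = (ys.mean(), xs.mean())
    ts = sorted(centroids)
    out = []
    for t, m in enumerate(visible_masks):
        if m.any():
            out.append(m.copy())  # visible frames pass through untouched
        else:
            # Linear interpolation of the centroid across the occlusion gap.
            cy = np.interp(t, ts, [centroids[k][0] for k in ts])
            cx = np.interp(t, ts, [centroids[k][1] for k in ts])
            out.append(make_circle_mask(h, w, cy, cx, radius))
    return out
```

The difference in the real pipeline is what fills the gap: instead of a hand-coded linear motion model, the diffusion model's prior over how objects move and deform supplies the completion.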

3. The Secret Sauce: "Test-Time Fine-Tuning"

Here is the clever part. Usually, these "artist" models are generic. They know how to paint a generic ball, but they might not know the specific scratches or unique color of your ball.

TABE does something special called Test-Time Fine-Tuning.

  • Imagine you hire a painter to copy a specific painting. Instead of giving them a generic instruction, you show them a few photos of your specific ball first.
  • The painter quickly learns the unique texture and shape of your ball.
  • Then, they go back to the video and use that specific knowledge to fill in the hidden parts.

This happens at inference time, once per video (hence "test time"), so the model becomes a specialist for that specific object in that specific video.

4. Keeping the Artist Honest: The "Target Region"

There is a risk with these creative artists: they might get too imaginative. If you ask them to fill in a hidden person, they might accidentally draw a second person or some random background noise.

To stop this, TABE uses Target Region Masks.

  • Think of this as putting a stencil over the canvas. The artist is only allowed to paint inside the area where the object could logically be.
  • They use depth estimates (a per-pixel 3D map of the scene) to figure out: "The tree is in front, so the object must be behind it, but not floating in the sky."
  • This keeps the "hallucination" focused and accurate, ensuring the computer only fills in the missing object, not random junk.
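A minimal version of the stencil idea, assuming we have a per-pixel depth map and a rough depth for the object (the exact construction in the paper is more involved): the completion may only be painted where the object is either visible or hidden behind something nearer to the camera.

```python
import numpy as np

def target_region_mask(visible_mask, depth, object_depth):
    """Toy target region: the object may appear where it is visible,
    or where a nearer surface could be occluding it.
    (Simplified sketch of TABE's depth-based stencil.)"""
    occluder = depth < object_depth  # pixels covered by nearer surfaces
    return visible_mask | occluder

def constrain(hallucinated_mask, target_region):
    """Clip the generative model's output to the allowed stencil."""
    return hallucinated_mask & target_region
```

Even if the "artist" paints an entire extra object, `constrain` throws away everything outside the physically plausible region, which is what keeps the hallucination honest.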

5. The Result: Seeing the Invisible

The final output is a video where the computer has drawn the complete object for every single frame, even the ones where the object was 100% hidden.

Why does this matter?

  • Better Self-Driving Cars: If a car is hidden behind a truck, the AI can keep estimating where it is and anticipate when it will reappear, making driving safer.
  • Robotics: A robot can pick up an object even if it's partially blocked by another item, because it "sees" the whole shape.
  • Video Editing: You could remove an object from a video and have the AI fill in the gap convincingly, because it understands the full shape of what was there.

Summary

In short, TABE is a system that combines a "smart tracker" with a "creative painter." It watches a video, identifies an object, and then uses its imagination (guided by strict rules) to draw the invisible parts of that object as it moves behind obstacles. It allows computers to finally understand that just because you can't see something, it doesn't mean it's gone.