VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

Imagine you are trying to watch a movie and simultaneously keep track of every character's actions, who they are, and what they are doing, all while the scene is constantly changing. This is essentially what Video Segmentation does for computers: it identifies objects in a video, draws a mask around them, and tracks them as they move from frame to frame.

For a long time, building a system to do this was like building a massive, overly complicated factory. You needed one team of workers just to identify the objects in a single picture (the Segmenter), and a completely separate, highly specialized team just to chase those objects across the movie frames (the Tracker). These "Tracker" teams were complex, slow, and required a lot of computing power, making the whole process sluggish.

The paper "VidEoMT: Your ViT is Secretly Also a Video Segmentation Model" proposes a radical new idea: What if we fired the specialized tracking team and just asked the main factory manager to do everything?

Here is the breakdown of their discovery using simple analogies:

1. The Old Way: The Over-Engineered Factory

Think of the old video segmentation models (like CAVIS) as a factory with two distinct departments:

The Photo Department: Takes a snapshot, identifies a dog, and draws a mask around it.
The Chase Department: A team of detectives who take that mask and run through the next 100 frames, shouting, "That's the same dog! Don't lose him!"

This works well, but it's slow. The "Chase Department" is heavy, complicated, and requires a lot of energy to run.

2. The New Idea: The "Super-Manager" (VidEoMT)

The authors realized that the "Photo Department" (which they call a Vision Transformer or ViT) is actually incredibly smart because it was trained on millions of images beforehand. It already knows what things look like and how they relate to each other.

They asked: Why do we need a separate "Chase Department" if the "Photo Department" is already so smart?

They created VidEoMT, a model that fires the specialized trackers and lets the main "Photo Manager" do the tracking too. It's like hiring a single, highly skilled detective who can not only identify the suspect but also chase them through the whole movie without needing a backup team.

3. How Does the "Super-Manager" Remember?

If you just show a smart detective a new photo every second, they might forget who the dog was in the previous photo. To fix this, VidEoMT uses two clever tricks:

The "Note-Taking" Trick (Query Propagation):
Imagine the detective finishes identifying the dog in Frame 1. Instead of throwing away their notes, they pass a sticky note to the next frame saying, "Hey, keep an eye on this specific dog." This allows the model to carry information forward without needing a separate tracking machine.
The "Newcomer" Trick (Query Fusion):
What if a new dog runs into the scene in Frame 5? The detective needs to know to look for new things, not just the old dog. VidEoMT mixes the "sticky notes" from the past with a fresh set of "search warrants" for new objects. This ensures the model doesn't get stuck looking only at old things and miss new arrivals.

4. The Result: Lightning Fast

The results are staggering. By removing the heavy, specialized tracking machinery and letting the "Super-Manager" (the pre-trained ViT) do the work:

Speed: The new model is 5 to 10 times faster than the old state-of-the-art models. It can process video at up to 160 frames per second (several times faster than the 24 to 30 frames per second at which most video plays), whereas the old models were stuck at around 15 frames per second.
Accuracy: Despite being much simpler and faster, it is nearly as good at finding and tracking objects as the complex, heavy models, and on some benchmarks it is better.
Simplicity: It's like replacing a 50-piece Swiss Army knife with a single, incredibly sharp blade that does everything you need.

The Big Takeaway

The paper argues, with experiments to back it up, that we don't need to build increasingly complex, heavy, and slow machines to track objects in videos. If we use a sufficiently large and well-trained "brain" (the Vision Transformer) and give it a simple way to remember the past (the sticky notes), it can handle the job of both seeing and chasing on its own.

In short: The computer's "brain" was secretly a tracker all along; we just needed to stop over-complicating the system and let it do its job.

1. Problem Statement

Current state-of-the-art (SOTA) online video segmentation models typically rely on a decoupled architecture consisting of two complex components:

A Segmenter: Generates per-frame segmentation masks and class labels.
A Tracker: Matches object queries across frames to maintain temporal consistency.

These systems often incorporate specialized modules such as ViT-Adapters, pixel decoders, context-aware feature extractors, and re-identification (ReID) layers. While effective, this complexity introduces significant architectural overhead and computational cost, limiting inference speeds (often <20 FPS) and hindering real-time applications.
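
To make the tracker's job concrete, the sketch below shows one simple way a decoupled pipeline could associate object queries between two frames: greedy one-to-one matching on cosine similarity. This is an illustrative stand-in, not the CAVIS tracker; the function name `match_queries` and the toy embeddings are assumptions made for this example.

```python
import numpy as np

def match_queries(prev_q: np.ndarray, curr_q: np.ndarray) -> list[int]:
    """Greedy one-to-one matching on cosine similarity (illustrative only).

    Returns, for each previous-frame query, the index of the current-frame
    query it was matched to (-1 if it stayed unmatched).
    """
    prev_n = prev_q / np.linalg.norm(prev_q, axis=1, keepdims=True)
    curr_n = curr_q / np.linalg.norm(curr_q, axis=1, keepdims=True)
    sim = prev_n @ curr_n.T  # (num_prev, num_curr) cosine similarities
    assignment = [-1] * len(prev_q)
    for _ in range(min(sim.shape)):
        # Take the best remaining pair, then retire its row and column.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        assignment[i] = int(j)
        sim[i, :] = -np.inf
        sim[:, j] = -np.inf
    return assignment

# Toy example: the current frame holds the same three embeddings, reordered.
prev_queries = np.eye(3)
curr_queries = prev_queries[[2, 0, 1]]
print(match_queries(prev_queries, curr_queries))  # [1, 2, 0]
```

Real trackers add learned association modules on top of this kind of matching, which is where much of the overhead described above comes from.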

The authors hypothesize that this complexity is redundant. They propose that large-scale pre-trained Vision Transformers (ViTs), specifically those trained with self-supervised objectives (like DINOv2), inherently possess the representational power to perform both segmentation and temporal tracking without the need for specialized downstream modules.

2. Methodology: VidEoMT

The authors propose VidEoMT (Video Encoder-only Mask Transformer), a unified, encoder-only architecture that eliminates the need for separate tracking modules.

Core Design Principles

Encoder-Only Architecture: Unlike traditional methods that use a ViT encoder followed by a complex decoder or tracker, VidEoMT performs all computations (segmentation and temporal association) within a single ViT encoder.
Foundation Model Backbone: The model utilizes a pre-trained ViT (e.g., DINOv2) as the backbone. The authors argue that the strong feature representations learned during pre-training (which encourage consistent features across different views) are sufficient for tracking.
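
As a rough illustration of what "encoder-only" can mean in practice, the toy sketch below processes object queries and patch tokens together in the same Transformer blocks and reads masks off as query-patch dot products. It assumes joint processing of queries and patches, single-head attention, and random weights; the names `attention_block`, `encoder_only_forward`, and `w_cls` are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_block(tokens, w_q, w_k, w_v):
    """Single-head self-attention with a residual connection (toy ViT block)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v

def encoder_only_forward(patches, queries, blocks, w_cls):
    """Run queries and patch tokens through the same blocks, then read out."""
    n_q = queries.shape[0]
    tokens = np.concatenate([queries, patches], axis=0)
    for w_q, w_k, w_v in blocks:
        tokens = attention_block(tokens, w_q, w_k, w_v)
    q_out, p_out = tokens[:n_q], tokens[n_q:]
    mask_logits = q_out @ p_out.T   # one mask (over patches) per query
    class_logits = q_out @ w_cls    # one class score vector per query
    return q_out, mask_logits, class_logits

rng = np.random.default_rng(0)
dim, n_patches, n_queries, n_classes = 16, 64, 5, 3
blocks = [tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))
          for _ in range(2)]
patches = rng.normal(size=(n_patches, dim))
queries = rng.normal(size=(n_queries, dim))
w_cls = rng.normal(size=(dim, n_classes))

q_out, mask_logits, class_logits = encoder_only_forward(patches, queries, blocks, w_cls)
print(mask_logits.shape, class_logits.shape)  # (5, 64) (5, 3)
```

The point of the sketch is structural: there is no decoder and no tracker, only standard attention blocks plus two small read-out operations.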

Key Mechanisms

To enable temporal modeling within an encoder-only framework, VidEoMT introduces two lightweight mechanisms:

Query Propagation:
- Instead of re-initializing learnable queries for every frame (which breaks temporal continuity), the model carries over track queries from the previous frame ( $t-1$ ) to the current frame ( $t$ ).
- At $t=0$ , standard learnable queries are used. For $t>0$ , the output queries from the previous frame are fed back into the ViT encoder as input queries.
- This allows information to flow across time without additional computational cost per frame.
Query Fusion:
- A limitation of pure propagation is the inability to detect newly appearing objects, as the model relies solely on past queries.
- To solve this, VidEoMT employs a Query Fusion strategy. The propagated queries from the previous frame are transformed via a lightweight linear layer and combined (element-wise addition) with a set of temporally-agnostic learned queries.
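- Formula: $Q^F_t = \text{Linear}(Q^S_{t-1}) + Q^{lrn}$ , where $Q^S_{t-1}$ denotes the output queries of frame $t-1$ , $Q^{lrn}$ the temporally-agnostic learned queries, and $Q^F_t$ the fused queries fed to the encoder at frame $t$ .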
- This balances temporal continuity (via propagated queries) with adaptability (via learned queries for new objects).
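
The two mechanisms above can be sketched as a short online loop. This is a minimal sketch under stated assumptions: `QueryFusion`, `run_online`, and `toy_encoder` are hypothetical names, the weights are random rather than trained, and `toy_encoder` is only a placeholder for the ViT so that the example runs on its own.

```python
import numpy as np

class QueryFusion:
    """Q^F_t = Linear(Q^S_{t-1}) + Q^lrn, with plain learned queries at t = 0."""

    def __init__(self, num_queries: int, dim: int, rng: np.random.Generator):
        self.q_lrn = rng.normal(size=(num_queries, dim))        # learned queries
        self.weight = rng.normal(scale=0.02, size=(dim, dim))   # linear layer
        self.bias = np.zeros(dim)

    def __call__(self, prev_out):
        if prev_out is None:
            # First frame: nothing to propagate yet.
            return self.q_lrn.copy()
        # Later frames: transformed previous outputs plus learned queries.
        return prev_out @ self.weight + self.bias + self.q_lrn

def toy_encoder(frame, queries):
    """Placeholder for the ViT: shifts each query by the frame's mean feature."""
    return queries + frame.mean(axis=0)

def run_online(frames, fusion, encoder):
    """Process frames in order, feeding each frame's output queries forward."""
    prev_out, outputs = None, []
    for frame in frames:
        q_in = fusion(prev_out)
        prev_out = encoder(frame, q_in)
        outputs.append(prev_out)
    return outputs

rng = np.random.default_rng(0)
fusion = QueryFusion(num_queries=4, dim=8, rng=rng)
frames = [rng.normal(size=(16, 8)) for _ in range(3)]
outputs = run_online(frames, fusion, toy_encoder)
print(len(outputs), outputs[0].shape)  # 3 (4, 8)
```

Because the number of queries stays fixed and the only added parameters are one linear layer, the per-frame cost in this sketch is essentially that of the encoder itself.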

Simplification Process

The authors validated their approach by systematically stripping components from the SOTA model CAVIS:

Replaced the heavy segmenter with EoMT (Encoder-only Mask Transformer).
Removed Context-Aware Features (convolutional boundary filtering).
Removed Re-identification Layers (contrastive learning MLPs).
Removed the explicit Tracker module entirely, replacing it with the Query Propagation/Fusion mechanism.

3. Key Contributions

Unified Encoder-Only Design: Proposed VidEoMT, which unifies segmentation and temporal association within a single ViT encoder, removing the need for decoupled segmenters and trackers.
Demonstration of Redundancy: Showed that specialized components (ViT-Adapters, ReID layers, context features) are largely redundant when leveraging sufficiently large, pre-trained ViTs.
Lightweight Temporal Mechanisms: Introduced Query Propagation and Query Fusion to enable tracking in an encoder-only setting with negligible architectural complexity.
Efficiency Breakthrough: Achieved SOTA-level accuracy while being 5× to 10× faster than existing methods, reaching up to 160 FPS with a ViT-L backbone.

4. Experimental Results

The model was evaluated on six major benchmarks: YouTube-VIS (2019, 2021, 2022), OVIS, VIPSeg, and VSPW.

Performance vs. CAVIS (SOTA):
- On YouTube-VIS 2019, VidEoMT (ViT-L) achieved 68.6 AP compared to CAVIS's 68.9 AP, while running at 160 FPS vs. CAVIS's 15 FPS (a 10.6× speedup).
- On YouTube-VIS 2022, VidEoMT achieved 42.6 AP vs. CAVIS's 39.5 AP, with a speedup from 15 FPS to 161 FPS.
Video Panoptic Segmentation (VIPSeg): VidEoMT achieved 55.2 VPQ (vs. CAVIS 56.9) at 75 FPS (vs. CAVIS 10 FPS), a 7.5× speedup.
Video Semantic Segmentation (VSPW): VidEoMT outperformed existing methods in both accuracy (mIoU +2.1 over DVIS++) and temporal consistency, running 5× faster.
Ablation Studies:
- Pre-training: The performance gap between VidEoMT and complex models narrows significantly with stronger, larger-scale pre-training (DINOv2/DINOv3/EVA-02) than with smaller-scale pre-training (ImageNet-1K).
- Model Size: As ViT size increases (S $\to$ B $\to$ L), the accuracy gap between VidEoMT and CAVIS decreases, confirming that larger models can better internalize tracking capabilities.
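
The speedups quoted against CAVIS can be checked with simple arithmetic on the FPS figures given in this section. The snippet below uses only those rounded numbers. It yields about 10.7× for YouTube-VIS 2019 rather than the quoted 10.6×; the small difference presumably comes from the original figure being computed on unrounded measurements, which is an assumption here.

```python
# (VidEoMT FPS, CAVIS FPS) as quoted in this section; values are rounded.
fps = {
    "YouTube-VIS 2019": (160, 15),
    "YouTube-VIS 2022": (161, 15),
    "VIPSeg": (75, 10),
}

def speedup(new_fps: float, old_fps: float) -> float:
    """Ratio of throughputs: how many times faster the new model runs."""
    return new_fps / old_fps

for name, (new, old) in fps.items():
    print(f"{name}: {speedup(new, old):.1f}x")
# YouTube-VIS 2019: 10.7x
# YouTube-VIS 2022: 10.7x
# VIPSeg: 7.5x
```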

5. Significance and Impact

Paradigm Shift: The paper challenges the prevailing belief that video segmentation requires complex, multi-stage pipelines. It suggests that foundation models have already learned the necessary priors for tracking, making specialized downstream modules largely redundant.
Real-Time Viability: By achieving 160 FPS with high accuracy, VidEoMT makes high-quality online video segmentation feasible for latency-sensitive applications (e.g., autonomous driving, robotics, live video analytics) where current SOTA models are too slow.
Efficiency: The removal of specialized modules reduces FLOPs and parameter count while drastically improving inference speed through better hardware utilization of standard Transformer blocks.
Simplicity: The architecture is significantly simpler to implement and train, relying on standard ViT components and a simple query fusion strategy.

In conclusion, VidEoMT demonstrates that a "plain" ViT, when sufficiently pre-trained and equipped with lightweight query propagation and fusion, can match or closely approach complex, specialized video segmentation models, and on some benchmarks outperform them, while being 5× to 10× faster.