TrajTok: Learning Trajectory Tokens enables better Video Understanding

Imagine you are trying to describe a busy scene at a football game to a friend over the phone.

The Old Way (Traditional Video AI):
Currently, most AI models treat a video as a stack of photos taken many times per second, and chop each photo into a grid of small, identical square tiles. To understand a 10-second clip, the AI has to process thousands of these tiles, even when most of them show nothing but empty sky or static grass. It's like trying to describe the game by listing the color of every single blade of grass and every speck of dust in the stadium. It's slow, wastes a lot of energy, and the AI gets overwhelmed by the sheer amount of "noise."

The New Way (TrajTok):
The paper introduces TrajTok, which changes the game by teaching the AI to think like a human observer. Instead of looking at tiny tiles, TrajTok learns to track objects as they move.

Think of it like this:

Old AI: "Here is a red pixel, here is a green pixel, here is a blue pixel... here is a red pixel again..."
TrajTok: "Here is a player running from left to right. Here is the ball flying through the air. Here is the referee walking."

How It Works (The Magic Trick)

The paper proposes a system that does three main things, all in one smooth motion:

1. The "Smart Grouping" (The Trajectory Segmenter)
Imagine a magic spotlight that doesn't just shine on the whole field, but automatically highlights specific players and the ball as they move across the screen.

Old Method: Previous attempts relied on a separate, pre-made tool (like a human editor drawing outlines around the players before the AI could even look). This added a slow extra step, and the outlines could not adapt to the task.
TrajTok's Method: The AI learns to draw these lines itself while it's learning the task. It's like teaching a student to draw the players while they are taking the test, rather than giving them a pre-drawn map. It groups pixels together based on where they are moving, creating "trajectory tokens" (packets of information about a moving object).

2. The "Flexible Summarizer" (The Trajectory Encoder)
Once the AI has grouped the moving objects, it needs to summarize them.

The Problem: Sometimes a player is just standing still (needs a simple summary). Sometimes they are doing a complex flip (needs a detailed summary).
The Solution: TrajTok is flexible. It can use one token to describe a simple movement, or four tokens to describe a complex, twisting motion. It's like having a variable-length sentence: "The ball moved" vs. "The ball spun, bounced, and rolled." This saves space when things are simple but keeps detail when things get complicated.

3. The "End-to-End" Learning
The most important part is that this isn't a separate tool. It's built right into the brain of the AI.

Analogy: Imagine a translator who doesn't just translate words but learns the context of the conversation at the same time. Because TrajTok learns alongside the main AI, it learns exactly what kind of "object tracking" helps the AI answer questions or recognize actions best. It doesn't care about pixel-perfect outlines; it cares about understanding the scene.

Why This Matters (The Results)

The paper shows that this new approach is a game-changer in three ways:

Speed & Efficiency: Because it folds redundant background into a few tokens and spends its effort on the things that move, it processes video faster and uses less computing power. It's like reading a book by following the dialogue closely and skimming the descriptions of the furniture.
Smarter Understanding: When tested on video quizzes and search tasks, TrajTok got significantly higher scores than previous methods. It understands what is happening better because it sees the story of the objects, not just a grid of colors.
Versatility: It can be used in three different ways.
- As a Brain: It can be the main engine for a new video AI (TrajViT2).
- As a Plug-in: It can be added to existing AI brains to make them smarter without retraining the whole thing (TrajAdapter).
- As a Translator: It helps connect video AI to language models (like chatbots), allowing them to answer questions about long videos much better (TrajVLM).

The Bottom Line

TrajTok is like upgrading a video AI from a "pixel counter" to a "storyteller." Instead of getting lost in millions of tiny, redundant details, it learns to follow the actors and the action. It's faster, smarter, and more adaptable, making it possible for computers to understand long, complex videos without needing a supercomputer to do it.

1. Problem Statement

Current video understanding models, predominantly based on Transformers, rely on patchification (splitting video frames into fixed space-time grids) for tokenization. This approach suffers from two critical limitations:

Inefficiency & Redundancy: It generates an excessive number of tokens, many of which are redundant (e.g., static backgrounds), leading to severe memory bottlenecks and high computational costs, especially as video resolution and duration increase.
Rigidity: Fixed token counts prevent models from adapting to the semantic complexity of the input.
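
To make the redundancy argument concrete, here is a minimal sketch of how the token count of a fixed space-time grid grows with clip length. The patch size (16), tubelet length (2), and the clip sizes are illustrative assumptions, not values taken from the paper.

```python
def patch_token_count(num_frames, height, width, patch=16, tubelet=2):
    """Tokens produced by a fixed space-time grid (illustrative sizes only)."""
    return (num_frames // tubelet) * (height // patch) * (width // patch)


def attention_pairs(num_tokens):
    """Full self-attention compares every token with every other token."""
    return num_tokens ** 2


short_clip = patch_token_count(num_frames=16, height=224, width=224)
long_clip = patch_token_count(num_frames=64, height=224, width=224)

print(short_clip, long_clip)                                      # 1568 6272
print(attention_pairs(long_clip) // attention_pairs(short_clip))  # 16
```

Under these assumptions, a clip that is four times longer yields four times as many tokens and sixteen times as many attention pairs, regardless of how much of the scene actually changes.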

Recent trajectory-based approaches (e.g., TrajViT) attempted to solve this by grouping pixels into object trajectories, decoupling video duration from token count. However, these methods rely on external, non-differentiable pipelines (using models like SAM2 and trackers) to generate trajectories. This creates a slow preprocessing bottleneck and fixes the semantic granularity of the tokens regardless of what the downstream task needs (e.g., fine-grained body-part tokens vs. whole-body tokens).

2. Methodology: TrajTok

The authors propose TrajTok, an end-to-end, differentiable trajectory tokenizer that learns to group visual inputs into object trajectories directly within the network, co-trained with the downstream objective.

Core Architecture

TrajTok consists of two differentiable components:

Universal Segmenter (Trajectory Grouping):
- Input: Extracts high-resolution features from video frames using a lightweight patch encoder (ConvNeXt).
- Mechanism: Uses a set of learnable latent queries processed through Perceiver layers with cross-attention to the dense features.
- Soft Segmentation: Generates soft segmentation masks ($M_{soft}$) via softmax over query-feature similarity. Because the masks are soft, gradients can flow back to the segmenter.
- Training: Trained with a combination of Dice Loss and Focal Loss (without standard cross-entropy) to prioritize discovering all object regions over pixel-perfect boundary accuracy. It uses pseudo-ground-truth masks generated by the TrajViT pipeline for pretraining.
- Key Insight: The segmenter prioritizes semantic grouping over pixel-perfect fidelity, trading segmentation accuracy for downstream task performance.
Trajectory Encoder:
- Aggregation: Aggregates patch features into compact tokens based on the segmentation masks.
- Refinement: Uses a second Perceiver module with hard attention masks (derived from $M_{soft}$) to refine the initial soft-aggregated embeddings, ensuring disentangled representations and recovering fine-grained motion details.
- Adaptive Token Count: Inspired by Matryoshka Representations, the encoder can output a variable number of tokens ($n \in \{1, 2, 4\}$) per trajectory. This is achieved by duplicating the initial embedding and using distinct learnable queries (initialized with Fourier positional embeddings to encourage diversity) to capture complementary aspects of the same trajectory.
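
The soft-mask and aggregation steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: similarity is taken to be a scaled dot product, the softmax is taken over the query axis (so each location is shared among queries), hard masks are obtained by argmax, and random arrays stand in for the ConvNeXt features and the Perceiver-refined queries.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_masks(queries, features):
    """Soft masks from query-feature similarity.

    queries: (Q, D) latent queries; features: (N, D) dense patch features.
    Returns (Q, N); each column sums to 1 (assumed: softmax over queries).
    """
    logits = queries @ features.T / np.sqrt(features.shape[1])
    return softmax(logits, axis=0)


def aggregate(masks, features):
    """Mask-weighted average of the features: one (D,) token per query."""
    weights = masks / (masks.sum(axis=1, keepdims=True) + 1e-8)
    return weights @ features


def hard_masks(masks):
    """Assign each location to its highest-scoring query (assumed binarization)."""
    winners = masks.argmax(axis=0)
    return np.eye(masks.shape[0], dtype=bool)[:, winners]


rng = np.random.default_rng(0)
features = rng.normal(size=(8 * 14 * 14, 32))  # stand-in for T*H*W locations, D channels
queries = rng.normal(size=(6, 32))             # stand-in for refined latent queries

m_soft = soft_masks(queries, features)
tokens = aggregate(m_soft, features)
m_hard = hard_masks(m_soft)
print(m_soft.shape, tokens.shape, m_hard.shape)
```

The point of the sketch is the shape change: 1568 locations are summarized by 6 tokens, and every step that produces those tokens is a differentiable matrix operation. Only the argmax used for the hard masks is not, which is consistent with the hard masks serving as attention masks rather than as a training signal.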

Training Strategy

End-to-End Co-training: The segmenter and encoder are trained jointly with the video backbone (e.g., ViT) using a CLIP objective (contrastive learning) and a segmentation loss.
Versatility: TrajTok can be used in three modes:
1. TrajViT2: Training a video encoder from scratch.
2. TrajAdapter: A plug-in adapter for frozen pretrained ViTs to enhance probing performance.
3. TrajVLM: A connector between a Vision Encoder and an LLM for Video-LLM tasks.
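
The co-training objective described above (a CLIP-style contrastive term plus a Dice and Focal segmentation term) can be written out as a forward-only NumPy sketch. The temperature, the focal `gamma`, the Dice smoothing constant, and the `seg_weight` used to combine the terms are assumptions for illustration; the summary does not state how the paper weights them.

```python
import numpy as np


def dice_loss(pred, target, eps=1.0):
    """1 - Dice overlap, averaged over masks. pred, target: (Q, N) in [0, 1]."""
    inter = (pred * target).sum(axis=1)
    denom = pred.sum(axis=1) + target.sum(axis=1)
    return float(np.mean(1.0 - (2.0 * inter + eps) / (denom + eps)))


def focal_loss(pred, target, gamma=2.0, eps=1e-8):
    """Binary focal loss: down-weights locations that are already well classified."""
    p_t = np.where(target > 0.5, pred, 1.0 - pred)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))


def clip_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature

    def cross_entropy_diag(z):
        # Matching pairs sit on the diagonal of the similarity matrix.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return float(0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)))


def total_loss(video_emb, text_emb, pred_masks, pseudo_masks, seg_weight=1.0):
    """Contrastive term plus weighted segmentation term (seg_weight is hypothetical)."""
    seg = dice_loss(pred_masks, pseudo_masks) + focal_loss(pred_masks, pseudo_masks)
    return clip_loss(video_emb, text_emb) + seg_weight * seg


rng = np.random.default_rng(0)
video_emb, text_emb = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
pseudo = (rng.random((3, 20)) > 0.5).astype(float)  # stand-in pseudo-ground-truth masks
pred = pseudo * 0.8 + 0.1                           # a fairly good soft prediction
print(total_loss(video_emb, text_emb, pred, pseudo))
```

In a real training loop these terms would be computed in an autodiff framework so that the contrastive gradient also reaches the segmenter through the soft masks; the NumPy version only shows what is being summed.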

3. Key Contributions

End-to-End Differentiable Tokenization: Eliminates the need for slow, external segmentation/tracking pipelines, making trajectory tokenization fully trainable and adaptable to downstream objectives.
Semantic Adaptability: The tokenizer dynamically adjusts token granularity based on scene complexity and task requirements (e.g., merging background regions, splitting complex objects) rather than using fixed heuristics.
Adaptive Representation: Introduces a mechanism to vary the number of tokens per trajectory, balancing efficiency and expressivity for objects with complex motion.
Versatile Integration: Demonstrates that trajectory tokens are not just for pretraining but serve as effective adapters for frozen encoders and connectors for Vision-Language Models.
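
The adaptive-representation idea (one, two, or four tokens per trajectory) can be illustrated with a small NumPy sketch. It assumes a specific reading of the description: the initial embedding is duplicated, each copy is offset by a distinct query built from Fourier features of its slot index, and a single masked cross-attention step restricted to the trajectory's own locations refines the copies. The function names, the additive combination, and the fixed (non-learned) queries are hypothetical simplifications.

```python
import numpy as np


def fourier_queries(num_slots, dim):
    """Sin/cos features of the slot index: one distinct row per slot (assumed init)."""
    slots = np.arange(1, num_slots + 1)[:, None]   # (n, 1)
    freqs = 2.0 ** np.arange(dim // 2)[None, :]    # (1, dim/2)
    angles = np.pi * slots / (num_slots + 1) * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (n, dim)


def expand_trajectory(embedding, features, member_mask, num_tokens):
    """Return num_tokens refined tokens for one trajectory.

    embedding: (D,) initial soft-aggregated token, D even.
    features: (N, D) patch features; member_mask: (N,) bool hard mask.
    """
    if num_tokens not in (1, 2, 4):
        raise ValueError("num_tokens must be 1, 2, or 4")
    if not member_mask.any():
        raise ValueError("trajectory has no member locations")
    dim = embedding.shape[0]
    # Duplicate the initial embedding, then offset each copy with its own query.
    slots = np.tile(embedding, (num_tokens, 1)) + fourier_queries(num_tokens, dim)
    # Masked cross-attention: locations outside the trajectory get -inf logits.
    logits = slots @ features.T / np.sqrt(dim)
    logits = np.where(member_mask[None, :], logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return slots + attn @ features


rng = np.random.default_rng(0)
features = rng.normal(size=(50, 16))
member_mask = np.zeros(50, dtype=bool)
member_mask[10:20] = True                      # locations belonging to this trajectory
embedding = features[member_mask].mean(axis=0)

for n in (1, 2, 4):
    print(n, expand_trajectory(embedding, features, member_mask, n).shape)
```

Because each copy starts from a different query, the copies attend to the trajectory's locations differently, which is one way several tokens could end up capturing complementary aspects of the same object.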

4. Experimental Results

The paper evaluates TrajTok across three scenarios:

A. TrajViT2 (Pretraining from Scratch)

Performance: Achieves state-of-the-art results on classification and retrieval benchmarks.
- Kinetics-400: +4.8% improvement over standard ViT.
- Something-Something V2: +4.1% improvement.
- Retrieval: Significant gains in video-text retrieval (e.g., +4.1% R@5 on ActivityNet).
Scaling: Exhibits superior scaling behavior compared to TrajViT (which relies on external pipelines) as dataset size increases from 1M to 8M samples.
Efficiency: Inference FLOPs are comparable to the most efficient baselines (like ViViT) and significantly lower than standard patch-based ViTs or TrajViT.

B. TrajAdapter (Feature Adaptation)

Setup: Plugged into frozen VideoMAE-v2 and V-JEPA-2 backbones.
Result: Consistently outperforms linear probing, attentive probing, and Perceiver-only baselines on Kinetics-400 and SSv2.
Finding: The improvement comes from the trajectory priors, not just added parameters. Performance increases with more tokens per trajectory (up to 4).

C. TrajVLM (Vision-Language Models)

Setup: Integrated as a connector in an LLaVA-style model (Qwen3-4B + SigLIP2).
Result: Significantly outperforms patch-pooling baselines on long-video benchmarks (e.g., +8.8% on LongVideoBench).
Insight: TrajTok's semantically structured tokens reduce redundancy and support long-range reasoning better than naive patch pooling, which struggles with long contexts.

5. Significance and Impact

Paradigm Shift: Moves video tokenization from rigid, spatially uniform grids to object-centric, semantically grounded trajectories that are learned end-to-end.
Efficiency vs. Performance: Addresses a trade-off in earlier work, where trajectory methods were accurate but slow (due to external pipelines) and token-reduction methods were fast but lost semantic structure.
Generalizability: Provides evidence that "good enough" semantic grouping (even with imperfect pixel boundaries) is sufficient for high-level video understanding, challenging the assumption that video tasks need pixel-perfect segmentation.
Future Direction: Provides a unified, efficient module that can be seamlessly integrated into pretraining, fine-tuning, and large-scale VLM architectures, paving the way for more scalable and interpretable video AI.