Imagine you have a magical, black-box video generator. You type in a sentence like, "An alpaca runs on grass as lightning strikes," and out pops a beautiful, moving video. But here's the problem: How does the machine actually know what to move and when to move it? It's like watching a magician pull a rabbit out of a hat, but you have no idea how the trick works.
This paper introduces a new tool called IMAP (Interpretable Motion-Attentive Maps) that acts like X-ray vision for these video generators. It lets us see exactly which parts of the video the AI is focusing on when it thinks about specific words like "running" or "lightning."
Here is a simple breakdown of how it works, using some everyday analogies:
1. The Problem: The "Black Box"
Video AI models (called Video DiTs) are incredibly smart, but they are also opaque. They take text and turn it into video, but we can't see their "thought process."
- The Analogy: Imagine a chef cooking a complex dish. You taste the food and it's amazing, but you don't know which specific spice made it taste like "garlic" or which ingredient made it "spicy." You just know the final result.
- The Goal: The authors wanted to peek into the chef's mind to see exactly when and where the "garlic" (or in this case, the "running") was added.
2. The First Tool: GramCol (The "Spotlight")
Before they could find the motion, they needed a way to find any object mentioned in the text. They created a method called GramCol.
- How it works: The AI breaks the video down into tiny puzzle pieces (tokens). GramCol asks the AI: "Which puzzle piece looks most like the word 'alpaca'?"
- The Analogy: Imagine you have a giant, blurry photo of a crowd. You tell a friend, "Find the person wearing a red hat." Your friend doesn't just guess; they use a special scanner that highlights every pixel that feels like a red hat. Suddenly, the person in the red hat glows, and the rest of the crowd fades away.
- The Result: This creates a "saliency map"—a heat map showing exactly where the AI sees the "alpaca" or the "grass."
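The "scanner" idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual GramCol implementation: it scores each video token by its cosine similarity to a word's embedding, then normalizes the scores into a heat map. All function names and the toy vectors are made up for the example.

```python
# Hypothetical sketch of a GramCol-style saliency map: score each video token
# by how similar it is to a text token's embedding. Names are illustrative.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def saliency_map(video_tokens, text_token):
    """One similarity score per video token; higher = 'looks more like' the word."""
    scores = [cosine(tok, text_token) for tok in video_tokens]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    # Normalize to [0, 1] so the map can be rendered as a heat map.
    return [(s - lo) / span for s in scores]

# Toy example: token 0 aligns with the word's embedding, token 2 opposes it.
tokens = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]
word = [1.0, 0.0]
print(saliency_map(tokens, word))  # token 0 glows brightest, token 2 fades away
```

In a real Video DiT the "tokens" are high-dimensional patch embeddings and the similarity comes from the model's own cross-attention, but the shape of the computation is the same: one score per token, rendered as a heat map.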
3. The Second Tool: Finding the "Motion Heads" (The "Dance Floor")
Knowing where the alpaca is isn't enough; we need to know when it moves. The AI model has many internal components (called attention heads) that act like specialized "neurons." The authors discovered that some of these heads are specialized for spatial things (where things are), while others are specialized for temporal things (how things move over time).
- The Analogy: Think of the AI model as a massive orchestra. Most musicians are playing the background music (the scenery). But there are a few specific violinists who are only playing the "running" notes.
- The Trick: The authors developed a test to find these "violinists." They looked for the neurons that change the most from one video frame to the next. If a neuron is constantly changing its mind about where things are, it's probably the one handling the motion.
- The Result: They filter out the boring, static neurons and keep only the "motion heads."
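The "find the violinists" test can be sketched as follows. This is a simplified stand-in for the paper's method, under one assumption: a head's attention map is stored per frame, and a head counts as a "motion head" if its map changes a lot from frame to frame. The function names, the `top_k` parameter, and the toy data are all illustrative.

```python
# Hypothetical sketch of the "motion head" filter: keep only attention heads
# whose maps change a lot between consecutive frames. Shapes are assumptions.
def temporal_variability(head_maps):
    """head_maps: one attention map per frame (each a flat list of floats).
    Returns mean absolute frame-to-frame change; static heads score near 0."""
    diffs = []
    for prev, cur in zip(head_maps, head_maps[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(prev))
    return sum(diffs) / len(diffs)

def select_motion_heads(all_heads, top_k=2):
    """Rank heads by temporal variability and keep the top_k 'motion heads'."""
    ranked = sorted(all_heads, key=lambda name: temporal_variability(all_heads[name]),
                    reverse=True)
    return ranked[:top_k]

# Toy example: "static" repeats the same map every frame (boring scenery);
# "motion" shifts its attention between frames (the running alpaca).
heads = {
    "static": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    "motion": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
}
print(select_motion_heads(heads, top_k=1))  # ['motion']
```

The key design choice is that no labels are needed: "changes between frames" is a property the model already exhibits, so the filter just measures it.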
4. The Final Product: IMAP (The "Motion Heat Map")
By combining the "Spotlight" (GramCol) with the "Motion Heads," they created IMAP.
- What it does: It shows you a video where the moving parts glow brightly, and the static parts are dark.
- The Magic: It works without needing to retrain the AI or change its settings. It just reads the AI's existing "thoughts" while it's making the video.
- The Analogy: Imagine watching a movie where, whenever a character starts running, a neon outline appears around their legs. If they stop, the neon fades. If a car drives by, the car glows. You can instantly see who is doing what and when.
Why is this a big deal?
- Trust: It helps us trust the AI. If the AI says "lightning strikes," but the heat map shows the lightning glowing on the ground instead of the sky, we know the AI is confused.
- No Training Needed: It's like a "plug-and-play" tool. You don't need to teach the AI anything new; you just use this new lens to look at what it's already doing.
- Zero-Shot Segmentation: It can even be used to cut out moving objects from a video automatically, like smart scissors that only cut the things that are moving, without needing any human labels.
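The "smart scissors" step reduces to thresholding the heat map into a binary mask. This is a minimal sketch under an assumed fixed threshold; real zero-shot segmentation pipelines typically also clean up the mask, which is omitted here.

```python
# Hypothetical sketch of zero-shot segmentation: threshold the motion heat map
# into a 1/0 foreground mask, with no labels or training. Threshold is assumed.
def segment(heat_map, threshold=0.5):
    """Mark each token as foreground (1) if it glows above the threshold."""
    return [1 if v >= threshold else 0 for v in heat_map]

print(segment([0.9, 0.2, 0.7, 0.1]))  # [1, 0, 1, 0]
```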
Summary
The authors built a pair of glasses that let us see the invisible "motion thoughts" inside video-generating AI. Instead of just watching the final movie, we can now see the director's notes, showing us exactly which parts of the scene the AI decided to animate and when. It turns a magic trick into a transparent, understandable process.