Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

By applying Sparse Autoencoders to DINOv2, this study finds that task-specific concepts exhibit functional specialization and a non-sparse, locally connected geometry. These observations motivate the Minkowski Representation Hypothesis, which posits that vision transformer tokens are formed by convex mixtures of archetypes within conceptual spaces rather than by strict linear sparsity.

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

Published 2026-03-02

Imagine you have a super-smart robot named DINO that has looked at millions of photos but never been told what anything is called. It just "sees" patterns. Now, imagine you want to open up DINO's brain to see how it thinks.

For a long time, scientists thought DINO's brain worked like a library of straight lines. They believed that every idea (like "cat" or "tree") was just a specific direction in a giant, multi-dimensional space. If you wanted to find a "cat," you just pointed in the "cat direction."
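In code, that old "directions" picture is just a dot product. Here is a toy sketch (all vectors are made up for illustration; a real probe would use actual model activations):

```python
import numpy as np

# The old "linear" picture: a concept is just a direction in activation space.
rng = np.random.default_rng(0)
cat_direction = rng.normal(size=64)
cat_direction /= np.linalg.norm(cat_direction)  # unit-length "cat" direction

token = rng.normal(size=64)        # a stand-in for one DINO token
cat_score = token @ cat_direction  # how "cat-like" the token is, on this view
print(cat_score)
```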

This paper says: "Actually, it's not that simple. It's more like a city made of shapes."

Here is the story of what the researchers found, broken down into three parts:

Part 1: The Specialized Workers (What DINO does)

The researchers built a massive dictionary of 32,000 tiny visual concepts (like "fuzzy texture," "sharp edge," or "blue sky") that DINO uses. They found that DINO doesn't use all these concepts for every job; it hires specific teams for specific tasks (a code sketch of the dictionary-building step follows this list):

  • The "Not-Here" Detectives (Classification): When DINO tries to guess what an animal is, it doesn't just look at the animal. It also looks at everything around the animal and says, "This is definitely not the animal." It's like a security guard who identifies a VIP not just by seeing them, but by noticing that everyone else is standing in the wrong spot.
  • The Outline Artists (Segmentation): When DINO needs to cut an object out of a photo, it uses a special team of concepts that only fire along the edges. They are like a painter who only uses a brush to trace the border of a shape.
  • The 3D Guessers (Depth): Even though DINO only sees flat 2D pictures, it has a team of experts that understand depth. They look for shadows, perspective lines (like train tracks meeting in the distance), and blurry textures to guess how far away things are.

Part 2: The Shape of the Brain (How the concepts are arranged)

The researchers expected the 32,000 concepts to be scattered randomly, like marbles in a box. Instead, they found a very organized structure:

  • The "Antipodal" Pairs: Some concepts are opposites that live on the same line. Think of a single ruler where one end is "White Shirt" and the other end is "Black Shirt." They aren't two different directions; they are two ends of the same stick.
  • The "Register" Tokens: DINO has a few special "helper" tokens that don't look at specific parts of the image. Instead, they act like a weather report for the whole picture. They tell DINO things like, "The whole image is blurry," or "The lighting is dim," or "There is motion."
  • The Smooth Map: If you look at how DINO sees a single photo, the concepts don't jump around randomly. They flow smoothly, like a river. If you move from a cat's ear to its nose, the concepts change gradually, not abruptly.
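As a rough sketch of how one might hunt for antipodal pairs: take each concept's decoder direction, normalize it, and look for pairs whose cosine similarity is close to -1. The dictionary below is random stand-in data; in practice you would load the SAE's learned decoder weights.

```python
import numpy as np

# Stand-in for an SAE's learned dictionary: one unit-norm direction per concept.
rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Cosine similarity between every pair of concept directions.
sims = D @ D.T

# Antipodal pairs point in nearly opposite directions: cosine near -1.
rows, cols = np.where(np.triu(sims < -0.95, k=1))
for a, b in zip(rows, cols):
    print(f"concepts {a} and {b} look antipodal (cos = {sims[a, b]:.3f})")
```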

Part 3: The Big Idea (The "Minkowski" Hypothesis)

This is the most creative part. The researchers propose a new way to understand DINO's brain, which they call the Minkowski Representation Hypothesis.

The Old Way (Linear): Imagine you are building a house. You have a pile of straight wooden planks (directions). To build a wall, you just stack the planks.

The New Way (Minkowski/Convex): Imagine you are building a house out of Lego blocks.

  • You have a "Cat" block, a "Brown" block, and a "Fluffy" block.
  • When DINO sees a brown, fluffy cat, it doesn't just point to a "cat" direction. It mixes these blocks together.
  • The final representation is a blend (a convex mixture) of these archetypes.
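In code, a convex mixture is just a weighted average where the weights are non-negative and sum to 1. A tiny numpy sketch with made-up archetype vectors:

```python
import numpy as np

# Made-up 4-dim archetype directions for "cat", "brown", and "fluffy".
archetypes = np.array([
    [1.0, 0.0, 0.0, 0.2],  # "cat"
    [0.0, 1.0, 0.0, 0.1],  # "brown"
    [0.0, 0.0, 1.0, 0.3],  # "fluffy"
])

# Convex weights: non-negative and summing to 1 -- a blend, not a pile-up.
w = np.array([0.5, 0.3, 0.2])
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

token = w @ archetypes  # lands inside the convex hull of the archetypes
print(token)            # [0.5  0.3  0.2  0.19]
```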

The "Minkowski Sum" Analogy:
Imagine you have several bags of different shapes (polytopes).

  1. One bag has shapes for Position (Left, Right, Center).
  2. One bag has shapes for Object (Cat, Dog, Car).
  3. One bag has shapes for Lighting (Sunny, Dark).

DINO's brain doesn't pick just one shape and ignore the rest. It takes one shape from each bag and adds them together. The result is the Minkowski sum: a more complex shape whose points are all the ways of adding a point from one shape to a point from the others.
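For the geometrically inclined: the Minkowski sum of two sets A and B is {a + b : a ∈ A, b ∈ B}. A toy numpy sketch with made-up 2-D "Position" and "Lighting" polytopes:

```python
import numpy as np

# Two made-up 2-D polytopes, given by their vertices.
position = np.array([[-1, 0], [1, 0], [0, 1]])  # "Left", "Right", "Center"
lighting = np.array([[0, -1], [0, 1]])          # "Dark", "Sunny"

# Minkowski sum: add every point of one shape to every point of the other.
# For convex polytopes, summing the vertices (then taking the convex hull)
# recovers the full summed shape.
sums = np.array([a + b for a in position for b in lighting])
print(sums)  # 6 candidate vertices of the combined shape
```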

Why does this matter?

  • It explains the "Smoothness": Because DINO is mixing shapes, the transition from "cat" to "dog" is a smooth slide through the space between the shapes, not a jump between two lines.
  • It explains the "Limits": If you try to push DINO to "see" a cat by just turning up the "cat" dial, it eventually stops working. Why? Because you've pushed the shape so far it's no longer a valid "cat" shape anymore; it's broken. You can't just keep going in a straight line forever; you have to stay inside the "shape" of the concept.

The Takeaway

The paper tells us that AI vision isn't just a list of directions. It's a geometric city made of overlapping shapes.

  • Concepts are regions (like a neighborhood), not just points (like a street address).
  • To understand AI, we shouldn't just look for "lines"; we should look for shapes and how they blend together.

It's like realizing that to understand a painting, you shouldn't just look at the individual brushstrokes (the lines); you need to see how the colors blend to form the final image (the shapes).
