Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

By applying Sparse Autoencoders to DINOv2, this study finds that task-specific concepts exhibit functional specialization and a non-sparse, locally connected geometry. These observations motivate the Minkowski Representation Hypothesis, which posits that vision transformer tokens are formed by convex mixtures of archetypes within conceptual spaces rather than by strict linear sparsity.

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

Published 2026-03-02

Imagine you have a super-smart robot named DINO that has looked at millions of photos but never been told what anything is called. It just "sees" patterns. Now, imagine you want to open up DINO's brain to see how it thinks.

For a long time, scientists thought DINO's brain worked like a library of straight lines. They believed that every idea (like "cat" or "tree") was just a specific direction in a giant, multi-dimensional space. If you wanted to find a "cat," you just pointed in the "cat direction."
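In code, that old "directions" picture is just a dot product. Here is a toy sketch (all vectors are made up for illustration; a real probe would use actual model activations):

```python
import numpy as np

# The old "linear" picture: a concept is just a direction in activation space.
rng = np.random.default_rng(0)
cat_direction = rng.normal(size=64)
cat_direction /= np.linalg.norm(cat_direction)  # unit-length "cat" direction

token = rng.normal(size=64)        # a stand-in for one DINO token
cat_score = token @ cat_direction  # how "cat-like" the token is, on this view
print(cat_score)
```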

This paper says: "Actually, it's not that simple. It's more like a city made of shapes."

Here is the story of what the researchers found, broken down into three parts:

Part 1: The Specialized Workers (What DINO does)

The researchers built a massive dictionary of 32,000 tiny visual concepts (like "fuzzy texture," "sharp edge," or "blue sky") that DINO uses. They found that DINO doesn't use all these concepts for every job; it hires specific teams for specific tasks (a code sketch of the dictionary-building step follows this list):

  • The "Not-Here" Detectives (Classification): When DINO tries to guess what an animal is, it doesn't just look at the animal. It also looks at everything around the animal and says, "This is definitely not the animal." It's like a security guard who identifies a VIP not just by seeing them, but by noticing that everyone else is standing in the wrong spot.
  • The Outline Artists (Segmentation): When DINO needs to cut an object out of a photo, it uses a special team of concepts that only fire along the edges. They are like a painter who only uses a brush to trace the border of a shape.
  • The 3D Guessers (Depth): Even though DINO only sees flat 2D pictures, it has a team of experts that understand depth. They look for shadows, perspective lines (like train tracks meeting in the distance), and blurry textures to guess how far away things are.

Part 2: The Shape of the Brain (How the concepts are arranged)

The researchers expected the 32,000 concepts to be scattered randomly, like marbles in a box. Instead, they found a very organized structure:

  • The "Antipodal" Pairs: Some concepts are opposites that live on the same line. Think of a single ruler where one end is "White Shirt" and the other end is "Black Shirt." They aren't two different directions; they are two ends of the same stick.
  • The "Register" Tokens: DINO has a few special "helper" tokens that don't look at specific parts of the image. Instead, they act like a weather report for the whole picture. They tell DINO things like, "The whole image is blurry," or "The lighting is dim," or "There is motion."
  • The Smooth Map: If you look at how DINO sees a single photo, the concepts don't jump around randomly. They flow smoothly, like a river. If you move from a cat's ear to its nose, the concepts change gradually, not abruptly.
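As a rough sketch of how one might hunt for antipodal pairs: take each concept's decoder direction, normalize it, and look for pairs whose cosine similarity is close to -1. The dictionary below is random stand-in data; in practice you would load the SAE's learned decoder weights.

```python
import numpy as np

# Stand-in for an SAE's learned dictionary: one unit-norm direction per concept.
rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Cosine similarity between every pair of concept directions.
sims = D @ D.T

# Antipodal pairs point in nearly opposite directions: cosine near -1.
rows, cols = np.where(np.triu(sims < -0.95, k=1))
for a, b in zip(rows, cols):
    print(f"concepts {a} and {b} look antipodal (cos = {sims[a, b]:.3f})")
```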

Part 3: The Big Idea (The "Minkowski" Hypothesis)

This is the most creative part. The researchers propose a new way to understand DINO's brain, which they call the Minkowski Representation Hypothesis.

The Old Way (Linear): Imagine you are building a house. You have a pile of straight wooden planks (directions). To build a wall, you just stack the planks.

The New Way (Minkowski/Convex): Imagine you are building a house out of Lego blocks.

  • You have a "Cat" block, a "Brown" block, and a "Fluffy" block.
  • When DINO sees a brown, fluffy cat, it doesn't just point to a "cat" direction. It mixes these blocks together.
  • The final representation is a blend (a convex mixture) of these archetypes.
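In code, a convex mixture is just a weighted average where the weights are non-negative and sum to 1. A tiny numpy sketch with made-up archetype vectors:

```python
import numpy as np

# Made-up 4-dim archetype directions for "cat", "brown", and "fluffy".
archetypes = np.array([
    [1.0, 0.0, 0.0, 0.2],  # "cat"
    [0.0, 1.0, 0.0, 0.1],  # "brown"
    [0.0, 0.0, 1.0, 0.3],  # "fluffy"
])

# Convex weights: non-negative and summing to 1 -- a blend, not a pile-up.
w = np.array([0.5, 0.3, 0.2])
assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)

token = w @ archetypes  # lands inside the convex hull of the archetypes
print(token)            # [0.5  0.3  0.2  0.19]
```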

The "Minkowski Sum" Analogy:
Imagine you have several bags of different shapes (polytopes).

  1. One bag has shapes for Position (Left, Right, Center).
  2. One bag has shapes for Object (Cat, Dog, Car).
  3. One bag has shapes for Lighting (Sunny, Dark).

DINO's brain doesn't pick just one shape and ignore the rest. It takes one shape from each bag and adds them together. The result is the Minkowski sum: a more complex shape whose points are all the ways of adding a point from one shape to a point from the others.
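For the geometrically inclined: the Minkowski sum of two sets A and B is {a + b : a ∈ A, b ∈ B}. A toy numpy sketch with made-up 2-D "Position" and "Lighting" polytopes:

```python
import numpy as np

# Two made-up 2-D polytopes, given by their vertices.
position = np.array([[-1, 0], [1, 0], [0, 1]])  # "Left", "Right", "Center"
lighting = np.array([[0, -1], [0, 1]])          # "Dark", "Sunny"

# Minkowski sum: add every point of one shape to every point of the other.
# For convex polytopes, summing the vertices (then taking the convex hull)
# recovers the full summed shape.
sums = np.array([a + b for a in position for b in lighting])
print(sums)  # 6 candidate vertices of the combined shape
```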

Why does this matter?

  • It explains the "Smoothness": Because DINO is mixing shapes, the transition from "cat" to "dog" is a smooth slide through the space between the shapes, not a jump between two lines.
  • It explains the "Limits": If you try to push DINO to "see" a cat by just turning up the "cat" dial, it eventually stops working. Why? Because you've pushed the shape so far it's no longer a valid "cat" shape anymore; it's broken. You can't just keep going in a straight line forever; you have to stay inside the "shape" of the concept.

The Takeaway

The paper tells us that AI vision isn't just a list of directions. It's a geometric city made of overlapping shapes.

  • Concepts are regions (like a neighborhood), not just points (like a street address).
  • To understand AI, we shouldn't just look for "lines"; we should look for shapes and how they blend together.

It's like realizing that to understand a painting, you shouldn't just look at the individual brushstrokes (the lines); you need to see how the colors blend to form the final image (the shapes).
