SPoT: Subpixel Placement of Tokens in Vision Transformers

The paper introduces SPoT, a novel tokenization strategy that positions Vision Transformer tokens continuously within images via an oracle-guided search to overcome grid-based limitations, significantly reducing the required token count while improving performance and interpretability.

Martine Hjelkrem-Tan, Marius Aasan, Gabriel Y. Arteaga, Adín Ramírez Rivera

Published 2026-03-09

Imagine you are trying to describe a complex painting to a friend over the phone, but you are only allowed to use 25 words to describe the whole thing.

In the world of AI, this is exactly what happens when we try to make Vision Transformers (the "brains" that let computers see) faster and more efficient. We want them to look at an image and pick out only the most important parts (the "sparse" parts) to save time and energy.

However, current AI models have a weird, rigid rule: They can only look at the image through a fixed grid, like a window with square panes.

The Problem: The "Fork in the Soup" Analogy

The authors of this paper compare the current method to trying to eat soup with a fork.

  • The Soup: The image (which is continuous and fluid).
  • The Fork: The AI's rigid grid of square patches.

If the most important part of the soup (say, a delicious meatball) sits right on the edge of a fork tine, the fork can't scoop it up properly: it either misses the meatball entirely, or it scoops up half the meatball and half the empty space next to it.

In the same way, the AI is forced to choose whole "tiles" of the image, even when the best information is hiding in the cracks between them. This forces awkward compromises, wasting its limited "word count" (tokens) on empty space just to catch the important bits.

The Solution: SPoT (Subpixel Placement of Tokens)

The paper proposes a new method called SPoT. Instead of forcing the AI to look through a fixed grid, SPoT lets the AI place its "eyes" (tokens) anywhere on the image, even in the tiny spaces between pixels (subpixels).

Think of it like switching from a fork to a laser pointer.

  • With a fork (the old way), you have to scoop a whole square.
  • With a laser pointer (SPoT), you can point exactly at the center of the meatball, or exactly at the edge of a clock hand, or right on a ladybug's spot. You don't care about the grid; you care about the feature.
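The paper's actual token-extraction pipeline isn't reproduced in this summary, but the core mechanic (reading the image at a continuous (x, y) position instead of a whole grid cell) can be sketched with plain bilinear interpolation. The function name and setup below are illustrative assumptions, not code from the paper:

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Sample an image at a continuous (x, y) position.

    A grid tokenizer can only read whole patches; interpolating
    between the four surrounding pixels lets a "laser pointer"
    token land exactly on a feature, even between pixel centers.
    """
    h, w = image.shape[:2]
    # Clamp so the 2x2 neighbourhood stays inside the image.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend horizontally along the top and bottom rows, then vertically.
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bot = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bot
```

In a real model the same idea is applied to feature maps rather than raw pixels (libraries such as PyTorch expose it as `grid_sample`), but the principle is identical: the sampling position is a continuous coordinate, so it can be placed, or even optimized, freely.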

How They Tested It: The "Oracle"

To prove this works, the researchers used a clever trick they call an Oracle.
Imagine a super-smart guide (the Oracle) who can look at the image and say, "If you could move your laser pointers anywhere in the world, where would you put them to get the perfect score?"

They found that when the AI is allowed to move its "laser pointers" to these perfect, continuous spots:

  1. It gets much smarter: Even with only 12.5% of the usual number of "words" (tokens), the AI performs better than standard models using the full grid.
  2. It finds the hidden gems: The AI naturally moves its tokens to the most interesting parts of the image (like the face of a person or the wheels of a car) rather than just filling up a grid.
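The paper's exact oracle procedure isn't detailed in this summary. As a hedged illustration only, an oracle-style search can be sketched as simple coordinate ascent: randomly nudge one token's continuous position and keep the move whenever it improves a score (for example, the model's confidence on the true label). All names and the random-nudge strategy here are assumptions for illustration:

```python
import random

def oracle_search(positions, score_fn, step=2.0, iters=50, seed=0):
    """Greedy oracle search over continuous token positions.

    `score_fn(positions)` is assumed to return a quantity the
    oracle wants to maximize (e.g. confidence on the true class).
    Each iteration nudges one token by a subpixel-capable offset
    and keeps the move only if the score improves.
    """
    rng = random.Random(seed)
    best = list(positions)
    best_score = score_fn(best)
    for _ in range(iters):
        i = rng.randrange(len(best))
        dx = rng.uniform(-step, step)
        dy = rng.uniform(-step, step)
        trial = list(best)
        x, y = trial[i]
        trial[i] = (x + dx, y + dy)  # continuous move, no grid snapping
        s = score_fn(trial)
        if s > best_score:
            best, best_score = trial, s
    return best, best_score
```

The key point the sketch captures is that nothing constrains the positions to a grid: the oracle answers "where would the tokens go if they could go anywhere?", which is exactly the upper bound the authors use to show how much grid tokenization leaves on the table.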

Key Discoveries

  • Sparse is Strategic: In the past, having fewer tokens was seen as a limitation. SPoT turns this into a superpower. By being flexible, the AI can do more with less.
  • Context Matters: When the AI has very few tokens, it should focus on the main object (like a cat). But if it has many tokens, it's better to spread them out evenly to see the whole picture (like the background and the cat together).
  • It's Not Just About "Shiny" Objects: The researchers thought the AI would just point at the brightest or most colorful parts. Surprisingly, the AI often places tokens near the object to understand the context, not just on the object. It's like looking at a person's face, but also glancing at their hands to understand what they are doing.

The Big Picture

This paper suggests that we don't need to force AI to see the world in a rigid, pixelated grid. By letting the AI "drift" its focus to the exact right spots, we can build models that are:

  • Faster: They process fewer pieces of data.
  • Smarter: They don't waste energy on empty space.
  • More Flexible: They can adapt to whatever is actually important in the image.

In short, SPoT stops the AI from trying to eat soup with a fork and gives it a laser pointer instead, allowing it to see the world with much greater precision and efficiency.