Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

This paper challenges the conventional wisdom of patchification in Vision Transformers by demonstrating through extensive experiments that reducing patch sizes down to 1x1 pixels consistently improves predictive performance across various tasks and architectures, enabling models to scale visual sequences to over 50,000 tokens while achieving competitive accuracy.

Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

Published 2026-02-23

The Big Idea: Stop Squinting at the Picture

Imagine you are trying to understand a complex painting, like a masterpiece in a museum.

The Old Way (The "Patch" Method):
For the last few years, computer vision models (AI that sees) have looked at paintings by cutting them into big, chunky squares (patches of 16x16 pixels each) and summarizing each square with a single "word" (a token).

  • Analogy: Imagine you are describing a photo of a cat to a friend, but you are only allowed to look at the photo through a thick, blurry grid. You see a square that says "furry," another that says "orange," and another that says "tail." You miss the whiskers, the specific pattern of the fur, and the exact shape of the eyes. You are forced to guess the details.

The New Discovery:
The researchers in this paper asked: "What if we stop squinting? What if we look at every single pixel (the tiniest dot of color) as its own piece of information?"

They found a surprising rule: The more detail you give the AI, the smarter it gets. Even when they went from looking at big blocks down to looking at every single pixel, the AI got better at its job.


The Three Big Surprises

1. The "Pixel is a Token" Rule

In the past, AI models were designed to be efficient. They compressed images to save memory, much like zipping a folder to save space on your hard drive. But this paper shows that un-zipping the file actually makes the AI smarter.

  • The Analogy: Think of a jigsaw puzzle.
    • Old Way: You glue 16 puzzle pieces together into one big chunk and treat that chunk as a single piece. You lose the picture on the edges.
    • New Way: You treat every single puzzle piece as its own unique piece.
    • Result: The AI can solve the puzzle (recognize the image) much better because it has access to the fine details it was previously throwing away.
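The jigsaw idea above is just patchification: an image is cut into non-overlapping squares, and each square is flattened into one token. Here is a minimal sketch in NumPy (the function name and shapes are illustrative, not from the paper) showing how shrinking the patch size from 16x16 down to 1x1 turns the same image into many more, much finer-grained tokens:

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an H x W x C image into non-overlapping square patches,
    each flattened into one token vector."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Carve the image into a (gh x gw) grid of patches...
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # ...then flatten each patch into one token.
    return patches.reshape(gh * gw, patch_size * patch_size * c)

image = np.zeros((224, 224, 3))
print(patchify(image, 16).shape)  # (196, 768)  -- the classic ViT setup
print(patchify(image, 1).shape)   # (50176, 3)  -- one token per pixel
```

With 16x16 patches, a 224x224 image becomes 196 coarse tokens; with 1x1 patches, the same image becomes 50,176 tokens, each carrying a single pixel's raw color.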

2. The "Decoder" Becomes Unnecessary

Usually, AI models have two parts:

  1. The Encoder: The brain that looks at the image.
  2. The Decoder: A special tool at the end that tries to "fix" the blurry details the brain missed.
  • The Analogy: Imagine a translator (the Encoder) who speaks a foreign language but has a bad microphone. Usually, you need a second person (the Decoder) to sit next to them and whisper the missing words so the translation makes sense.
  • The Discovery: The researchers found that if you give the translator a perfect, crystal-clear microphone (by using tiny 1x1 patches instead of big blocks), they don't need the second person anymore! The "Decoder" becomes useless because the main brain is already seeing everything clearly. This simplifies the whole machine.

3. The "50,000 Token" Marathon

Because modern computer chips (GPUs) are getting incredibly powerful, the researchers were able to feed the AI a sequence of 50,176 tokens (one token per pixel of a 224x224 image) for a single image.

  • The Analogy: Previously, an AI could only read a short summary of a book (196 words). Now, they made it read the entire book, word-for-word, without skipping a single sentence.
  • The Result: On a standard test (ImageNet), this "pixel-by-pixel" approach boosted the AI's accuracy to 84.6%, beating the old methods significantly.

Why Didn't We Do This Before?

You might ask, "If looking at every pixel is so great, why didn't we do it 5 years ago?"

The Answer: It was too heavy.

  • The Analogy: Imagine trying to carry a library of books in your backpack.
    • Old Hardware: Five years ago, our "backpacks" (computer memory) were small. We had to compress the books (images) into summaries just to fit them in.
    • New Hardware: Today, we have giant backpacks and super-fast trucks (modern GPUs and efficient algorithms). We can finally carry the whole library without crushing it.

The paper proves that the limitation wasn't the idea of looking at pixels; it was just the cost of doing so. Now that the cost has dropped, we can stop compressing our images.
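The cost argument above can be made concrete with a little arithmetic. Self-attention compares every token with every other token, so its work grows with the square of the sequence length; this back-of-the-envelope sketch (a simplified cost model, not the paper's exact accounting) shows why pixel-level tokens were out of reach until recently:

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length for a square image split into square patches."""
    return (image_size // patch_size) ** 2

old = num_tokens(224, 16)  # 196 tokens: the classic ViT setting
new = num_tokens(224, 1)   # 50,176 tokens: one per pixel

# Self-attention cost scales roughly with (sequence length)^2,
# so the token ratio gets squared.
print(new // old)         # 256x more tokens
print((new // old) ** 2)  # 65,536x more pairwise attention work
```

A 256x longer sequence means roughly 65,536x more pairwise attention computation, which is why this only became practical once GPUs and attention algorithms caught up.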

The Takeaway

This paper suggests a new era for AI vision. Instead of forcing computers to "squint" at images to save energy, we should let them see everything.

  • Old Philosophy: "Compress the image to make it faster, even if we lose some details."
  • New Philosophy: "Give the AI every single detail. Let the hardware do the heavy lifting. The more we see, the smarter we get."

It's a shift from efficient guessing to perfect observation.
