Separators in Enhancing Autoregressive Pretraining for Vision Mamba

The paper introduces STAR, an autoregressive pretraining method that uses image separators to quadruple the input sequence length of Vision Mamba, reaching a competitive 83.5% accuracy on ImageNet-1k by effectively leveraging long-range dependencies.

Hanpeng Liu, Zidan Wang, Shuoxi Zhang, Kaiyuan Gao, Kun He

Published 2026-03-05
📖 4 min read · ☕ Coffee break read

Imagine you are trying to teach a very smart, fast robot how to understand pictures. This robot, named Mamba, is a special kind of AI that is incredibly good at reading long stories (like whole books) because it processes information in a straight line, one word after another.

However, there's a problem: Pictures aren't stories. A picture is a 2D grid of pixels. If you try to feed a whole picture to Mamba as a single, short list of words, the robot gets bored and doesn't learn much. It's like asking a marathon runner to sprint only 10 meters; they have the legs to run a mile, but you're only giving them a tiny track.

The researchers in this paper, led by Hanpeng Liu, came up with a clever trick to fix this. They call their method STAR (SeparaTors for AutoRegressive pretraining).

Here is how it works, broken down into simple analogies:

1. The Problem: The "Short Story" Trap

Usually, when we teach AI to understand images, we show it one picture at a time. We chop the picture into tiny puzzle pieces (patches) and ask the AI to guess the next piece based on the previous ones.

  • The Issue: Even with a big picture, the "story" is still too short for Mamba. It's like reading a single sentence and expecting to understand the whole novel. Mamba's superpower is handling long sequences, but standard image tasks don't give it enough length to shine.
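To make the "puzzle pieces" idea concrete, here is a minimal NumPy sketch of the standard setup: chop an image into patches, flatten each one into a vector, and set up the "guess the next piece" targets. The patch size and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a 1-D sequence of flattened patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .transpose(0, 2, 1, 3, 4)          # group the pixels of each patch together
            .reshape(rows * cols, patch * patch * c))

# Autoregressive pretraining: predict patch t+1 from patches 0..t.
image = np.zeros((224, 224, 3))
seq = patchify(image)            # (196, 768): only 196 "words" per image
inputs, targets = seq[:-1], seq[1:]
```

Note how short the resulting "story" is: a single 224x224 image yields just 196 tokens, which is the "short story" trap the paper is pointing at.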

2. The Solution: The "Bookshelf" Analogy

The researchers realized: Why not feed Mamba a whole shelf of books at once instead of just one page?

They decided to take many different pictures and stack them into one giant, long sequence. But if you just glue pictures together, the robot gets confused. It might think the bottom of a cat's tail is the top of a car's wheel.

Enter the "Separator" (The STAR):
To fix this, they invented a special "bookmark" or "chapter divider."

  • Imagine you are reading a book of short stories. Between every story, there is a blank page with a special symbol (like a star ⭐).
  • In the computer world, this "star" is a specific pattern of zeros and ones that looks nothing like a real picture.
  • How it works: They take 8 different images, put a "Star Bookmark" in front of each one, and glue them all together into one massive, long chain.
    • Star -> Image 1 -> Star -> Image 2 -> Star -> Image 3...

Now, Mamba can run its "marathon." It reads through the long chain, learning that "Oh, when I see a Star, a new picture is starting." This allows it to use its full power to understand long-range connections.
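The "glue eight pictures together with bookmarks" step can be sketched as follows. The separator here is just a fixed constant vector; the paper's actual separator pattern (its "specific pattern of zeros and ones") may differ, and the dimensions are illustrative.

```python
import numpy as np

D = 768                          # patch embedding dimension (illustrative)
SEP = np.full((1, D), -1.0)      # a fixed pattern that looks nothing like a real patch

def build_star_sequence(patch_seqs):
    """Glue per-image patch sequences into one long chain:
    SEP, image-1 patches, SEP, image-2 patches, ..."""
    blocks = []
    for patches in patch_seqs:
        blocks.append(SEP)       # "a new picture is starting"
        blocks.append(patches)
    return np.concatenate(blocks, axis=0)

# Eight 196-patch images -> one chain of 8 * (1 + 196) = 1576 tokens.
seqs = [np.random.randn(196, D) for _ in range(8)]
chain = build_star_sequence(seqs)    # shape (1576, 768)
```

One sequence of 1576 tokens is a far longer "track" than the 196 tokens a single image provides, which is exactly what lets Mamba exercise its long-range modeling.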

3. The "Class Token" Switch

In many AI models, there is a special "summary token" (like a title card) that tells the model what the whole image is about.

  • Old way: The title card was stuck in the middle of the story.
  • STAR way: The researchers moved the title card to the very end of the story.
  • Why? Since the robot has now read the entire long chain of images, it has a much better understanding of the context. Putting the summary at the end lets it use all that new information to make a smarter guess about what the image is.
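The two placements of the "title card" can be sketched side by side. This is a simplified illustration of token ordering only; the actual class-token handling in the models is more involved.

```python
import numpy as np

D = 768
cls_token = np.zeros((1, D))     # learnable summary token (values illustrative)

def with_cls_middle(patches):
    """Old way: class token inserted in the middle of the patch sequence."""
    mid = len(patches) // 2
    return np.concatenate([patches[:mid], cls_token, patches[mid:]], axis=0)

def with_cls_end(patches):
    """STAR way: class token appended at the end, so its state can
    summarize the entire sequence read so far."""
    return np.concatenate([patches, cls_token], axis=0)

patches = np.random.randn(196, D)
assert with_cls_middle(patches).shape == with_cls_end(patches).shape == (197, D)
```

Because Mamba reads strictly left to right, a token at the end has "seen" every patch before it, whereas a token in the middle has only seen half.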

4. The Results: A Faster, Smarter Runner

By using this "long chain of pictures with bookmarks" method:

  • Efficiency: They trained the model much faster than other methods (like Contrastive Learning or MAE). It's like the robot learned the same amount of knowledge in 1/6th of the time.
  • Performance: The model became incredibly accurate at recognizing objects (like cats, cars, or dogs). It reached 83.5% accuracy on a standard test (ImageNet), which is competitive with the best models out there, even though it uses a simpler, lighter architecture.

Summary

Think of STAR as teaching a robot to read by giving it a whole library of books glued together with special dividers, instead of just handing it a single page.

  1. Dividers (Separators): Keep the stories (images) distinct so the robot doesn't get confused.
  2. Long Sequence: Lets the robot use its superpower (handling long data) to learn better.
  3. Moving the Title: Lets the robot use the full context of the story to make a final, smart guess.

The result is a vision model that is faster to train, uses less computing power, and understands the visual world just as well as the heavy, slow giants.