Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

This paper demonstrates that State Space Models (SSMs) can serve as strong, efficient alternatives to traditional Vision Transformers as visual backbones in Vision-Language Models: at a smaller scale, and with simple stabilization strategies, they achieve competitive performance on VQA and grounding tasks.

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

Published 2026-03-20

Imagine you have a brilliant Language Expert (a Large Language Model, or LLM) who can write poetry, tell jokes, and answer complex questions. But this expert is blind. To help them see the world, you hire a Visual Assistant (a Vision Encoder) to describe images to them.

For a long time, the industry standard for this Visual Assistant has been the Transformer (specifically the ViT family). Think of the Transformer as a highly organized librarian who looks at a picture by breaking it into tiny, uniform squares (patches) and then trying to understand how all those squares relate to each other at once. It's powerful, but it can sometimes get a bit "scattered" when trying to pinpoint exactly where something is in a photo.

This paper asks a simple but revolutionary question: "Do we really need the librarian? Can we hire a different kind of assistant instead?"

The authors propose hiring a State Space Model (SSM), specifically a model called VMamba. If the Transformer is a librarian, think of VMamba as a scout who walks through the image in a specific, winding path, constantly updating their mental map of the scene as they go.
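The scout's "winding path" corresponds to how VMamba serializes a 2D image into sequences before scanning it. A minimal sketch of the idea, using four directional traversal orders over a toy patch grid (a simplified illustration; VMamba's actual cross-scan module is more involved than this):

```python
import numpy as np

# Toy 3x3 grid of patch indices standing in for image patches.
grid = np.arange(9).reshape(3, 3)

# Four directional traversal orders, in the spirit of VMamba's cross-scan:
scans = {
    "left_to_right": grid.flatten(),        # row-major
    "right_to_left": grid.flatten()[::-1],  # reversed row-major
    "top_to_bottom": grid.T.flatten(),      # column-major
    "bottom_to_top": grid.T.flatten()[::-1],# reversed column-major
}

for name, order in scans.items():
    print(name, order.tolist())
```

Scanning the same grid in several directions lets a sequential model see every patch in multiple spatial contexts, rather than in a single fixed order.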

Here is the breakdown of their findings, using simple analogies:

1. The "Matched" Test: Same Budget, Different Skills

The researchers set up a fair fight. They took the same Language Expert, the same training instructions, and the same image size. They just swapped the Visual Assistant.

  • The Result: The VMamba scout outperformed the Transformer librarian in almost every category.
  • The Surprise: The VMamba assistant was actually smaller (fewer parameters) than some of the top-performing Transformers, yet it did a better job. It was like hiring a lean, agile runner instead of a heavy, slow tank, and the runner won the race.

2. The "Localization" Problem: Finding the Needle

One of the hardest tasks for AI is localization. If you ask, "Where is the red ball?", the model needs to point to the exact spot, not just say "There is a ball."

  • The Transformer's Struggle: Because the Transformer looks at the whole picture at once, it sometimes gets the "gist" right but loses the specific location. It's like a tourist who knows they are in Paris but can't tell you exactly which street corner the Eiffel Tower is on.
  • VMamba's Strength: Because VMamba scans the image in a structured, directional way, it keeps a much sharper mental map of where things are. In the tests, VMamba was significantly better at pointing out specific objects and regions.
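The "mental map" in the analogy is a hidden state that gets updated one patch at a time as the scan proceeds. A toy linear state-space recurrence shows the generic form (this is the textbook SSM update, not VMamba's exact selective-scan parameterization):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a linear SSM: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t,
    over a sequence of patch features x with shape (seq_len, d_in)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                # one step per patch, in scan order
        h = A @ h + B @ x_t      # state carries a summary of where we've been
        ys.append(C @ h)         # emit a feature for this position
    return np.stack(ys)

rng = np.random.default_rng(0)
seq = rng.normal(size=(9, 4))    # 9 patches, 4-dim features each
A = 0.9 * np.eye(8)              # decaying memory over the scan
B = 0.1 * rng.normal(size=(8, 4))
C = rng.normal(size=(6, 8))
out = ssm_scan(seq, A, B, C)
print(out.shape)                 # (9, 6): one output per patch position
```

Because each output is produced at a specific step of the traversal, position is baked into the computation itself, which is one intuition for the sharper localization behavior.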

3. The "Bigger Isn't Better" Myth

Usually, in AI, if you make a model bigger and train it on more data, it gets smarter. The paper finds that, for vision encoders plugged into Vision-Language Models, this rule can break.

  • The Analogy: Imagine training a student to take a test. If you force them to memorize the exact answer key for a specific practice test (ImageNet classification), they might get a perfect score on that test. But when you give them a real-world problem (like describing a messy room), they fail because they only learned to recognize the "category" of the object, not its shape or location.
  • The Finding: Some massive, high-scoring Transformer models actually performed worse at describing images than smaller ones. They had "over-specialized" on just recognizing what an object is, forgetting how to describe where it is.

4. The "Collapse" and the Fix

The researchers discovered a weird glitch. When they tried to train the Visual Assistants to be experts at detection (finding objects in a grid), some of them suddenly "collapsed" and became terrible at describing images.

  • Why? They found two bottlenecks:

    1. The Transmission Pipe: The connector between the Visual Assistant and the Language Expert was too narrow to pass all the detailed spatial information.
    2. The Shape Mismatch: The images were being fed in weird, stretched-out shapes (like a long rectangle), which confused the Language Expert.
  • The Fix: They didn't need to rebuild the whole system. They just:

    • Widened the pipe (made the connector stronger).
    • Squared the image (fed the image in a square shape instead of a rectangle).
    • Result: The "collapsed" models suddenly started working perfectly again, often beating the original models.
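In implementation terms, the two fixes amount to a higher-capacity projector between the encoder and the LLM, and a square input shape instead of a stretched rectangle. A hedged sketch in NumPy (the module names, sizes, and the 2-layer-MLP choice are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class Connector:
    """Projects vision features into the LLM embedding space.
    A 2-layer MLP is a common 'wider pipe' upgrade over a single linear map."""
    def __init__(self, vision_dim=64, llm_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = 0.02 * rng.normal(size=(vision_dim, llm_dim))
        self.w2 = 0.02 * rng.normal(size=(llm_dim, llm_dim))
    def __call__(self, feats):
        return gelu(feats @ self.w1) @ self.w2

def square_shape(h, w):
    """'Square the image': make the input side lengths equal (e.g. by padding
    or resizing the short side) instead of feeding a stretched rectangle."""
    side = max(h, w)
    return side, side

feats = np.zeros((196, 64))        # 196 vision tokens from the encoder
print(Connector()(feats).shape)    # (196, 256): tokens in LLM space
print(square_shape(448, 224))      # (448, 448)
```

The point of the sketch is that neither fix touches the encoder or the LLM themselves, matching the paper's claim that the whole system did not need to be rebuilt.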

The Big Takeaway

This paper suggests that the future of Vision-Language Models doesn't necessarily require bigger, more expensive Transformer models.

Instead, we should look at SSM models (like VMamba) because they are:

  1. More efficient: They do more with less computing power.
  2. Better at spatial reasoning: They know where things are, not just what they are.
  3. Stable: With a few simple tweaks (like fixing the image shape), they are incredibly robust.

In short: The paper argues that we might have been using the wrong tool for the job. We don't need a giant, all-seeing eye; we need a smart, agile scout who knows the terrain. And that scout is the State Space Model.
