Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

This paper demonstrates that State Space Models (SSMs) can serve as strong, efficient alternatives to traditional Vision Transformers as visual backbones in Vision-Language Models: at a smaller scale, and with simple stabilization strategies, they achieve competitive performance on VQA and grounding tasks.

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

Published 2026-03-20

Imagine you have a brilliant Language Expert (a Large Language Model, or LLM) who can write poetry, tell jokes, and answer complex questions. But this expert is blind. To help them see the world, you hire a Visual Assistant (a Vision Encoder) to describe images to them.

For a long time, the industry standard for this Visual Assistant has been the Transformer (specifically the ViT family). Think of the Transformer as a highly organized librarian who looks at a picture by breaking it into tiny, uniform squares (patches) and then trying to understand how all those squares relate to each other at once. It's powerful, but it can sometimes get a bit "scattered" when trying to pinpoint exactly where something is in a photo.

This paper asks a simple but revolutionary question: "Do we really need the librarian? Can we hire a different kind of assistant instead?"

The authors propose hiring a State Space Model (SSM), specifically a model called VMamba. If the Transformer is a librarian, think of VMamba as a scout who walks through the image in a specific, winding path, constantly updating their mental map of the scene as they go.
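The scout's "winding path" corresponds to how VMamba serializes a 2D image into sequences before scanning it. A minimal sketch of the idea, using four directional traversal orders over a toy patch grid (a simplified illustration; VMamba's actual cross-scan module is more involved than this):

```python
import numpy as np

# Toy 3x3 grid of patch indices standing in for image patches.
grid = np.arange(9).reshape(3, 3)

# Four directional traversal orders, in the spirit of VMamba's cross-scan:
scans = {
    "left_to_right": grid.flatten(),        # row-major
    "right_to_left": grid.flatten()[::-1],  # reversed row-major
    "top_to_bottom": grid.T.flatten(),      # column-major
    "bottom_to_top": grid.T.flatten()[::-1],# reversed column-major
}

for name, order in scans.items():
    print(name, order.tolist())
```

Scanning the same grid in several directions lets a sequential model see every patch in multiple spatial contexts, rather than in a single fixed order.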

Here is the breakdown of their findings, using simple analogies:

1. The "Matched" Test: Same Budget, Different Skills

The researchers set up a fair fight. They took the same Language Expert, the same training instructions, and the same image size. They just swapped the Visual Assistant.

  • The Result: The VMamba scout outperformed the Transformer librarian in almost every category.
  • The Surprise: The VMamba assistant was actually smaller (fewer parameters) than some of the top-performing Transformers, yet it did a better job. It was like hiring a lean, agile runner instead of a heavy, slow tank, and the runner won the race.

2. The "Localization" Problem: Finding the Needle

One of the hardest tasks for AI is localization. If you ask, "Where is the red ball?", the model needs to point to the exact spot, not just say "There is a ball."

  • The Transformer's Struggle: Because the Transformer looks at the whole picture at once, it sometimes gets the "gist" right but loses the specific location. It's like a tourist who knows they are in Paris but can't tell you exactly which street corner the Eiffel Tower is on.
  • VMamba's Strength: Because VMamba scans the image in a structured, directional way, it keeps a much sharper mental map of where things are. In the tests, VMamba was significantly better at pointing out specific objects and regions.
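The "mental map" in the analogy is a hidden state that gets updated one patch at a time as the scan proceeds. A toy linear state-space recurrence shows the generic form (this is the textbook SSM update, not VMamba's exact selective-scan parameterization):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a linear SSM: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t,
    over a sequence of patch features x with shape (seq_len, d_in)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                # one step per patch, in scan order
        h = A @ h + B @ x_t      # state carries a summary of where we've been
        ys.append(C @ h)         # emit a feature for this position
    return np.stack(ys)

rng = np.random.default_rng(0)
seq = rng.normal(size=(9, 4))    # 9 patches, 4-dim features each
A = 0.9 * np.eye(8)              # decaying memory over the scan
B = 0.1 * rng.normal(size=(8, 4))
C = rng.normal(size=(6, 8))
out = ssm_scan(seq, A, B, C)
print(out.shape)                 # (9, 6): one output per patch position
```

Because each output is produced at a specific step of the traversal, position is baked into the computation itself, which is one intuition for the sharper localization behavior.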

3. The "Bigger Isn't Better" Myth

Usually, in AI, if you make a model bigger and train it on more data, it gets smarter. The paper finds that, for vision encoders plugged into Vision-Language Models, this rule can break.

  • The Analogy: Imagine training a student to take a test. If you force them to memorize the exact answer key for a specific practice test (ImageNet classification), they might get a perfect score on that test. But when you give them a real-world problem (like describing a messy room), they fail because they only learned to recognize the "category" of the object, not its shape or location.
  • The Finding: Some massive, high-scoring Transformer models actually performed worse at describing images than smaller ones. They had "over-specialized" on just recognizing what an object is, forgetting how to describe where it is.

4. The "Collapse" and the Fix

The researchers discovered a weird glitch. When they tried to train the Visual Assistants to be experts at detection (finding objects in a grid), some of them suddenly "collapsed" and became terrible at describing images.

  • Why? They found two bottlenecks:

    1. The Transmission Pipe: The connector between the Visual Assistant and the Language Expert was too narrow to pass all the detailed spatial information.
    2. The Shape Mismatch: The images were being fed in weird, stretched-out shapes (like a long rectangle), which confused the Language Expert.
  • The Fix: They didn't need to rebuild the whole system. They just:

    • Widened the pipe (made the connector stronger).
    • Squared the image (fed the image in a square shape instead of a rectangle).
    • Result: The "collapsed" models suddenly started working perfectly again, often beating the original models.
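In implementation terms, the two fixes amount to a higher-capacity projector between the encoder and the LLM, and a square input shape instead of a stretched rectangle. A hedged sketch in NumPy (the module names, sizes, and the 2-layer-MLP choice are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class Connector:
    """Projects vision features into the LLM embedding space.
    A 2-layer MLP is a common 'wider pipe' upgrade over a single linear map."""
    def __init__(self, vision_dim=64, llm_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = 0.02 * rng.normal(size=(vision_dim, llm_dim))
        self.w2 = 0.02 * rng.normal(size=(llm_dim, llm_dim))
    def __call__(self, feats):
        return gelu(feats @ self.w1) @ self.w2

def square_shape(h, w):
    """'Square the image': make the input side lengths equal (e.g. by padding
    or resizing the short side) instead of feeding a stretched rectangle."""
    side = max(h, w)
    return side, side

feats = np.zeros((196, 64))        # 196 vision tokens from the encoder
print(Connector()(feats).shape)    # (196, 256): tokens in LLM space
print(square_shape(448, 224))      # (448, 448)
```

The point of the sketch is that neither fix touches the encoder or the LLM themselves, matching the paper's claim that the whole system did not need to be rebuilt.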

The Big Takeaway

This paper suggests that the future of Vision-Language Models doesn't necessarily require bigger, more expensive Transformer models.

Instead, we should look at SSM models (like VMamba) because they are:

  1. More efficient: They do more with less computing power.
  2. Better at spatial reasoning: They know where things are, not just what they are.
  3. Stable: With a few simple tweaks (like fixing the image shape), they are incredibly robust.

In short: The paper argues that we might have been using the wrong tool for the job. We don't need a giant, all-seeing eye; we need a smart, agile scout who knows the terrain. And that scout is the State Space Model.
