Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

This paper introduces LEO, a streamlined multimodal large language model architecture that employs a lightweight fusion strategy of post-adaptation projectors, tile-level sequence interleaving, and dynamic tiling to significantly enhance visual understanding across diverse benchmarks and specialized domains like autonomous driving.

Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, Krzysztof Czarnecki

Published 2026-03-09

Imagine you are trying to teach a very smart robot (a Large Language Model) how to "see" and understand the world, not just read text. This robot is great at chatting, but when you show it a complex image—like a crowded street scene, a detailed medical chart, or a document with tiny handwriting—it often gets confused or misses the details.

This paper introduces a new robot named Leo and explains how the researchers built a better "eye" for it.

Here is the story of how they did it, using simple analogies.

The Problem: One Eye Isn't Enough

Previously, these robots used a single "vision expert" (a pre-trained computer program) to look at images. It's like asking a generalist doctor to perform heart surgery, read a legal contract, and analyze a fingerprint all at once. The generalist is capable, but misses the fine details.

To fix this, other researchers tried using multiple experts (a "Mixture of Vision Encoders"). Imagine hiring a team: a dermatologist for skin, an optometrist for eyes, and a cardiologist for the heart. But here was the problem: How do you get these experts to talk to each other?

  • Do you make them shout their findings over each other?
  • Do you make them each write a long report and paste the reports together?
  • Do you make them sit in a circle and discuss every single detail?

Most previous methods were messy, slow, or lost important details in the process.

The Solution: The "Leo" Recipe

The researchers, Mozhgan and her team, ran a series of experiments to find the perfect way to combine these experts. They discovered a "lightweight recipe" that works like magic. They call their new robot Leo.

Here are the three secret ingredients of Leo's success:

1. The "Puzzle Piece" Strategy (Dynamic Tiling)

The Old Way: Trying to look at a giant, high-resolution photo (like a 4K movie poster) all at once is like trying to swallow a whole watermelon in one bite. You choke, or you miss the seeds.
Leo's Way: Leo cuts the image into puzzle pieces (tiles) based on the shape of the picture. If the image is a tall skyscraper, he cuts it into tall strips. If it's a wide landscape, he cuts it into wide slices.

  • The Analogy: Instead of staring at the whole forest at once, Leo looks at one tree, then the next, then the next, but he also keeps a tiny "thumbnail" photo of the whole forest in his pocket so he never loses track of the big picture. This lets him see tiny details (like a bird on a branch) without getting overwhelmed.
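The tiling idea can be sketched in a few lines of Python. This is an illustrative heuristic, not LEO's exact rule: the tile budget, the grid-selection criterion, and the `dynamic_tiles` name are all assumptions for the sketch. The key point it shows is that the grid shape follows the image's aspect ratio, and a whole-image "thumbnail" view is always appended.

```python
def dynamic_tiles(width, height, max_tiles=6):
    """Pick a tile grid whose shape matches the image's aspect ratio.

    Returns a list of (left, top, right, bottom) crop boxes, followed by
    one final box covering the whole image (the low-res 'thumbnail').
    The grid-selection heuristic here is an illustrative assumption.
    """
    aspect = width / height
    best, best_err = (1, 1), float("inf")
    # Try every grid of up to max_tiles tiles; keep the one whose
    # cols/rows ratio is closest to the image's aspect ratio.
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            err = abs(cols / rows - aspect)
            if err < best_err:
                best_err, best = err, (cols, rows)
    cols, rows = best
    tw, th = width // cols, height // rows
    boxes = [(c * tw, r * th, (c + 1) * tw, (r + 1) * th)
             for r in range(rows) for c in range(cols)]
    boxes.append((0, 0, width, height))  # the global thumbnail view
    return boxes
```

A tall 400x1200 "skyscraper" image comes back as a 1x3 stack of tall strips plus the thumbnail, while a wide 1200x400 landscape becomes a 3x1 row of slices, which is exactly the shape-aware cutting described above.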

2. The "Braided Hair" Strategy (Token Interleaving)

The Old Way: When the experts (the vision encoders) send their notes to the robot's brain, previous methods would just stack them. Expert A writes a long list, then Expert B writes a long list. The brain has to read A's whole list before understanding B's point. It's like reading two separate books and trying to compare them only after finishing both.
Leo's Way: Leo takes the notes from Expert A and Expert B and braids them together, like hair.

  • The Analogy: Instead of "Expert A says X, Y, Z... then Expert B says 1, 2, 3...", Leo says "Expert A says X, Expert B says 1, Expert A says Y, Expert B says 2."
  • Why it works: This allows the robot's brain to instantly compare the two experts' thoughts side-by-side for every single part of the image. It creates a much richer, more balanced understanding.
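The braiding can be sketched as a tile-level interleaving of two token sequences. Note the paper's summary describes interleaving at the tile level (both encoders' tokens for tile 1, then both for tile 2, and so on); the exact ordering inside each tile is an assumption here, and `interleave_tokens` is a hypothetical name.

```python
def interleave_tokens(tokens_a, tokens_b):
    """Interleave two encoders' token sequences tile by tile.

    tokens_a / tokens_b: lists with one entry per image tile, each entry
    being that encoder's token list for the tile. Instead of emitting
    all of A's tokens and then all of B's, we alternate per tile, so the
    LLM sees both experts' views of the same region side by side.
    """
    merged = []
    for tile_a, tile_b in zip(tokens_a, tokens_b):
        merged.extend(tile_a)  # encoder A's view of this tile
        merged.extend(tile_b)  # encoder B's view of the same tile
    return merged
```

Compare the result with plain concatenation: for two tiles, concatenation gives `a1 a2 a3 a4 b1 b2`, while the braided sequence gives `a1 a2 b1 a3 a4 b2`, keeping each region's two descriptions adjacent.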

3. The "Specialized Translators" Strategy (Post-Adaptation Fusion)

The Old Way: Imagine Expert A speaks "Medical" and Expert B speaks "Legal." In old systems, you forced them to translate their thoughts into "Robot Language" before they could talk to each other. This often made them lose their unique vocabulary.
Leo's Way: Leo gives each expert their own personal translator.

  • The Analogy: Expert A translates their "Medical" notes into "Robot Language" perfectly. Expert B translates their "Legal" notes into "Robot Language" perfectly. Only then does Leo let them mix their notes together.
  • Why it works: This ensures that the unique, specialized knowledge of each expert is preserved before they are combined. It's like having two chefs prepare their own ingredients perfectly before mixing them into one final dish, rather than mixing the raw ingredients first and hoping for the best.
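The "personal translator" idea corresponds to giving each encoder its own projector into the LLM's embedding space, and only fusing after projection. A minimal NumPy sketch, assuming made-up feature widths and a plain linear layer standing in for each projector (real projectors are typically small MLPs):

```python
import numpy as np

rng = np.random.default_rng(0)
llm_dim = 32  # width of the LLM's token embeddings (made up here)

def make_projector(in_dim, out_dim):
    """One adapter per encoder (a toy linear stand-in for an MLP projector)."""
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda tokens: tokens @ w

# Two encoders with different native feature widths (dims are illustrative).
proj_a = make_projector(64, llm_dim)  # e.g. a general-purpose encoder
proj_b = make_projector(96, llm_dim)  # e.g. a detail-oriented encoder

feats_a = rng.standard_normal((16, 64))  # 16 visual tokens from encoder A
feats_b = rng.standard_normal((16, 96))  # 16 visual tokens from encoder B

# Post-adaptation fusion: each encoder's features are translated into the
# LLM's space FIRST, and only then are the sequences combined.
fused = np.concatenate([proj_a(feats_a), proj_b(feats_b)], axis=0)
print(fused.shape)  # (32, 32): 16 + 16 tokens, each in the LLM's space
```

The design point is the ordering: because each encoder keeps its own projector, neither has to squeeze its features through a translator tuned for the other, which is the "prepare your own ingredients before mixing" idea above.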

The Results: Leo is a Super-Student

The researchers tested Leo on 11 different challenges, from reading tiny text on a license plate (OCR) to understanding complex math charts and even driving a car.

  • Better than the rest: Leo beat almost every other robot that uses multiple vision experts, even though Leo uses less data and fewer computer resources to train.
  • The "Driver" Test: They even tested Leo in the world of self-driving cars. Without changing a single line of code, Leo could look at a road, see a pedestrian, and decide, "I need to stop." It proved that Leo isn't just a lab experiment; it can handle real-world, messy situations.

The Bottom Line

The paper teaches us that you don't need to build a massive, expensive, complicated robot to see better. Sometimes, you just need to organize your team better.

By cutting images into smart puzzle pieces, braiding the experts' thoughts together, and letting them translate their own ideas before mixing, Leo became a master of visual understanding. It's a reminder that in AI, how you combine information is often more important than how much information you have.