MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

MMTok is a novel method that improves the inference efficiency of Vision-Language Models. It formulates vision token selection as a maximum coverage problem, using complementary information from both the vision and text modalities to prune redundant tokens while preserving high performance.

Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

Published 2026-03-04

Imagine you have a very smart, but slightly overwhelmed, assistant (the AI) who is trying to understand a picture you just showed them.

In the world of Vision-Language Models (VLMs), this assistant doesn't just "see" the picture as a whole image. Instead, it breaks the image down into thousands of tiny puzzle pieces called "vision tokens."

The Problem: Too Much Clutter

Think of these tokens like a room filled with 2,880 tiny sticky notes, each describing a tiny part of the image. If you ask your assistant, "What is the dog doing?", it has to read through all 2,880 sticky notes to find the few that actually talk about the dog.

This is slow and inefficient. It's like trying to find a specific needle in a haystack by reading every single piece of straw. Most current methods try to throw away some sticky notes to speed things up, but they often do it blindly:

  • Method A might just throw away the notes that look "boring" (low attention).
  • Method B might just look at the text you typed and try to guess which notes match, ignoring the rest of the picture.

The problem is that these methods are unimodal (using only one type of information). They miss the big picture.

The Solution: MMTok (The Smart Librarian)

The authors of this paper propose MMTok, a new way to clean up the room. It acts like a super-efficient librarian who uses two clues at once to decide which sticky notes to keep:

  1. The Text Clue (What you asked): "I want to know about the dog."
  2. The Visual Clue (What's in the room): "Even if you didn't ask about the dog, there are other important things in this picture, like the tree or the sky, that give context."

The "Coverage" Analogy

Imagine you are packing a suitcase for a trip, but you can only fit 4 items (tokens) instead of the usual 2,880.

  • Old methods might just pick the 4 items that match your packing list (the text). If you asked for "socks," they pack socks. But if you forgot to mention "shoes," they might leave the shoes behind, even though they are crucial for the trip.
  • MMTok uses a strategy called Maximum Coverage. It asks two questions simultaneously:
    1. "Do these 4 items cover everything I asked for?" (Text-Vision Coverage)
    2. "Do these 4 items cover the most important parts of the entire room?" (Vision-Vision Coverage)

It picks the 4 items that are the best "representatives" of the whole room and answer your specific question. It's like picking a "highlight reel" of the image that tells the whole story, not just the part you explicitly mentioned.
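To make the two coverage questions concrete, here is a minimal pure-Python sketch. The toy 2-D "embeddings" and the use of cosine similarity are illustrative assumptions for this post, not the paper's actual implementation: a kept subset is scored by how well it covers the text tokens (Text-Vision Coverage) and how well it covers the full image (Vision-Vision Coverage).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coverage(kept, targets):
    """Each target token is credited with its best similarity to any
    kept token; summing those credits gives the coverage score."""
    return sum(max(cosine(t, k) for k in kept) for t in targets)

# Toy 2-D embeddings (hypothetical values): 6 vision tokens, 2 text tokens.
vision = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7), (-1.0, 0.2)]
text = [(1.0, 0.1), (0.2, 1.0)]

kept = [vision[0], vision[2]]          # keep 2 of the 6 vision tokens
text_cov = coverage(kept, text)        # "do they cover the question?"
vision_cov = coverage(kept, vision)    # "do they cover the whole image?"
score = text_cov + vision_cov          # combined multimodal objective
```

A subset that only chases the text would maximize `text_cov` alone; MMTok's idea is that the combined score rewards subsets that also stand in for the rest of the picture.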

How It Works (The Magic Trick)

The paper describes a "greedy algorithm" (a step-by-step recipe that always makes the best available choice at each step).

  1. It looks at all the sticky notes.
  2. It picks the one that helps answer your question and represents the image best.
  3. It picks the next one that fills in the biggest gaps left by the first one.
  4. It repeats this until it has the perfect small team of tokens.

Because the coverage objective has a mathematical property called "submodularity" (each extra token adds less and less new coverage), this greedy recipe is guaranteed to land within a constant factor (roughly 1 − 1/e, about 63%) of the best possible selection, and it runs very quickly, without needing to retrain the AI model from scratch.
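The four steps above can be sketched as a greedy loop in plain Python. Again, the toy 2-D embeddings and cosine-similarity scoring are assumptions for illustration, not the paper's exact formulation: each round picks the vision token with the largest marginal gain, i.e. the one that most improves how well the kept set covers both the text tokens and the full image.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_select(vision_tokens, text_tokens, budget):
    """Greedily pick `budget` vision tokens that maximize coverage of
    both the text tokens and the full set of vision tokens."""
    targets = text_tokens + vision_tokens  # cover the question AND the image
    # best[j] = best similarity target j has to any token picked so far
    best = [0.0] * len(targets)
    selected = []
    for _ in range(budget):
        gains = []
        for i, tok in enumerate(vision_tokens):
            if i in selected:
                gains.append(-1.0)  # never re-pick a token
                continue
            # Marginal gain: how much this token improves each target's
            # best match, summed over all targets.
            gain = sum(max(0.0, cosine(t, tok) - best[j])
                       for j, t in enumerate(targets))
            gains.append(gain)
        pick = max(range(len(gains)), key=gains.__getitem__)
        selected.append(pick)
        best = [max(best[j], cosine(t, vision_tokens[pick]))
                for j, t in enumerate(targets)]
    return selected

# Toy 2-D embeddings (hypothetical): 6 vision tokens, 2 text tokens.
vision = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7), (-1.0, 0.2)]
text = [(1.0, 0.1), (0.2, 1.0)]
kept = greedy_select(vision, text, budget=2)
```

Note how the first pick tends to be a "central" token that overlaps many others, while later picks fill the gaps it left, exactly the step-3 behavior described above.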

The Results: Fast and Accurate

The researchers tested this on several famous AI models (like LLaVA and Qwen) and found:

  • Speed: They could cut the number of sticky notes from 2,880 down to just 64 (or even 4 in some cases) and the AI still understood the image almost perfectly.
  • Performance: On some tests, they retained 98.7% of the original performance while keeping only a tiny fraction of the vision tokens.
  • Efficiency: It made the AI run 1.87 times faster on a specific dataset (POPE).

The Takeaway

Think of MMTok as a filter that stops the AI from drowning in data. Instead of reading the whole encyclopedia to answer a simple question, it learns to read just the right few pages that contain the answer and the necessary context.

By combining what you asked with what is visually important, MMTok makes AI vision faster, cheaper, and just as smart, without needing to teach the AI anything new. It's the difference between reading a whole book to find a quote versus having a smart index that points you to the exact page instantly.