MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

MMTok is a novel method that improves the inference efficiency of Vision-Language Models. It formulates vision token selection as a maximum coverage problem, using complementary information from both the vision and text modalities to prune redundant tokens while preserving high performance.

Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

Published 2026-03-04

Imagine you have a very smart, but slightly overwhelmed, assistant (the AI) who is trying to understand a picture you just showed them.

In the world of Vision-Language Models (VLMs), this assistant doesn't just "see" the picture as a whole image. Instead, it breaks the image down into thousands of tiny puzzle pieces called "vision tokens."

The Problem: Too Much Clutter

Think of these tokens like a room filled with 2,880 tiny sticky notes, each describing a tiny part of the image. If you ask your assistant, "What is the dog doing?", it has to read through all 2,880 sticky notes to find the few that actually talk about the dog.

This is slow and inefficient. It's like trying to find a specific needle in a haystack by reading every single piece of straw. Most current methods try to throw away some sticky notes to speed things up, but they often do it blindly:

  • Method A might just throw away the notes that look "boring" (low attention).
  • Method B might just look at the text you typed and try to guess which notes match, ignoring the rest of the picture.

The problem is that these methods are unimodal (using only one type of information). They miss the big picture.

The Solution: MMTok (The Smart Librarian)

The authors of this paper propose MMTok, a new way to clean up the room. It acts like a super-efficient librarian who uses two clues at once to decide which sticky notes to keep:

  1. The Text Clue (What you asked): "I want to know about the dog."
  2. The Visual Clue (What's in the room): "Even if you didn't ask about the dog, there are other important things in this picture, like the tree or the sky, that give context."

The "Coverage" Analogy

Imagine you are packing a suitcase for a trip, but you can only fit 4 items (tokens) instead of the usual 2,880.

  • Old methods might just pick the 4 items that match your packing list (the text). If you asked for "socks," they pack socks. But if you forgot to mention "shoes," they might leave the shoes behind, even though they are crucial for the trip.
  • MMTok uses a strategy called Maximum Coverage. It asks two questions simultaneously:
    1. "Do these 4 items cover everything I asked for?" (Text-Vision Coverage)
    2. "Do these 4 items cover the most important parts of the entire room?" (Vision-Vision Coverage)

It picks the 4 items that are the best "representatives" of the whole room and answer your specific question. It's like picking a "highlight reel" of the image that tells the whole story, not just the part you explicitly mentioned.
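To make the two coverage questions concrete, here is a minimal pure-Python sketch. The toy 2-D "embeddings" and the use of cosine similarity are illustrative assumptions for this post, not the paper's actual implementation: a kept subset is scored by how well it covers the text tokens (Text-Vision Coverage) and how well it covers the full image (Vision-Vision Coverage).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coverage(kept, targets):
    """Each target token is credited with its best similarity to any
    kept token; summing those credits gives the coverage score."""
    return sum(max(cosine(t, k) for k in kept) for t in targets)

# Toy 2-D embeddings (hypothetical values): 6 vision tokens, 2 text tokens.
vision = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7), (-1.0, 0.2)]
text = [(1.0, 0.1), (0.2, 1.0)]

kept = [vision[0], vision[2]]          # keep 2 of the 6 vision tokens
text_cov = coverage(kept, text)        # "do they cover the question?"
vision_cov = coverage(kept, vision)    # "do they cover the whole image?"
score = text_cov + vision_cov          # combined multimodal objective
```

A subset that only chases the text would maximize `text_cov` alone; MMTok's idea is that the combined score rewards subsets that also stand in for the rest of the picture.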

How It Works (The Magic Trick)

The paper describes a "greedy algorithm" (a step-by-step recipe that always makes the best available choice at each step).

  1. It looks at all the sticky notes.
  2. It picks the one that helps answer your question and represents the image best.
  3. It picks the next one that fills in the biggest gaps left by the first one.
  4. It repeats this until it has the perfect small team of tokens.

Because the coverage objective has a mathematical property called "submodularity" (each extra token adds less and less new coverage), this greedy recipe is guaranteed to land within a constant factor (roughly 1 − 1/e, about 63%) of the best possible selection, and it runs very quickly, without needing to retrain the AI model from scratch.
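The four steps above can be sketched as a greedy loop in plain Python. Again, the toy 2-D embeddings and cosine-similarity scoring are assumptions for illustration, not the paper's exact formulation: each round picks the vision token with the largest marginal gain, i.e. the one that most improves how well the kept set covers both the text tokens and the full image.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_select(vision_tokens, text_tokens, budget):
    """Greedily pick `budget` vision tokens that maximize coverage of
    both the text tokens and the full set of vision tokens."""
    targets = text_tokens + vision_tokens  # cover the question AND the image
    # best[j] = best similarity target j has to any token picked so far
    best = [0.0] * len(targets)
    selected = []
    for _ in range(budget):
        gains = []
        for i, tok in enumerate(vision_tokens):
            if i in selected:
                gains.append(-1.0)  # never re-pick a token
                continue
            # Marginal gain: how much this token improves each target's
            # best match, summed over all targets.
            gain = sum(max(0.0, cosine(t, tok) - best[j])
                       for j, t in enumerate(targets))
            gains.append(gain)
        pick = max(range(len(gains)), key=gains.__getitem__)
        selected.append(pick)
        best = [max(best[j], cosine(t, vision_tokens[pick]))
                for j, t in enumerate(targets)]
    return selected

# Toy 2-D embeddings (hypothetical): 6 vision tokens, 2 text tokens.
vision = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7), (-1.0, 0.2)]
text = [(1.0, 0.1), (0.2, 1.0)]
kept = greedy_select(vision, text, budget=2)
```

Note how the first pick tends to be a "central" token that overlaps many others, while later picks fill the gaps it left, exactly the step-3 behavior described above.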

The Results: Fast and Accurate

The researchers tested this on several famous AI models (like LLaVA and Qwen) and found:

  • Speed: They could cut the number of sticky notes from 2,880 down to just 64 (or even 4 in some cases) and the AI still understood the image almost perfectly.
  • Performance: On some tests, they retained 98.7% of the original performance while keeping only a tiny fraction of the vision tokens.
  • Efficiency: It made the AI run 1.87 times faster on a specific dataset (POPE).

The Takeaway

Think of MMTok as a filter that stops the AI from drowning in data. Instead of reading the whole encyclopedia to answer a simple question, it learns to read just the right few pages that contain the answer and the necessary context.

By combining what you asked with what is visually important, MMTok makes AI vision faster, cheaper, and just as smart, without needing to teach the AI anything new. It's the difference between reading a whole book to find a quote versus having a smart index that points you to the exact page instantly.