iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

iLLaVA is a novel approach that delivers end-to-end acceleration of Large Vision-Language Models. It jointly optimizes the image encoder and the LLM with a token merging strategy that recycles discarded information, yielding significant throughput gains and lower latency while maintaining or improving accuracy.

Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng

Published 2026-03-10

Imagine you have a brilliant, super-smart assistant (the Large Multimodal Model) who is incredibly good at answering questions about pictures and videos. However, this assistant has a major problem: they are slow and expensive to run.

Here's why: When you show them a photo, the computer first breaks the image down into thousands of tiny puzzle pieces (called tokens). It then sends all of these pieces to the assistant to read. Even if 90% of those pieces are just blue sky or a blank wall, the assistant still has to process every single one. It's like hiring a team of 100 people to read a 1,000-page book, even though the story only happens on 10 pages.
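To make "thousands of puzzle pieces" concrete, here is a tiny sketch of how a typical vision encoder turns an image into tokens. The specific numbers (a 336×336 image cut into 14×14 patches) match common CLIP-style encoders like the one LLaVA uses, but they are illustrative, not taken from this paper:

```python
# Illustrative sketch: each non-overlapping patch of the image
# becomes one token that the language model must process.
def num_image_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Count the tokens produced for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side

print(num_image_tokens())  # 576 tokens for one 336x336 image
```

Every one of those 576 tokens goes through every layer of the model, whether it shows a face or a patch of blank sky, which is exactly the waste iLLaVA targets.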

The paper introduces a new method called iLLaVA that solves this by making the assistant faster without making them dumber. Here is how it works, using simple analogies:

1. The Old Way: Cutting the Book

Previous methods tried to speed things up by simply throwing away the boring pages (tokens) before they reached the assistant.

  • The Problem: Imagine you are reading a mystery novel. If you just rip out the pages with the fewest words, you might accidentally throw away the page where the detective finds the crucial clue. The assistant gets confused because it missed important details.
  • The Bottleneck: These old methods also ignored the "camera" (the Image Encoder) that takes the photo and turns it into puzzle pieces. The camera was doing a lot of heavy lifting, but no one was trying to make that part faster.

2. The iLLaVA Solution: The Smart Editor

iLLaVA is like hiring a super-smart editor who works at two different stages of the process.

Stage A: The Camera (Image Encoder)

Instead of just taking the photo and dumping all the puzzle pieces on the table, iLLaVA starts pruning and merging pieces inside the encoder itself, while the photo is still being processed.

  • The Analogy: Imagine a security guard at a museum. Instead of letting every visitor (token) into the VIP room, the guard quickly spots the people who are just looking at the walls (redundant info) and tells them to wait outside. But, crucially, the guard doesn't just kick them out; they take a quick note of what those people were looking at.
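The "security guard" can be sketched in a few lines: rank tokens by how much attention they receive and keep only the top-k. This is a rough illustration of the idea, not the paper's actual code; the attention scores here are randomly generated stand-ins for the real attention maps inside the encoder:

```python
import numpy as np

def select_tokens(tokens: np.ndarray, attn_scores: np.ndarray, keep: int):
    """Split tokens into kept (important) and dropped (redundant) sets
    based on how much attention each token receives."""
    order = np.argsort(attn_scores)[::-1]      # most-attended tokens first
    kept_idx, dropped_idx = order[:keep], order[keep:]
    return tokens[kept_idx], tokens[dropped_idx]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))            # 576 tokens, 64-dim features
scores = rng.random(576)                       # stand-in attention scores
kept, dropped = select_tokens(tokens, scores, keep=192)
print(kept.shape, dropped.shape)               # (192, 64) (384, 64)
```

Note that the dropped tokens are returned rather than thrown away; as the next stage shows, iLLaVA still has a use for them.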

Stage B: The Assistant (LLM)

The assistant receives fewer puzzle pieces, which makes them much faster. But iLLaVA doesn't stop there. It also applies the same "editor" logic inside the assistant's brain.

3. The Secret Sauce: "Recycling" Information

This is the most creative part of the paper. When the system decides to remove a token (a puzzle piece) because it seems unimportant, it doesn't just delete it.

  • The Analogy: Think of it like a group project. If a team member is quiet and seems to have nothing to say, the old way would fire them. iLLaVA says, "Wait, let's ask them to summarize what the other quiet people were thinking."
  • How it works: The system takes the "boring" tokens and merges them into a few "super-tokens." It condenses the useful bits of information from the discarded pieces and packs them into a representative token. So, the assistant still gets the essence of the information, just in a much smaller package.
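The merging step above can be sketched as folding each dropped token into its most similar kept token via an average. This is a minimal illustration of the recycling idea, assuming cosine similarity for matching and a plain running average for merging; the function name and weighting scheme are mine, not the paper's:

```python
import numpy as np

def merge_dropped(kept: np.ndarray, dropped: np.ndarray) -> np.ndarray:
    """Fold every dropped token into its nearest kept token, turning
    kept tokens into averaged 'super-tokens'."""
    merged = kept.copy()
    counts = np.ones(len(kept))                # each kept token starts as itself
    # cosine similarity between dropped and kept tokens
    k = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    d = dropped / np.linalg.norm(dropped, axis=1, keepdims=True)
    nearest = (d @ k.T).argmax(axis=1)         # best-matching kept token
    for tok, idx in zip(dropped, nearest):
        merged[idx] += tok                     # accumulate the dropped info
        counts[idx] += 1
    return merged / counts[:, None]            # average into a super-token

kept = np.array([[1.0, 0.0], [0.0, 1.0]])
dropped = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
print(merge_dropped(kept, dropped))
```

The assistant ends up with the same small number of tokens as a pruning method, but each surviving token carries a condensed summary of the neighbors it absorbed.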

Why is this a Big Deal?

The paper shows three amazing results:

  1. Speed: It roughly doubles throughput (answers processed per second) and cuts prefill latency, the delay before the model starts responding, by about 4×.
  2. Smarter than Bigger: Usually, a bigger model is smarter but slower. iLLaVA is so efficient that a large model (like a 26B parameter model) can run faster and be smarter than a small model (like an 8B parameter model). It's like a Ferrari running on a bicycle's fuel tank.
  3. No Brain Damage: Even when they throw away 88% of the visual tokens, the model still retains about 95% of its original accuracy. It didn't lose its memory; it just stopped reading the fluff.

The Bottom Line

iLLaVA is a new way to run AI vision models that stops wasting time on empty space. Instead of blindly deleting parts of an image, it intelligently summarizes the boring parts and keeps the important parts. It speeds up the "camera" and the "brain" simultaneously, allowing powerful AI to run on regular computers without losing its smarts.

In short: It turns a slow, bloated AI into a lean, mean, information-processing machine by teaching it how to ignore the noise without losing the signal.