iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

iLLaVA is a novel approach that delivers end-to-end acceleration of Large Vision-Language Models. It jointly optimizes the image encoder and the LLM with a token merging strategy that recycles discarded information, yielding significant throughput gains and lower latency while maintaining or improving accuracy.

Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng

Published 2026-03-10

Imagine you have a brilliant, super-smart assistant (the Large Multimodal Model) who is incredibly good at answering questions about pictures and videos. However, this assistant has a major problem: they are slow and expensive to run.

Here's why: When you show them a photo, the computer first breaks the image down into thousands of tiny puzzle pieces (called tokens). It then sends all of these pieces to the assistant to read. Even if 90% of those pieces are just blue sky or a blank wall, the assistant still has to process every single one. It's like hiring a team of 100 people to read a 1,000-page book, even though the story only happens on 10 pages.
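To make "thousands of puzzle pieces" concrete, here is a tiny sketch of how a typical vision encoder turns an image into tokens. The specific numbers (a 336×336 image cut into 14×14 patches) match common CLIP-style encoders like the one LLaVA uses, but they are illustrative, not taken from this paper:

```python
# Illustrative sketch: each non-overlapping patch of the image
# becomes one token that the language model must process.
def num_image_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Count the tokens produced for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side

print(num_image_tokens())  # 576 tokens for one 336x336 image
```

Every one of those 576 tokens goes through every layer of the model, whether it shows a face or a patch of blank sky, which is exactly the waste iLLaVA targets.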

The paper introduces a new method called iLLaVA that solves this by making the assistant faster without making them dumber. Here is how it works, using simple analogies:

1. The Old Way: Cutting the Book

Previous methods tried to speed things up by simply throwing away the boring pages (tokens) before they reached the assistant.

  • The Problem: Imagine you are reading a mystery novel. If you just rip out the pages with the fewest words, you might accidentally throw away the page where the detective finds the crucial clue. The assistant gets confused because it missed important details.
  • The Bottleneck: These old methods also ignored the "camera" (the Image Encoder) that takes the photo and turns it into puzzle pieces. The camera was doing a lot of heavy lifting, but no one was trying to make that part faster.

2. The iLLaVA Solution: The Smart Editor

iLLaVA is like hiring a super-smart editor who works at two different stages of the process.

Stage A: The Camera (Image Encoder)

Instead of just taking the photo and dumping all the puzzle pieces on the table, iLLaVA starts pruning and merging pieces inside the encoder itself, while the photo is still being processed.

  • The Analogy: Imagine a security guard at a museum. Instead of letting every visitor (token) into the VIP room, the guard quickly spots the people who are just looking at the walls (redundant info) and tells them to wait outside. But, crucially, the guard doesn't just kick them out; they take a quick note of what those people were looking at.
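The "security guard" can be sketched in a few lines: rank tokens by how much attention they receive and keep only the top-k. This is a rough illustration of the idea, not the paper's actual code; the attention scores here are randomly generated stand-ins for the real attention maps inside the encoder:

```python
import numpy as np

def select_tokens(tokens: np.ndarray, attn_scores: np.ndarray, keep: int):
    """Split tokens into kept (important) and dropped (redundant) sets
    based on how much attention each token receives."""
    order = np.argsort(attn_scores)[::-1]      # most-attended tokens first
    kept_idx, dropped_idx = order[:keep], order[keep:]
    return tokens[kept_idx], tokens[dropped_idx]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))            # 576 tokens, 64-dim features
scores = rng.random(576)                       # stand-in attention scores
kept, dropped = select_tokens(tokens, scores, keep=192)
print(kept.shape, dropped.shape)               # (192, 64) (384, 64)
```

Note that the dropped tokens are returned rather than thrown away; as the next stage shows, iLLaVA still has a use for them.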

Stage B: The Assistant (LLM)

The assistant receives fewer puzzle pieces, which makes them much faster. But iLLaVA doesn't stop there. It also applies the same "editor" logic inside the assistant's brain.

3. The Secret Sauce: "Recycling" Information

This is the most creative part of the paper. When the system decides to remove a token (a puzzle piece) because it seems unimportant, it doesn't just delete it.

  • The Analogy: Think of it like a group project. If a team member is quiet and seems to have nothing to say, the old way would fire them. iLLaVA says, "Wait, let's ask them to summarize what the other quiet people were thinking."
  • How it works: The system takes the "boring" tokens and merges them into a few "super-tokens." It condenses the useful bits of information from the discarded pieces and packs them into a representative token. So, the assistant still gets the essence of the information, just in a much smaller package.
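The merging step above can be sketched as folding each dropped token into its most similar kept token via an average. This is a minimal illustration of the recycling idea, assuming cosine similarity for matching and a plain running average for merging; the function name and weighting scheme are mine, not the paper's:

```python
import numpy as np

def merge_dropped(kept: np.ndarray, dropped: np.ndarray) -> np.ndarray:
    """Fold every dropped token into its nearest kept token, turning
    kept tokens into averaged 'super-tokens'."""
    merged = kept.copy()
    counts = np.ones(len(kept))                # each kept token starts as itself
    # cosine similarity between dropped and kept tokens
    k = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    d = dropped / np.linalg.norm(dropped, axis=1, keepdims=True)
    nearest = (d @ k.T).argmax(axis=1)         # best-matching kept token
    for tok, idx in zip(dropped, nearest):
        merged[idx] += tok                     # accumulate the dropped info
        counts[idx] += 1
    return merged / counts[:, None]            # average into a super-token

kept = np.array([[1.0, 0.0], [0.0, 1.0]])
dropped = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
print(merge_dropped(kept, dropped))
```

The assistant ends up with the same small number of tokens as a pruning method, but each surviving token carries a condensed summary of the neighbors it absorbed.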

Why is this a Big Deal?

The paper shows three amazing results:

  1. Speed: It roughly doubles throughput (answers processed per second) and cuts prefill latency, the delay before the model starts responding, by about 4×.
  2. Smarter than Bigger: Usually, a bigger model is smarter but slower. iLLaVA is so efficient that a large model (like a 26B parameter model) can run faster and be smarter than a small model (like an 8B parameter model). It's like a Ferrari running on a bicycle's fuel tank.
  3. No Brain Damage: Even when they throw away 88% of the visual tokens, the model still retains about 95% of its original accuracy. It didn't lose its memory; it just stopped reading the fluff.

The Bottom Line

iLLaVA is a new way to run AI vision models that stops wasting time on empty space. Instead of blindly deleting parts of an image, it intelligently summarizes the boring parts and keeps the important parts. It speeds up the "camera" and the "brain" simultaneously, allowing powerful AI to run on regular computers without losing its smarts.

In short: It turns a slow, bloated AI into a lean, mean, information-processing machine by teaching it how to ignore the noise without losing the signal.