Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

This paper proposes a lightweight token pruning framework that filters non-informative background regions and refines fragmented text areas in document images to significantly reduce computational costs in vision-language models while maintaining comparable accuracy.

Jaemin Son, Sujin Choi, Inyong Yun

Published 2026-03-05

Imagine you have a massive, high-resolution photo of a messy desk covered in papers, coffee stains, and scattered office supplies. You want a super-smart AI assistant to read a specific contract on that desk and pull out the "Lease Year" and "Total Amount."

Currently, the AI tries to look at every single square inch of that photo, even the empty white space, the coffee cup, and the dust motes. It's like hiring a team of 1,000 detectives to search a warehouse, but 900 of them are just staring at the empty floor. It's incredibly slow and expensive, even though the AI only needs to read the text on the paper.

This paper proposes a clever, lightweight solution to fix that. Here is the breakdown in simple terms:

1. The Problem: The AI is Overworked

Modern AI models (called Vision-Language Models) are great at reading documents, but they are "gluttons" for data. They process every pixel of an image, even the boring background. This makes them slow and energy-hungry, like a sports car trying to drive through a traffic jam of empty space.

2. The Solution: The "Smart Bouncer"

The authors created a tiny, super-fast "bouncer" that stands at the entrance of the AI's brain.

  • How it works: Before the main AI even looks at the image, this bouncer scans the picture in tiny squares (patches).
  • The Decision: If a square has text, the bouncer says, "Keep this!" If it's just a white margin or a coffee stain, the bouncer says, "Trash it!"
  • The Result: The main AI only has to process the important text parts. This cuts the work (computational power) by 40% to 60%.
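The bouncer's keep-or-trash decision can be sketched in a few lines. The paper's actual filter is a learned lightweight module; here, purely for illustration, local intensity variance stands in as the "does this patch contain text?" signal (the patch size and threshold are invented for the example):

```python
import numpy as np

def prune_patches(image, patch=16, thresh=20.0):
    """Score each patch by intensity variance (a cheap stand-in for the
    paper's lightweight text detector) and keep only high-score patches."""
    H, W = image.shape
    keep = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            tile = image[r:r + patch, c:c + patch]
            if tile.std() > thresh:  # textured patch: likely contains text
                keep.append((r // patch, c // patch))
    return keep

# Synthetic page: mostly blank, with one high-contrast "text" band
page = np.full((64, 64), 255.0)
rng = np.random.default_rng(0)
page[16:32, 16:48] = rng.choice([0.0, 255.0], size=(16, 32))

kept = prune_patches(page)
print(kept)  # only the patches covering the text band survive
```

Of the 16 patches in this toy page, only 2 are forwarded to the main model; the blank margins never cost a single transformer FLOP.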

3. The Secret Sauce: "Don't Lose Your Place"

Here is the tricky part. If you just throw away the background, the remaining text pieces might get shuffled around or lose their order.

  • The Analogy: Imagine you have a jigsaw puzzle. If you throw away the blue sky pieces and just hand the remaining pieces to a friend, they might try to put the puzzle together by guessing where the pieces go. They might put the "Lease Year" next to the "Coffee Cup" because they lost the map.
  • The Fix: The authors' method is special because it preserves the original coordinates. Even though they throw away the background, they tell the AI, "This text piece is still at position #45, and this one is at #46." The AI doesn't have to guess where things go; it keeps the original map. This is crucial for document understanding, where a word's position on the page matters just as much as what the word says.
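The "don't lose your place" idea amounts to carrying each surviving token's original index along with it, instead of renumbering the compacted sequence from scratch. A toy sketch (the token strings are invented for illustration):

```python
# Token sequence as the vision encoder produced it; "<bg>" marks
# background patches the bouncer decided to throw away.
tokens = ["<bg>", "Lease", "Year", "<bg>", "2024", "<bg>", "Total", "$500"]
keep_mask = [t != "<bg>" for t in tokens]

# Naive pruning: compacts the survivors and implicitly renumbers them
# 0..k-1, so their position embeddings no longer match the page layout.
naive = [t for t, k in zip(tokens, keep_mask) if k]

# Index-preserving pruning: each survivor keeps its ORIGINAL index,
# so positional information about the document layout is intact.
preserved = [(i, t) for i, (t, k) in enumerate(zip(tokens, keep_mask)) if k]
print(preserved)
```

In the preserved version the model still knows that "2024" sat two slots after "Year", with a gap in between, even though the background tokens themselves are gone.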

4. The "Safety Net" (Max-Pooling)

Sometimes, the "bouncer" is a little too strict and accidentally throws away a tiny piece of a word (like the top of a letter 'T').

  • The Fix: The authors added a "safety net" step called Max-Pooling. Imagine the bouncer draws a box around the text. If the box is a little too small, the safety net expands the box slightly to grab any missed edges. This ensures that if a word is split, the AI still sees it as one whole word.
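This safety net can be pictured as a max-pool with stride 1 over the binary keep-mask: any patch adjacent to a kept patch gets kept too, so a character split across a patch boundary is not half-discarded. A minimal sketch (the 3×3 window is an assumption for the example, not a detail taken from the paper):

```python
import numpy as np

def dilate_keep_mask(mask):
    """3x3 max-pool, stride 1, over a binary keep-mask: any patch
    touching a kept patch is also kept -- the 'safety net' step."""
    H, W = mask.shape
    padded = np.pad(mask, 1)  # zero-pad so edges get full 3x3 windows
    out = np.zeros_like(mask)
    for r in range(H):
        for c in range(W):
            out[r, c] = padded[r:r + 3, c:c + 3].max()
    return out

# One kept patch in the middle of an otherwise-pruned region...
mask = np.array([[0, 0, 0, 0, 0],
                 [0, 1, 0, 0, 0],
                 [0, 0, 0, 0, 0]])

# ...after dilation, its eight neighbours are rescued as well.
expanded = dilate_keep_mask(mask)
print(expanded)
```

The single kept patch grows into a 3×3 block of kept patches, which is exactly the "expand the box slightly" behaviour described above.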

The Results: Fast, Cheap, and Accurate

When they tested this on real-world documents (like receipts and legal contracts):

  • Speed: The AI became much faster, using less than half the computing power.
  • Accuracy: It didn't get "dumb." Because they kept the original positions (indices) and used the safety net, the AI still understood the documents almost perfectly.
  • Comparison: Other methods tried to shuffle pieces around to save space, but that confused the AI, leading to bad results. This method kept the pieces in their original seats, which worked much better.

In a nutshell: They built a smart filter that throws away the "boring background" before the AI starts working, but they made sure to keep a detailed map so the AI never loses its place. This makes reading documents fast, cheap, and just as smart as before.