Imagine you have a very smart, but slightly slow, assistant (a Multimodal Large Language Model, or MLLM) who is trying to solve a puzzle. You give this assistant a giant photo (the visual input) and a question (the text input).
The problem is that the photo is made up of thousands of tiny puzzle pieces (called tokens). To understand the photo, the assistant has to look at every single piece, one by one, and compare it to every other piece. This is like trying to read a book by comparing every letter to every other letter on the page—it takes forever and uses up a massive amount of energy.
Current methods try to speed this up by just throwing away some puzzle pieces early on. But the authors of HiDrop realized that these methods are throwing away the wrong pieces at the wrong times. They are like a chef who throws away the fresh vegetables before they've even been chopped, or keeps stirring a pot long after the soup is done.
Here is how HiDrop fixes this, using three simple ideas:
1. The "Late Arrival" Strategy (Late Injection)
The Problem: Imagine you are in a meeting. The first few minutes are just people sitting down, checking their phones, and getting coffee. The actual work doesn't start until everyone is settled.
The Old Way: Current models try to process the "visual puzzle pieces" from the very first second of the meeting, even though no one is listening yet. It's a waste of time.
The HiDrop Fix: HiDrop says, "Let the assistant ignore the photo completely until the meeting actually starts." It waits until the very moment the "work" begins (the middle layers of the model) before bringing the photo in. This saves a huge amount of energy because the assistant isn't wasting time looking at the photo while it's just "getting coffee."
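The idea above can be sketched in a few lines of Python. Everything here—the function name, the toy "layers," and the injection point—is an illustrative assumption, not HiDrop's actual implementation: the point is just that visual tokens skip the early layers entirely and join the sequence partway through.

```python
# Hypothetical sketch of "late injection": visual tokens are not processed by
# the early transformer layers at all; they join the sequence only at a chosen
# middle layer. The injection layer here is a made-up hyperparameter.

def forward(text_tokens, visual_tokens, layers, inject_at):
    """Run text-only through the early layers, then inject visual tokens."""
    hidden = list(text_tokens)  # early layers see only the text
    for i, layer in enumerate(layers):
        if i == inject_at:
            # The photo arrives only once the "meeting" has actually started.
            hidden = list(visual_tokens) + hidden
        hidden = [layer(h) for h in hidden]
    return hidden

# Toy demo: 32 "layers" that each add 1; inject 100 visual tokens at layer 16.
layers = [lambda h: h + 1 for _ in range(32)]
out = forward([0] * 8, [0] * 100, layers, inject_at=16)
# Visual tokens pass through only layers 16-31, so they accumulate 16, not 32.
print(out[0], out[-1])  # → 16 32 (a visual token vs. a text token)
```

Because the visual tokens skip half the layers, roughly half of their compute simply never happens—that is where the savings come from.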
2. The "Smart Pyramid" (Concave Pyramid Pruning)
The Problem: Once the meeting starts, the team needs to look at the photo. But looking at every single puzzle piece is still too slow. Old methods say, "Let's throw away 10% of the pieces every 5 minutes." This is too rigid. Sometimes you need to throw away a lot quickly; sometimes you need to be careful.
The HiDrop Fix: HiDrop uses a "Smart Pyramid" approach.
- Early in the meeting: The team realizes, "Wow, 90% of these puzzle pieces are just blue sky or empty floor. They aren't important!" They quickly toss those away.
- Later in the meeting: As they get to the interesting parts (the faces, the objects), they slow down and only toss away the truly useless pieces.
- The Result: They keep the "good" pieces and dump the "bad" ones much faster and more intelligently than before, like a funnel that gets narrower exactly where it needs to.
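The pruning schedule described above can be sketched as follows. The exact curve HiDrop uses may differ; this hypothetical `concave_keep_ratios` function only illustrates the shape—the fraction of tokens kept drops steeply in the first pruning stages and flattens out later, and at each stage the lowest-importance tokens are the ones tossed.

```python
# Minimal sketch of a non-uniform pruning schedule: aggressive early, gentle
# late. Function names and the sqrt-shaped curve are illustrative assumptions.

def concave_keep_ratios(num_stages, final_keep=0.1):
    """Keep-ratio per stage: steep drop at first, gentle drop later."""
    ratios = []
    for s in range(1, num_stages + 1):
        t = s / num_stages
        # sqrt(t) rises fast near 0, so the keep-ratio falls fast early on
        ratios.append(1.0 - (1.0 - final_keep) * t ** 0.5)
    return ratios

def prune(scored_tokens, keep_ratio):
    """Keep the highest-scoring fraction of tokens (score = importance)."""
    k = max(1, int(len(scored_tokens) * keep_ratio))
    return sorted(scored_tokens, key=lambda ts: -ts[1])[:k]

ratios = concave_keep_ratios(4)
print([round(r, 2) for r in ratios])  # big first drop, small later drops
print(prune([("sky", 0.1), ("face", 0.9), ("floor", 0.2)], 0.34))
```

Contrast this with the rigid "drop 10% every 5 minutes" baseline: a fixed-rate schedule wastes careful attention on obvious background early, then prunes too bluntly once only informative tokens remain.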
3. The "Stop Sign" (Early Exit)
The Problem: Imagine the team has figured out the puzzle. They know the answer. But they keep staring at the photo for another hour just because the meeting schedule says so.
The Old Way: The model keeps processing the photo until the very end, even when the answer is already obvious.
The HiDrop Fix: HiDrop has a "Stop Sign." Once the team has combined the photo and the question to form a clear idea (usually in the middle of the process), HiDrop says, "Great job! You don't need to look at the photo anymore." It throws the rest of the photo away and lets the assistant finish the job using only their memory of the photo. This is like leaving a party early once you've said your goodbyes.
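This "stop sign" can be sketched as a complement to late injection: past a chosen exit layer, all remaining visual tokens are simply dropped and only the text-side states continue. The exit layer here is a made-up hyperparameter, and the code is a sketch of the idea, not the paper's implementation.

```python
# Hedged sketch of an "early exit" for visual tokens: once the answer has
# "fused" into the text states, the remaining visual tokens are discarded.

def forward_with_exit(visual_tokens, text_tokens, layers, exit_layer):
    """Process visual + text tokens, dropping visual tokens at exit_layer."""
    hidden = list(visual_tokens) + list(text_tokens)  # visual first
    n_visual = len(visual_tokens)
    for i, layer in enumerate(layers):
        if i == exit_layer and n_visual > 0:
            hidden = hidden[n_visual:]  # leave the party: drop the photo
            n_visual = 0
        hidden = [layer(h) for h in hidden]
    return hidden

# Toy demo: 32 "layers" that each add 1; drop 100 visual tokens at layer 16.
layers = [lambda h: h + 1 for _ in range(32)]
out = forward_with_exit([0] * 100, [0] * 8, layers, exit_layer=16)
print(len(out), out[0])  # → 8 32: only text tokens reach the final layer
```

After the exit, the later layers run on 8 tokens instead of 108—the visual information survives only through what it already contributed to the text states, the "memory of the photo."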
The Secret Sauce: "Persistent Name Tags"
When you start throwing pieces away, it gets confusing. "Which piece was number 5? Is it still there?"
HiDrop gives every puzzle piece a permanent name tag (Positional Encoding) that never changes, even if the piece is moved or hidden. This ensures the assistant never gets lost or confused about where things are, even as the pile of pieces shrinks.
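The "name tag" idea reduces to one rule: when tokens are pruned, the survivors keep their original position ids instead of being renumbered. A minimal sketch, with illustrative token names:

```python
# Sketch of "persistent name tags": each token carries its ORIGINAL position
# id, and pruning never renumbers the survivors. Details are illustrative.

tokens = [("sky", 0), ("face", 1), ("floor", 2), ("dog", 3)]

def prune_keep_positions(tokens, keep_indices):
    """Drop tokens but preserve each survivor's original position id."""
    return [tokens[i] for i in sorted(keep_indices)]

survivors = prune_keep_positions(tokens, {1, 3})
print(survivors)  # → [('face', 1), ('dog', 3)] — ids 1 and 3 persist
```

If the survivors were instead renumbered 0, 1, the model's positional encoding would tell it "face" and "dog" are adjacent at the start of the image—exactly the spatial confusion the persistent ids avoid.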
The Result?
By using these three tricks, HiDrop is like turning a slow, heavy truck into a sleek sports car.
- Speed: It trains the model 1.7 times faster.
- Efficiency: It throws away up to 90% of the visual data (the puzzle pieces) with negligible loss in accuracy.
- Smarts: It understands when to look at the picture and when to stop, rather than just blindly processing everything.
In short, HiDrop teaches the AI to be lazy in the right places (ignoring the photo when it's not needed) and efficient in the right places (quickly filtering out the noise), making it faster, cheaper, and just as smart as before.