MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition

MIRAGE is an efficient runtime scheduling framework for multi-vector image retrieval that employs a novel hierarchical decomposition paradigm with automatic parameter configuration to significantly enhance retrieval accuracy while reducing computational costs by up to 3.5 times compared to existing systems.

Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Chenchen Liu, Xiang Chen

Published 2026-03-04

The Big Problem: Finding a Needle in a Haystack (That's Also a Puzzle)

Imagine you have a massive photo album of millions of pictures, and you want to find a specific one. You type a search query like: "Find the photo with a girl holding a bird, wearing a shirt with a button, and sitting on a chair."

The Old Way (Single Vector):
Think of this like taking a blurry, low-resolution snapshot of your entire memory of that scene and comparing it to a blurry snapshot of every photo in the album.

  • The Issue: It's too vague. The computer might find a photo of a girl with a bird, but she's wearing a red dress, not a buttoned shirt. Or it finds the chair, but no bird. Because it tries to summarize the whole image into one "big idea," it loses the tiny details. It's like trying to describe a complex painting by saying, "It's colorful."

The "Better" Way (Multi-Vector / MVR):
To fix this, researchers started breaking the search down. Instead of one big idea, they split your query into pieces: "Girl," "Bird," "Button," "Chair." They also cut the photos into many small puzzle pieces.

  • The Issue: This is much more accurate, but it's slow. Imagine trying to match every single puzzle piece of your query against every single puzzle piece of every photo in the album. If you have 25 pieces in the photo and 4 pieces in your query, that's 100 comparisons per photo. Multiply that by millions of photos, and your computer starts sweating. It's like hiring 100 detectives to check every single house in a city just to find one person.
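To make the arithmetic concrete, here is a minimal sketch of brute-force multi-vector scoring in plain Python. It assumes a common scoring rule in which each query piece keeps its best-matching photo piece and the results are summed. The summary above does not say which rule MIRAGE or earlier systems use, so treat `multi_vector_score` and the toy vectors as illustrative only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def multi_vector_score(query_vecs: List[Vector],
                       image_vecs: List[Vector]) -> Tuple[float, int]:
    """Score one photo and count how many piece-to-piece comparisons it took.

    Assumed rule: every query piece keeps its best-matching photo piece,
    and those best matches are added up.
    """
    comparisons = 0
    total = 0.0
    for q in query_vecs:
        best = float("-inf")
        for p in image_vecs:
            comparisons += 1
            best = max(best, dot(q, p))
        total += best
    return total, comparisons


# Toy data: 4 query pieces and 25 photo pieces, each a 2-dimensional vector.
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
image = [[i / 24.0, 1.0 - i / 24.0] for i in range(25)]

score, n = multi_vector_score(query, image)
print(n)  # 100
```

The count grows as (query pieces) x (photo pieces) x (number of photos), which is why the brute-force approach gets expensive on a large album.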

The Solution: MIRAGE (The Smart Librarian)

The authors created MIRAGE, a system that acts like a super-smart, efficient librarian. Instead of blindly checking every single detail, MIRAGE uses a hierarchical (layered) approach and runtime scheduling (making smart decisions while working) to speed things up without losing accuracy.

Here is how MIRAGE works, using four main tricks:

1. The "Zoom Lens" Strategy (Hierarchical Decomposition)

In the old "Multi-Vector" method, the computer had to decide on one size for cutting up the photos.

  • The Problem: If you cut the photo into tiny pieces, you see the "button" clearly, but you might miss the "girl." If you cut it into big chunks, you see the "girl," but the "button" gets lost in the noise. Picking the right size is a guess.
  • MIRAGE's Trick: MIRAGE doesn't pick just one size. It looks at the photo through multiple zoom levels at once.
    • Level 1 (Wide Angle): Looks at the whole image to find the general scene.
    • Level 2 (Medium Zoom): Looks at medium-sized chunks to find the "girl" and "chair."
    • Level 3 (Macro Zoom): Looks at tiny details to find the "button."
    • The Magic: It automatically picks the best "zoom level" for each part of your query. It matches "girl" with the medium zoom and "button" with the macro zoom. Each part of the query is compared at the scale where it shows up most clearly, which boosts accuracy.
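
The zoom-level idea can be sketched in a few lines. The code below assumes the photo has already been turned into piece vectors at three scales, and it picks, for each query part, the scale with the strongest match. The function name `pick_zoom_levels`, the level names, and all vectors are hypothetical; the paper's actual selection rule may differ.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pick_zoom_levels(query: Dict[str, Vector],
                     levels: Dict[str, List[Vector]]) -> Dict[str, str]:
    """For each query part, return the zoom level whose best piece matches it most."""
    choices = {}
    for name, q in query.items():
        best_level, best_score = None, float("-inf")
        for level, pieces in levels.items():
            score = max(dot(q, p) for p in pieces)
            if score > best_score:
                best_level, best_score = level, score
        choices[name] = best_level
    return choices


# Toy photo: one wide-angle vector, two medium chunks, three macro pieces.
levels = {
    "wide": [[0.6, 0.6, 0.1]],
    "medium": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],
    "macro": [[0.2, 0.0, 0.1], [0.0, 0.1, 0.95], [0.1, 0.1, 0.0]],
}
query = {"girl": [1.0, 0.0, 0.0], "button": [0.0, 0.0, 1.0]}

print(pick_zoom_levels(query, levels))  # {'girl': 'medium', 'button': 'macro'}
```

In this toy setup the large object matches best at the medium scale and the small one at the macro scale, mirroring the "girl" and "button" example above.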

2. The "Cut the Dead Weight" Strategy (Low-Similarity Tail Pruning)

Imagine you are looking for a specific person in a crowd.

  • The Old Way: You check every single person's face, even the ones who look nothing like your target, just to be sure.
  • MIRAGE's Trick: MIRAGE starts with a quick, blurry look (coarse zoom). If a photo looks really different from your search (e.g., it's a picture of a dog, not a girl), MIRAGE says, "Nope, that's not it," and stops checking that photo immediately. It doesn't waste time zooming in on the dog's nose. It only spends time on the photos that look promising. This saves a massive amount of computing power.
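
A rough sketch of this pruning step, assuming each photo already has a cheap coarse score: rank the photos by that score and keep only the top fraction for closer inspection. The `keep_ratio` parameter, the file names, and the scores are made up for illustration; the summary does not describe how MIRAGE actually decides where the low-similarity tail begins.

```python
import math
from typing import Dict, List


def prune_tail(coarse_scores: Dict[str, float], keep_ratio: float) -> List[str]:
    """Keep the best-scoring fraction of photos; drop the low-similarity tail."""
    ranked = sorted(coarse_scores, key=coarse_scores.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * keep_ratio))
    return ranked[:keep]


# Coarse (blurry-look) similarity of each photo to the query.
coarse_scores = {
    "girl_bird.jpg": 0.82,
    "girl_dress.jpg": 0.74,
    "chair.jpg": 0.40,
    "dog.jpg": 0.11,
}

survivors = prune_tail(coarse_scores, keep_ratio=0.5)
print(survivors)  # ['girl_bird.jpg', 'girl_dress.jpg']
```

Only the survivors would go on to the finer, more expensive zoom levels. A larger `keep_ratio` is safer but slower; a smaller one is faster but risks discarding the right photo.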

3. The "Stop When You're Sure" Strategy (Hierarchy Depth Optimization)

Sometimes, you don't need to look at the finest details to find what you're looking for.

  • The Old Way: Even if you found the "girl" and the "bird" clearly in the medium zoom, the computer would still force itself to check the tiny "button" details just to be 100% sure, even if the answer was already obvious.
  • MIRAGE's Trick: MIRAGE monitors its own confidence. As it zooms in, it asks, "Am I getting a better answer?" If the ranking of the best photos stops changing (it's stable), MIRAGE says, "Okay, I'm confident enough," and stops searching deeper. It saves time by not doing unnecessary work.
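
The stopping rule can be sketched as follows, assuming "stable" means the top-k list is identical at two consecutive zoom levels. The function `search_until_stable`, the `patience` parameter, and the toy rankings are hypothetical; the paper may measure stability differently.

```python
from typing import Iterable, List, Optional, Tuple


def search_until_stable(rankings: Iterable[List[str]], k: int,
                        patience: int = 1) -> Tuple[Optional[List[str]], int]:
    """Consume rankings from coarse to fine; stop once the top-k stops changing.

    Returns the final top-k list and the number of levels actually used.
    """
    prev = None
    stable = 0
    depth = 0
    for ranking in rankings:
        depth += 1
        top = ranking[:k]
        if top == prev:
            stable += 1
            if stable >= patience:
                return top, depth
        else:
            stable = 0
        prev = top
    return prev, depth


# Toy rankings for four zoom levels, coarse to fine.
all_rankings = [
    ["b", "a", "c", "d"],
    ["a", "b", "c", "d"],
    ["a", "b", "d", "c"],
    ["a", "b", "d", "c"],
]
computed = []  # records which levels were actually evaluated


def level_rankings():
    # A generator stands in for the expensive per-level scoring.
    for level, ranking in enumerate(all_rankings, start=1):
        computed.append(level)
        yield ranking


result = search_until_stable(level_rankings(), k=2)
print(result)    # (['a', 'b'], 3)
print(computed)  # [1, 2, 3]
```

Because the rankings are produced lazily, the fourth and finest level is never computed once the top two results agree at levels 2 and 3.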

4. The "Auto-Pilot" (Automated Configuration)

One of the hardest parts of these systems is tuning the settings (how many pieces to cut, how aggressive to be with pruning).

  • MIRAGE's Trick: Instead of a human guessing the settings, MIRAGE has a built-in "Auto-Pilot." It does a quick, lightweight test run on the specific dataset it's working with, picks settings that suit that data, and then runs the main job. Because it adapts to the data automatically, the same system can be used whether you are searching through 1,000 photos or 1 million.
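
One simple way to picture such an auto-pilot is a small search over candidate settings on a sample of the data, keeping the cheapest combination that still meets an accuracy target. This is a sketch of that general idea, not the paper's procedure: `auto_configure`, its parameters, and the numbers in `toy_results` are invented for illustration.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple


def auto_configure(evaluate: Callable[[int, float], Tuple[float, float]],
                   depths: Iterable[int],
                   keep_ratios: Iterable[float],
                   min_accuracy: float) -> Optional[Dict[str, float]]:
    """Try each (depth, keep_ratio) pair on a sample and keep the cheapest
    one whose accuracy reaches min_accuracy. Returns None if none qualifies."""
    best = None
    for depth, keep in product(depths, keep_ratios):
        accuracy, cost = evaluate(depth, keep)
        if accuracy < min_accuracy:
            continue
        if best is None or cost < best[0]:
            best = (cost, depth, keep)
    if best is None:
        return None
    return {"depth": best[1], "keep_ratio": best[2]}


# Invented (accuracy, cost) measurements standing in for a quick test run.
toy_results = {
    (2, 0.25): (0.80, 10), (2, 0.5): (0.84, 18),
    (3, 0.25): (0.88, 16), (3, 0.5): (0.91, 30),
}


def evaluate(depth: int, keep_ratio: float) -> Tuple[float, float]:
    return toy_results[(depth, keep_ratio)]


config = auto_configure(evaluate, depths=[2, 3], keep_ratios=[0.25, 0.5],
                        min_accuracy=0.85)
print(config)  # {'depth': 3, 'keep_ratio': 0.25}
```

In a real system, `evaluate` would run the retrieval pipeline on a small sample of the dataset rather than look up a table.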

The Result: Fast and Accurate

By combining these strategies, MIRAGE achieves two amazing things:

  1. It's Smarter: It finds the right photos much better than the old methods because it matches the right "zoom level" to the right object.
  2. It's Faster: It cuts out the boring, unnecessary work. The paper reports up to a 3.5x reduction in computational cost compared with existing systems, while also being more accurate.

In a nutshell:
If the old method was like a student frantically reading every page of every book in a library to find a quote, MIRAGE is like a librarian who quickly scans the table of contents, skips the irrelevant books entirely, and only reads the specific chapters that matter, all while automatically adjusting their reading speed based on how hard the search is.