MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition

The Big Problem: Finding a Needle in a Haystack (That's Also a Puzzle)

Imagine you have a massive photo album of millions of pictures, and you want to find a specific one. You type a search query like: "Find the photo with a girl holding a bird, wearing a shirt with a button, and sitting on a chair."

The Old Way (Single Vector):
Think of this like taking a blurry, low-resolution snapshot of your entire memory of that scene and comparing it to a blurry snapshot of every photo in the album.

The Issue: It's too vague. The computer might find a photo of a girl with a bird, but she's wearing a red dress, not a buttoned shirt. Or it finds the chair, but no bird. Because it tries to summarize the whole image into one "big idea," it loses the tiny details. It's like trying to describe a complex painting by saying, "It's colorful."

The "Better" Way (Multi-Vector / MVR):
To fix this, researchers started breaking the search down. Instead of one big idea, they split your query into pieces: "Girl," "Bird," "Button," "Chair." They also cut the photos into many small puzzle pieces.

The Issue: This is much more accurate, but it's slow. Imagine trying to match every single puzzle piece of your query against every single puzzle piece of every photo in the album. If you have 25 pieces in the photo and 4 pieces in your query, that's 100 comparisons per photo. Multiply that by millions of photos, and your computer starts sweating. It's like hiring 100 detectives to check every single house in a city just to find one person.
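The arithmetic above can be checked with a few lines of Python. This is only a counting sketch: the function name is made up for illustration, and one million photos stands in for "millions."

```python
def mvr_comparisons(num_photos: int, pieces_per_photo: int, query_pieces: int) -> int:
    """Count piece-vs-piece comparisons when every query piece is
    checked against every photo piece of every photo."""
    return num_photos * pieces_per_photo * query_pieces


# The example from the text: 25 photo pieces, 4 query pieces.
per_photo = mvr_comparisons(1, 25, 4)
# Assumed album size of one million photos (illustrative only).
whole_album = mvr_comparisons(1_000_000, 25, 4)

print(per_photo)    # 100
print(whole_album)  # 100000000
```

For comparison, the single-vector approach needs just one comparison per photo, which is why it is fast but coarse.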

The Solution: MIRAGE (The Smart Librarian)

The authors created MIRAGE, a system that acts like a super-smart, efficient librarian. Instead of blindly checking every single detail, MIRAGE uses a hierarchical (layered) approach and runtime scheduling (making smart decisions while working) to speed things up without losing accuracy.

Here is how MIRAGE works, using four main tricks:

1. The "Zoom Lens" Strategy (Hierarchical Decomposition)

In the old "Multi-Vector" method, the computer had to decide on one size for cutting up the photos.

The Problem: If you cut the photo into tiny pieces, you see the "button" clearly, but you might miss the "girl." If you cut it into big chunks, you see the "girl," but the "button" gets lost in the noise. Picking the right size is a guess.
MIRAGE's Trick: MIRAGE doesn't pick just one size. It looks at the photo through multiple zoom levels at once.
- Level 1 (Wide Angle): Looks at the whole image to find the general scene.
- Level 2 (Medium Zoom): Looks at medium-sized chunks to find the "girl" and "chair."
- Level 3 (Macro Zoom): Looks at tiny details to find the "button."
- The Magic: It automatically picks the best "zoom level" for each part of your query. It matches "girl" with the medium zoom and "button" with the macro zoom. Each part of the query is compared at the scale that fits it best, which boosts accuracy.

2. The "Cut the Dead Weight" Strategy (Low-Similarity Tail Pruning)

Imagine you are looking for a specific person in a crowd.

The Old Way: You check every single person's face, even the ones who look nothing like your target, just to be sure.
MIRAGE's Trick: MIRAGE starts with a quick, blurry look (coarse zoom). If a photo looks really different from your search (e.g., it's a picture of a dog, not a girl), MIRAGE says, "Nope, that's not it," and stops checking that photo immediately. It doesn't waste time zooming in on the dog's nose. It only spends time on the photos that look promising. This saves a massive amount of computing power.

3. The "Stop When You're Sure" Strategy (Hierarchy Depth Optimization)

Sometimes, you don't need to look at the finest details to find what you're looking for.

The Old Way: Even if you found the "girl" and the "bird" clearly in the medium zoom, the computer would still force itself to check the tiny "button" details just to be 100% sure, even if the answer was already obvious.
MIRAGE's Trick: MIRAGE monitors its own confidence. As it zooms in, it asks, "Am I getting a better answer?" If the ranking of the best photos stops changing (it's stable), MIRAGE says, "Okay, I'm confident enough," and stops searching deeper. It saves time by not doing unnecessary work.

4. The "Auto-Pilot" (Automated Configuration)

One of the hardest parts of these systems is tuning the settings (how many pieces to cut, how aggressive to be with pruning).

MIRAGE's Trick: Instead of a human guessing the settings, MIRAGE has a built-in "Auto-Pilot." It does a quick, lightweight test run on the specific dataset it's working with, picks settings that suit that data, and then runs the main job. It adapts to the data automatically, so it works well whether you are searching through 1,000 photos or 1 million.

The Result: Fast and Accurate

By combining these strategies, MIRAGE achieves two amazing things:

It's Smarter: It finds the right photos much better than the old methods because it matches the right "zoom level" to the right object.
It's Faster: It cuts out the boring, unnecessary work. The paper claims it is up to 3.5 times faster than the previous best system, while also being more accurate.

In a nutshell:
If the old method was like a student frantically reading every page of every book in a library to find a quote, MIRAGE is like a librarian who quickly scans the table of contents, skips the irrelevant books entirely, and only reads the specific chapters that matter, all while automatically adjusting their reading speed based on how hard the search is.

1. Problem Statement

Retrieval-Augmented Generation (RAG) is essential for Multimodal Large Language Models (MLLMs) to leverage user-specific data. However, existing retrieval methods face a trade-off between accuracy and efficiency:

Single-Vector Retrieval ("1 Mode"): Encodes an entire query and image into a single global vector. While efficient, it loses fine-grained object information, leading to poor accuracy on complex or semantically diverse images.
Multi-Vector Retrieval ("1+N Mode"): Decomposes queries into sub-queries and images into $N$ segments to match fine-grained objects. While more accurate, it suffers from:
- Sub-optimal Accuracy: Fixed decomposition granularity often misaligns with the varying scales of objects in an image (e.g., splitting a single object or merging unrelated regions).
- High Computational Cost: Matching $N$ sub-queries against $N$ image segments creates massive computational redundancy ($O(N^2)$ complexity), making it impractical for real-time deployment.
- Lack of Adaptability: Existing methods use static parameters that do not adapt to different datasets or query complexities.

2. Methodology: MIRAGE Framework

MIRAGE introduces a hierarchical decomposition framework combined with runtime scheduling to solve these issues. It operates on three main pillars:

A. Hierarchical Decomposition ("1+M+N Mode")

Instead of a single fixed granularity ($N$), MIRAGE constructs a hierarchy of image segmentations with varying granularities ($M$ levels, from coarse to fine).

Mechanism: For each sub-query, the system iterates through the hierarchy, computing similarity scores at each granularity level.
Aggregation: The final score for a sub-query is the maximum similarity found across all hierarchy levels. This allows the system to adaptively select the best-fitting segment size for each specific object (e.g., a "chair" might match best at a coarse level, while a "keyboard" matches best at a fine level).
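Mathematical Formulation: The scoring function extends the standard MVR formula to include a maximization over the hierarchy dimension $g$:
$\text{Score}(Q, D_i) = \text{SIM}(E(Q), E(D_i)) + \prod_{k=1}^{N_q} \max_{g=1}^{N_G} \left( \max_{j=1}^{N_g} \text{SIM}(E(q_k), E(D_{i,j}^g)) \right)$
Here $E$ is the embedding encoder, $q_k$ is the $k$-th of $N_q$ sub-queries, and $D_{i,j}^g$ is the $j$-th of $N_g$ segments of image $D_i$ at granularity level $g$, out of $N_G$ levels in total.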
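To make the formula concrete, here is a minimal Python sketch of the score for one image. It rests on assumptions not stated in the summary: SIM is taken to be cosine similarity, embeddings are precomputed non-zero vectors (so the encoder $E$ does not appear), and the names `cosine` and `hierarchical_score` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; stands in for SIM (an assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hierarchical_score(
    query_vec: Vector,
    image_vec: Vector,
    sub_query_vecs: List[Vector],
    levels: List[List[Vector]],
) -> float:
    """Global similarity plus the product, over sub-queries, of the best
    segment similarity found at any granularity level.
    levels[g] holds the segment embeddings of the image at level g."""
    global_term = cosine(query_vec, image_vec)
    fine_term = 1.0
    for q_k in sub_query_vecs:
        # max over levels g and over segments j within each level
        fine_term *= max(cosine(q_k, seg) for segs in levels for seg in segs)
    return global_term + fine_term


# Toy 2-D example: the coarse level blends both objects into one segment,
# the fine level separates them.
coarse = [[1.0, 1.0]]
fine = [[1.0, 0.0], [0.0, 1.0]]
sub_queries = [[1.0, 0.0], [0.0, 1.0]]

coarse_only = hierarchical_score([1.0, 0.0], [1.0, 0.0], sub_queries, [coarse])
with_hierarchy = hierarchical_score([1.0, 0.0], [1.0, 0.0], sub_queries, [coarse, fine])
print(round(coarse_only, 3), round(with_hierarchy, 3))  # 1.5 2.0
```

In the toy example, adding the finer level raises the score because each sub-query finds a segment that matches it exactly, which is the effect the maximization over $g$ is meant to capture.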

B. Runtime Scheduling & Redundancy Elimination

To counter the increased complexity of the "1+M+N" approach, MIRAGE exploits inherent redundancies in the retrieval process through three optimization mechanisms:

Low-Similarity Tail Pruning:
- Observation: Ground-truth images usually rank highly even at coarse granularities.
- Action: The system prunes low-scoring images from subsequent, finer-grained iterations. This avoids expensive fine-grained calculations for images that are already unlikely to be relevant.
Hierarchy Depth Optimization (Early Exit):
- Observation: Not all queries require the finest granularity; many converge at coarser levels.
- Action: The system monitors the stability of the Top-K ranking using Kendall's $\tau$ coefficient. If the ranking stabilizes between iterations (indicating convergence), the system exits the hierarchy early, skipping deeper levels.
Hollow Hierarchy Elimination:
- Observation: Adjacent granularities often capture the same objects, creating "hollow" (redundant) levels.
- Action: An offline algorithm removes these redundant intermediate granularities from the hierarchy set before runtime, reducing the total number of levels to traverse.
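The two runtime mechanisms above (tail pruning and early exit) can be sketched as a single coarse-to-fine loop. This is an illustrative sketch under stated assumptions, not the authors' implementation: a fixed `keep_ratio` replaces whatever pruning schedule the paper uses, the Top-K sets are compared only when they contain the same images, and `score_at`, `coarse_to_fine_search`, and the toy scores are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def kendall_tau(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Kendall's tau between two orderings of the same items (no ties)."""
    n = len(rank_a)
    if n < 2:
        return 1.0
    pos_b = {item: idx for idx, item in enumerate(rank_b)}
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # rank_a puts rank_a[i] ahead of rank_a[j]; does rank_b agree?
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def coarse_to_fine_search(
    image_ids: Sequence[int],
    num_levels: int,
    score_at: Callable[[int, int], float],
    top_k: int = 3,
    keep_ratio: float = 0.5,
    tau_threshold: float = 0.9,
) -> Tuple[List[int], int]:
    """Return (top-k image ids, number of hierarchy levels visited)."""
    candidates = list(image_ids)
    best: Dict[int, float] = {i: float("-inf") for i in candidates}
    prev_top: List[int] = []
    levels_used = 0
    for level in range(num_levels):
        levels_used += 1
        # Score only the surviving candidates at this granularity.
        for image_id in candidates:
            best[image_id] = max(best[image_id], score_at(level, image_id))
        candidates.sort(key=lambda image_id: best[image_id], reverse=True)
        top = candidates[:top_k]
        # Early exit: the Top-K ranking is stable across two levels.
        if prev_top and set(prev_top) == set(top) \
                and kendall_tau(prev_top, top) >= tau_threshold:
            break
        prev_top = top
        # Tail pruning: drop the low-scoring tail before the next level.
        keep = max(top_k, int(len(candidates) * keep_ratio))
        candidates = candidates[:keep]
    return candidates[:top_k], levels_used


# Toy scores per level (made-up values, not measurements).
toy_scores = [
    {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.4, 4: 0.3, 5: 0.2, 6: 0.1, 7: 0.05},
    {0: 0.95, 1: 0.85, 2: 0.75, 3: 0.5},
    {0: 0.96, 1: 0.86, 2: 0.76, 3: 0.5},
]
calls: List[Tuple[int, int]] = []


def toy_score_at(level: int, image_id: int) -> float:
    calls.append((level, image_id))
    return toy_scores[level][image_id]


top, levels_used = coarse_to_fine_search(range(8), 3, toy_score_at)
print(top, levels_used, len(calls))  # [0, 1, 2] 2 12
```

In this toy run the loop scores 8 images at the coarsest level, keeps 4, and exits after the second level because the Top-3 order is unchanged. That is 12 scoring calls instead of the 24 an exhaustive pass over all three levels would need.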

C. Automated Configuration

MIRAGE includes a latency-guided automated configuration algorithm.

It performs a lightweight profiling of the dataset to automatically tune parameters (initial pruning ratio, decay rate, early-exit threshold, and granularity stride).
This ensures the framework adapts to diverse datasets (e.g., CREPE vs. MSCOCO) without manual hyperparameter tuning, balancing the accuracy-efficiency trade-off.
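The summary does not spell out the configuration algorithm, so the following is only one plausible shape for a latency-guided search: profile a handful of candidate settings and keep the most accurate one that fits a latency budget. The `Config` fields mirror the four parameters listed above, but the field names, the `pick_config` function, the `profile` callback, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    prune_ratio: float     # initial pruning ratio
    decay_rate: float      # how the pruning ratio changes per level
    exit_threshold: float  # early-exit threshold on ranking stability
    stride: int            # granularity stride between hierarchy levels


def pick_config(
    candidates: Iterable[Config],
    profile: Callable[[Config], Tuple[float, float]],
    latency_budget_ms: float,
) -> Optional[Config]:
    """Return the most accurate candidate whose profiled latency fits the
    budget, or None if no candidate fits. `profile` is assumed to run a
    lightweight sample of queries and return (latency_ms, accuracy)."""
    best_cfg: Optional[Config] = None
    best_accuracy = float("-inf")
    for cfg in candidates:
        latency_ms, accuracy = profile(cfg)
        if latency_ms <= latency_budget_ms and accuracy > best_accuracy:
            best_cfg, best_accuracy = cfg, accuracy
    return best_cfg


# Placeholder profiling results (made-up numbers, not measurements).
aggressive = Config(prune_ratio=0.8, decay_rate=0.5, exit_threshold=0.8, stride=2)
balanced = Config(prune_ratio=0.5, decay_rate=0.7, exit_threshold=0.9, stride=2)
thorough = Config(prune_ratio=0.2, decay_rate=0.9, exit_threshold=0.99, stride=1)
fake_profile = {aggressive: (10.0, 0.70), balanced: (25.0, 0.78), thorough: (60.0, 0.80)}

chosen = pick_config(fake_profile, fake_profile.__getitem__, latency_budget_ms=30.0)
print(chosen == balanced)  # True
```

Changing the latency budget moves the choice along the accuracy-efficiency trade-off: a looser budget admits the thorough setting, a tighter one forces the aggressive setting.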

3. Key Contributions

Novel Hierarchical Paradigm: The first work to introduce a "1+M+N" mode in multimodal RAG, enabling adaptive alignment between queries and multi-scale image objects.
Systematic Redundancy Exploitation: A runtime scheduling framework that identifies and eliminates computational redundancy (tail pruning, early exit, and hollow elimination) without sacrificing retrieval quality.
End-to-End Automation: An integrated system that jointly optimizes algorithmic decomposition and computational scheduling, making fine-grained retrieval practical for real-world deployment.

4. Experimental Results

The authors evaluated MIRAGE on four datasets (CREPE, MSCOCO, NoCaps, Flickr) against a Vanilla single-vector baseline and a state-of-the-art Multi-Vector Retrieval (POQD) baseline.

Accuracy: MIRAGE significantly outperforms existing methods.
- It achieves up to 8 percentage points higher NDCG@10 than the Vanilla baseline and 2 percentage points higher than the POQD baseline.
- The hierarchical approach effectively resolves the granularity misalignment issue.
Efficiency: Despite the added complexity of the hierarchy, MIRAGE is highly efficient due to scheduling.
- It runs up to 3.5× faster than the POQD baseline, that is, it reduces computational cost by up to 3.5×, while maintaining superior accuracy.
Trade-off: The automated configuration allows MIRAGE to push the Pareto frontier, offering configurations that prioritize either maximum accuracy or maximum throughput depending on the deployment scenario.

5. Significance

MIRAGE represents a paradigm shift in multimodal retrieval. It moves away from static, one-size-fits-all decomposition toward a dynamic, adaptive scheduling approach.

Practicality: By solving the efficiency bottleneck of multi-vector retrieval, it makes high-accuracy, fine-grained image retrieval feasible for production MLLM applications.
Extensibility: The framework provides an extensible foundation for future multimodal RAG systems, demonstrating that algorithmic innovation (hierarchy) and system-level optimization (scheduling) can be co-designed to achieve both state-of-the-art accuracy and real-time performance.