Efficient Vector Search in the Wild: One Model for Multi-K Queries

The paper introduces OMEGA, a K-generalizable learned top-K search method. It leverages a base model trained on K=1 with trajectory-based features, plus a dynamic refinement procedure, to achieve high accuracy and low latency for multi-K vector queries while significantly reducing preprocessing time compared to state-of-the-art methods.

Yifan Peng, Jiafei Fan, Xingda Wei, Sijie Shen, Rong Chen, Jianning Wang, Xiaojian Luo, Wenyuan Yu, Jingren Zhou, Haibo Chen

Published Mon, 09 Ma

Imagine you are running a massive, high-speed library where millions of books (vectors) are stored. Your job is to find the "best match" for a user's request. Sometimes a user asks for just the one best book (K=1). Other times, they want a list of the top 100 best books (K=100).

In the past, finding these books was like searching a dark maze. You had to set a "search budget" (how many steps you take) before you started.

  • If you set the budget too low, you might miss the best book (low accuracy).
  • If you set it too high, you waste time walking around the library when you could have stopped earlier (high latency).
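The budget trade-off can be made concrete with a toy sketch. (This is illustrative only: a flat scan over 1-D "books", whereas real systems traverse graph or clustered indexes. All names here are made up.)

```python
import random

random.seed(0)

# Toy "library": 10,000 one-dimensional books (vectors), in random order.
library = [random.random() for _ in range(10_000)]
query = 0.5

def fixed_budget_search(budget):
    """Examine at most `budget` books; return the distance of the best one found."""
    best = min(library[:budget], key=lambda v: abs(v - query))
    return abs(best - query)

# A small budget risks missing the true best match (worse accuracy);
# a large budget examines the whole library (more latency, no surprises).
small_budget_err = fixed_budget_search(100)
large_budget_err = fixed_budget_search(10_000)
```

The larger budget can never do worse on accuracy, but you pay for every extra step whether or not it was needed; that is exactly the dilemma a learned stopping rule tries to escape.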

The Problem: The "One-Size-Fits-None" Model

Recently, scientists built a "Smart Librarian" (a machine learning model) who could look at the maze and say, "Stop! You've found the best book!" This saved a lot of time.

But there was a catch: This Smart Librarian was trained only to find the single best book.

  • If you asked for the Top 1, the librarian was perfect.
  • If you asked for the Top 10, the librarian got confused. It would stop too early because it was used to finding just one, leaving you with a bad list.
  • If you asked for the Top 1, but the librarian was trained on "Top 100," it would walk the whole maze unnecessarily, wasting time.

To fix this, other researchers tried training a different librarian for every possible number (one for Top 1, one for Top 10, one for Top 100...).
The downside? Training all these librarians takes forever and costs a fortune in computing power. It's like hiring 50 different specialists just to answer simple questions.

The Solution: OMEGA (The "Master Detective")

The authors of this paper introduce OMEGA, a new system that solves this with a clever trick.

1. The "One Model to Rule Them All"

Instead of hiring 50 librarians, OMEGA hires one Master Detective who is an expert at finding the single best item (K=1).

  • The Magic Trick: If you need the Top 5, OMEGA doesn't ask for a new expert. It asks the Master Detective to find the #1 book. Then, it says, "Okay, ignore that book. Now, find the best book among the remaining ones."
  • It repeats this process 5 times.
  • Why it works: The paper discovered that the pattern of how the detective gets closer to the target (the "distance trajectory") looks the same whether they are looking for the #1 book or the #50 book. So, the same detective can do all the jobs if you just tell them to ignore the ones they already found.
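The "find one, exclude it, repeat" loop above can be sketched in a few lines. (A hedged illustration, not the paper's implementation: here the K=1 "expert" is a plain minimum over toy 1-D points, and `top_k_by_repeated_best` is a name invented for this sketch.)

```python
def top_k_by_repeated_best(vectors, query, k, dist):
    """Reuse a K=1 best-match routine K times, excluding each
    winner before the next round (illustrative sketch)."""
    found = []
    remaining = list(vectors)
    for _ in range(k):
        best = min(remaining, key=lambda v: dist(v, query))  # the K=1 "expert"
        found.append(best)
        remaining.remove(best)  # "ignore that book" on the next round
    return found

# Toy example: 1-D "vectors", distance is plain absolute difference.
top3 = top_k_by_repeated_best([0.9, 0.1, 0.4, 0.7, 0.3], 0.5, 3,
                              lambda v, q: abs(v - q))
```

Each round is a self-contained K=1 problem, which is why a single model trained only on K=1 trajectories can serve every K.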

2. The "Crystal Ball" (Statistical Forecasting)

There was still a problem: Asking the detective to find the Top 100 one by one takes 100 questions. That's too many questions!

  • The Fix: OMEGA uses a "Crystal Ball" (statistics).
  • Once the detective has found the first 20 books, the Crystal Ball can predict: "Based on the pattern, there's a 95% chance the remaining 80 books are already in your list. You don't need to ask the detective anymore!"
  • This allows OMEGA to stop asking questions early, saving massive amounts of time.
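The early-stopping idea can be sketched as well. (Illustrative only: the paper uses a proper statistical forecast over distance trajectories, whereas this toy version just stops after a few consecutive confirmations agree with the candidate ordering it already has; every name below is an assumption.)

```python
def top_k_with_forecast(pool, query, k, dist, patience=3):
    """Confirm results one at a time with a K=1 routine; once `patience`
    consecutive confirmations merely agreed with the existing candidate
    ordering, trust that ordering for the remaining slots and stop."""
    # Stand-in for the candidate ordering produced by the search phase.
    ordered = sorted(pool, key=lambda v: dist(v, query))
    confirmed, streak = [], 0
    remaining = list(pool)
    for i in range(k):
        best = min(remaining, key=lambda v: dist(v, query))  # one "question" to the detective
        confirmed.append(best)
        remaining.remove(best)
        streak = streak + 1 if best == ordered[i] else 0
        if streak >= patience:               # confident enough: stop asking
            confirmed += ordered[i + 1 : k]  # take the rest from the existing order
            break
    return confirmed

# Toy run: 1-D points, query at 0; top-5 answered with early stopping.
top5 = top_k_with_forecast([5, 1, 9, 3, 7, 2], 0, 5, lambda v, q: abs(v - q))
```

In this run, only three "questions" are asked before the remaining two slots are filled from the existing ordering, which is the essence of the time saving.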

The Real-World Impact

The authors tested this in a real-world scenario (Alibaba's cloud database), which handles millions of queries with different "Top K" requests every day.

  • Speed: OMEGA is 6% to 33% faster than the current best methods.
  • Cost: It requires only 16% to 30% of the preparation time (training cost) compared to other methods.
  • Accuracy: It hits the same high accuracy targets as the others.

The Analogy Summary

Think of it like a GPS navigation system:

  • Old Way: You have a different GPS map for every trip length. If you want a 1-mile trip, you load the "1-mile map." If you want a 100-mile trip, you load the "100-mile map." Loading all these maps takes forever.
  • The "Bad" Smart GPS: You have one GPS that only knows how to find the nearest gas station. If you ask for the "Top 5 gas stations," it gets lost or takes too long.
  • OMEGA: You have one GPS that is an expert at finding the nearest gas station.
    1. It finds the nearest one.
    2. It marks it as "visited."
    3. It instantly finds the next nearest one from the remaining list.
    4. After finding a few, it uses a statistical guess to say, "I'm 99% sure the next 95 stations are right here, so let's just stop and give you the list."

Result: You get the perfect list of gas stations, instantly, without needing to download a million different maps.