Amortizing Maximum Inner Product Search with Learned Support Functions

This paper proposes "amortized MIPS," a learning-based framework that exploits the mathematical properties of support functions to train neural networks (SupportNet and KeyNet) that directly predict the optimal key for Maximum Inner Product Search, amortizing the per-query cost when queries are drawn from a fixed distribution.

Theo X. Olausson, João Monteiro, Michal Klein, Marco Cuturi

Published Tue, 10 Ma

Imagine you are the librarian of a massive library containing millions of books (the database). Every day, thousands of people walk in asking for a specific book that matches their current mood or topic (the query).

Traditionally, to find the best book, the librarian has to walk down every single aisle, pick up every book, and score it against the request (by computing an inner product). The book with the highest score wins. This exhaustive scan is Maximum Inner Product Search (MIPS). It works, but if the library has millions of books, it takes forever. It's like trying to find a needle in a haystack by checking every single piece of hay one by one.
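
The exhaustive scan is easy to state in code. Here is a minimal NumPy sketch (the toy data and array shapes are illustrative, not from the paper):

```python
import numpy as np

def brute_force_mips(keys, query):
    """Exhaustive MIPS: score every key against the query, keep the best.

    keys  : (n, d) array, one row per database item (a "book")
    query : (d,)   array, the visitor's request
    """
    scores = keys @ query          # n inner products: O(n * d) work per query
    return int(np.argmax(scores))

# A toy "library" of 1,000 books described by 16 numbers each.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 16))
query = rng.standard_normal(16)
best = brute_force_mips(keys, query)
```

Every query pays the full scan over all n books; the methods below try to make that per-query cost (nearly) constant instead.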

The Old Way: The "Brute Force" Librarian

Current methods try to speed this up by organizing the library with complex maps, indexes, or by squashing the books into smaller boxes (quantization). These are like having a "Find" button on a computer. But these maps are static. They treat every visitor as if they are a stranger, even if you know that 90% of your visitors usually ask for mystery novels on Tuesdays. They don't learn from the patterns of who is asking what.

The New Way: The "Amortized" Librarian

This paper proposes a new kind of librarian: a Neural Network that acts like a super-intelligent, experienced guide. Instead of building a map, we train this guide to instantly know the answer.

The authors call this Amortized MIPS. "Amortized" is a fancy word meaning "spreading the cost over time."

  • The Cost: It takes a long time to train this guide (like studying for years).
  • The Payoff: Once trained, the guide can answer any question from a regular visitor instantly, without looking at the shelves. The heavy lifting was done once during training, so the daily work is free.

The Secret Sauce: The "Support Function"

How does the guide know the answer so fast? The paper uses a clever mathematical trick involving something called a Support Function.

Think of the library's books as a group of people standing in a field.

  • Point in some direction and shout, "Who is farthest out in the direction I'm pointing?" The person who steps forward is the answer.
  • The best score you can get, viewed as a function of your direction, traces out a shape called a Support Function.
  • The Magic: If you know this shape, you don't need to look at the people at all. The slope (gradient) of the shape at your direction is exactly the position of the person you need!
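
This "the slope is the winner" claim is a real mathematical identity (Danskin's theorem): wherever the support function h(q) = max over keys of ⟨q, k⟩ is differentiable, its gradient at q is the maximizing key itself. A small NumPy check of that identity on toy data (not the paper's code):

```python
import numpy as np

def support(keys, q):
    """Support function of the key set: h(q) = max over keys of <q, key>."""
    return (keys @ q).max()

def support_gradient(keys, q):
    """Danskin's theorem: wherever the max is attained by a unique key,
    the gradient of h at q is that key -- exactly the MIPS answer."""
    return keys[np.argmax(keys @ q)]

rng = np.random.default_rng(1)
keys = rng.standard_normal((50, 8))
q = rng.standard_normal(8)

# Finite-difference check: nudging q changes h at exactly the rate the
# winning key predicts, so "the slope" really is the answer.
eps = 1e-6
fd_grad = np.array([(support(keys, q + eps * e) - support(keys, q)) / eps
                    for e in np.eye(8)])
```

The finite-difference estimate matches the winning key to numerical precision, because h is locally linear in q as long as the argmax key does not change.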

The paper builds two types of guides based on this idea:

1. SupportNet: The "Topographer"

This guide learns to draw a perfect 3D map of the field (the Support Function).

  • How it works: When a visitor asks a question, the guide evaluates its learned map at that exact spot; the slope (gradient) of the map at the query is the best key itself.
  • Pros: It rests on a clean mathematical identity and can be very accurate.
  • Cons: Computing the slope costs a little extra brainpower (an extra gradient computation) every time.
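
A real SupportNet is a trained network, but the mechanism can be previewed with a smooth stand-in: a log-sum-exp surrogate for the support function, whose gradient is a softmax-weighted blend of the keys that snaps onto the true answer as the temperature shrinks. The surrogate below illustrates the differentiate-the-map idea only; it is not the paper's architecture.

```python
import numpy as np

def smooth_support(keys, q, tau=0.05):
    """Differentiable stand-in for a trained SupportNet:
    tau * logsumexp(<q, k_i> / tau) -> max_i <q, k_i> as tau -> 0."""
    s = keys @ q
    m = s.max()
    return tau * np.log(np.exp((s - m) / tau).sum()) + m

def predicted_key(keys, q, tau=0.05):
    """Gradient of the surrogate w.r.t. q: a softmax average of the keys,
    which concentrates on the true MIPS answer for small tau."""
    s = keys @ q
    w = np.exp((s - s.max()) / tau)
    w /= w.sum()
    return w @ keys

# Tiny deterministic example: key 0 wins by a wide margin.
keys = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])
q = np.array([1.0, 0.2])
```

With scores (2.0, 0.2, -0.9), the softmax at tau = 0.05 is effectively one-hot, so predicted_key(keys, q) lands on keys[0] to machine precision.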

2. KeyNet: The "Intuitive Guide"

This guide skips the map entirely. It learns to look at the visitor and directly point to the correct book.

  • How it works: It's like a seasoned librarian who sees a customer and immediately says, "You want The Great Gatsby," without thinking about the map or the slope.
  • Pros: It's incredibly fast because it doesn't need to calculate slopes. It just outputs the answer.
  • Cons: It's a bit harder to train to be perfect, but once it learns, it's a speed demon.
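
A KeyNet maps the query straight to a key. The sketch below stands in for the paper's neural network with a plain least-squares linear map, trained on queries from a narrow "mostly mystery novels" distribution; the data, the linear model, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]])   # a 3-book "library"

# Queries cluster tightly around one taste, so one key wins every time.
train_q = np.array([1.0, 0.1]) + 0.03 * rng.standard_normal((500, 2))
targets = keys[np.argmax(train_q @ keys.T, axis=1)]   # true best key per query

# "KeyNet" here: a linear map fit once, up front (the amortized cost).
W, *_ = np.linalg.lstsq(train_q, targets, rcond=None)

def keynet(q):
    """One matrix-vector product per query -- no scan over the database."""
    return q @ W

test_q = np.array([1.0, 0.1]) + 0.03 * rng.standard_normal(2)
pred = keynet(test_q)                                     # predicted key vector
answer = int(np.argmin(((keys - pred) ** 2).sum(axis=1)))  # snap to nearest book
```

At query time the model never touches the scores of the other books: it predicts a key vector directly and snaps it to the nearest database entry, which here agrees with the exhaustive answer.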

The "Cluster" Trick

What if the library is too big for even one guide? The authors suggest splitting the library into 10 smaller sections (clusters).

  • They train a team of guides (or one multi-tasking guide) to first guess which section the visitor belongs to.
  • Once the section is identified, they only search that small section.
  • This is like a receptionist who asks, "Are you looking for fiction or non-fiction?" before you even walk into the main library. It saves a ton of time.
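
A rough NumPy sketch of the routing idea. The router here is just a set of centroid directions picked from the data; the paper trains guides for this step, so the centroids and the assignment rule below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
keys = rng.standard_normal((1000, 8))        # the full library

# Carve the library into sections: each key joins the centroid it scores
# highest against. (A trained router would replace this crude assignment.)
n_sections = 10
centroids = keys[rng.choice(len(keys), n_sections, replace=False)]
assign = np.argmax(keys @ centroids.T, axis=1)

def routed_mips(q):
    """Receptionist first: pick the most promising section from 10 scores,
    then scan only that section instead of all 1,000 books."""
    section = int(np.argmax(centroids @ q))
    members = np.flatnonzero(assign == section)
    return int(members[np.argmax(keys[members] @ q)])

q = rng.standard_normal(8)
best = routed_mips(q)
```

Each query costs one 10-way decision plus a scan of roughly a tenth of the library. The answer can differ from the exhaustive one when the true best book sits in a different section, which is exactly the accuracy trade-off the paper measures.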

The Results: Why Should We Care?

The authors tested this on real-world data (like searching through millions of Wikipedia articles or Q&A forums).

  • Speed: Their learned guides answered much faster than exhaustive search when queries came from the distribution they were trained on.
  • Accuracy: They found the right answers almost as often as the slow, exhaustive method.
  • Compression: They showed that you can "compress" the library. Instead of storing millions of books, you can store a small AI model that knows where the books are.

The Bottom Line

This paper is about teaching a computer to stop searching and start predicting.

Instead of treating every search query as a new, unknown problem, the computer learns the "personality" of the questions it usually gets. It builds a mental shortcut (a neural network) that instantly knows the answer, saving massive amounts of time and energy. It's the difference between looking up a phone number in a directory every time you need it, versus having a friend who knows everyone's number by heart and just tells you immediately.