From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings

This paper shows that optimal offline semantic caching for LLM embeddings is NP-hard, and proposes polynomial-time offline heuristics along with novel online policies that exploit recency, frequency, and semantic locality to improve response speed, reduce cost, and preserve answer quality.

Dvir David Biton, Roy Friedman

Published 2026-03-05

Imagine you run a very busy, high-end restaurant (the Large Language Model, or LLM). Customers come in asking for complex dishes (queries). Cooking these dishes from scratch takes a lot of time, energy, and expensive ingredients.

To save time and money, you decide to keep a "Memory Bank" (a Semantic Cache) of dishes you've already cooked. If a customer asks for something you've made before, you just serve the leftover instead of cooking a new one.

The Problem: "Exact" vs. "Close Enough"

In the old days of computer caching, your Memory Bank only worked if the customer asked for the exact same dish word-for-word.

  • Old Way: Customer asks for "Spicy Chicken Noodles." You check your bank. If you have "Spicy Chicken Noodles," great! If they ask for "Noodles with Spicy Chicken," you have to cook a new one.

But with modern AI, we want to be smarter. We want Semantic Caching. This means if the customer asks for "Noodles with Spicy Chicken," you recognize it's the same idea as "Spicy Chicken Noodles" and serve the leftover.

The Catch:
In a normal kitchen, you know exactly which pot is which. In the AI kitchen, every dish is represented by a "flavor fingerprint" (an embedding vector). Two dishes might have fingerprints that are very close but not identical.

  • If you keep a pot of "Spicy Chicken Noodles," does it cover "Noodles with Spicy Chicken"? Maybe. Maybe not.
  • If you fill your fridge with 100 pots of slightly different noodle dishes, which ones should you throw out when the fridge is full?

This is the puzzle this paper solves: How do you manage a fridge where "close enough" counts as a match?
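The "close enough" lookup above can be sketched in a few lines. This is a minimal toy, not the paper's implementation: it assumes a fixed cosine-similarity threshold decides what counts as a hit, and the hand-written 2-D vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy semantic cache: a hit is any stored entry whose embedding
    is within a similarity threshold of the query embedding."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached answer)

    def get(self, query_emb):
        # Return the closest cached answer if it is "close enough".
        best, best_sim = None, -1.0
        for emb, answer in self.entries:
            sim = cosine(query_emb, emb)
            if sim > best_sim:
                best, best_sim = answer, sim
        if best_sim >= self.threshold:
            return best  # semantic hit: serve the leftover
        return None      # miss: must "cook" a new answer

    def put(self, query_emb, answer):
        self.entries.append((query_emb, answer))

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "spicy chicken noodles")
print(cache.get([0.99, 0.1]))  # close enough -> cached answer
print(cache.get([0.0, 1.0]))   # too far -> None (miss)
```

The whole puzzle hides in that one `threshold` parameter: set it too loose and customers get the wrong dish; too strict and you are back to exact-match caching.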


The Big Discovery: The "Magic Crystal Ball" is Broken

In computer science, there's a famous rule called Belady's OPT. It's like having a magic crystal ball that tells you exactly what every future customer will order. If you know the future, the perfect strategy is simple: whenever the fridge is full, throw out the dish whose next order is farthest away, and keep the ones that will be needed again soonest.

The Paper's Finding:
The authors proved that in this new "Semantic Kitchen," the magic crystal ball doesn't work anymore.

  • Why? Because one dish (e.g., "Spicy Chicken") might cover many future requests (e.g., "Spicy Chicken," "Chicken Noodles," "Spicy Noodles").
  • If you use the old crystal ball logic, you might keep a dish that covers just one future request, while throwing away a dish that could have covered ten different requests because they all taste "close enough."

They proved that finding the perfect way to manage this semantic fridge is computationally intractable: the problem is NP-hard, meaning no known algorithm can solve it quickly at scale. It's like trying to solve a Sudoku puzzle where the rules change every time you move a piece.


The Solutions: New Strategies for the Kitchen

Since we can't have a magic crystal ball, the authors invented three new ways to manage the fridge, plus a few tweaks to old methods.

1. The "Cluster" Approach (CRVB)

  • The Idea: Group similar dishes together. If "Spicy Chicken," "Chicken Noodles," and "Spicy Noodles" are all in the same "Cluster," treat them as one big item.
  • The Flaw: In the real world, flavors overlap weirdly. "Spicy Chicken" might be close to "Chicken Noodles," and "Chicken Noodles" might be close to "Beef Noodles," but "Spicy Chicken" and "Beef Noodles" might be totally different. The clusters get messy, and this method isn't perfect.
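The chaining flaw can be demonstrated with a small sketch. This is a generic threshold-clustering toy, not the paper's exact CRVB algorithm; the 1-D embeddings and the `tol` value are illustrative assumptions.

```python
def cluster(items, tol=1.5):
    # Greedy threshold clustering: an item joins the first cluster
    # containing anything within distance tol, else starts a new one.
    clusters = []
    for name, emb in items.items():
        for c in clusters:
            if any(abs(emb - e) <= tol for _, e in c):
                c.append((name, emb))
                break
        else:
            clusters.append([(name, emb)])
    return clusters

# Chaining: A is close to B, B is close to C, but A and C are not close.
items = {"spicy chicken": 0.0, "chicken noodles": 1.0, "beef noodles": 2.0}
result = cluster(items)
print(len(result))  # 1 cluster, even though 0.0 and 2.0 are farther than tol
```

All three dishes collapse into one cluster even though the two endpoints are not "close enough" to each other, which is exactly why treating a cluster as one big item loses accuracy.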

2. The "Volume" Approach (FGRVB)

  • The Idea: Imagine you can see the future (offline). You look at all the dishes you will need to cook. You pick the specific dishes that, if kept, would "cover" the most future orders.
  • Analogy: You keep a giant pot of "Universal Soup" because it tastes close enough to 50 different future requests, rather than keeping 50 tiny cups of specific soups.
  • Result: This is the best "offline" strategy, but it requires knowing the future, so you can't use it in real-time.
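The offline "volume" idea has the flavor of greedy set cover. The sketch below is an assumption-laden simplification (1-D embeddings, a tolerance of 1, greedy tie-breaking by list order), not the paper's exact FGRVB procedure: knowing the full future stream, it repeatedly keeps the candidate that covers the most still-uncovered future requests.

```python
def covers(emb, q, tol=1):
    return abs(emb - q) <= tol

def greedy_offline(candidates, future, capacity, tol=1):
    # Greedy set cover over the future request stream.
    remaining = list(range(len(future)))  # indices not yet covered
    chosen = []
    while len(chosen) < capacity and remaining:
        best = max(candidates,
                   key=lambda e: sum(covers(e, future[i], tol) for i in remaining))
        chosen.append(best)
        remaining = [i for i in remaining if not covers(best, future[i], tol)]
    return chosen

future = [0, 1, 2, 5, 9]        # everything that will be ordered
candidates = [1, 5, 9]          # dishes we could keep
print(greedy_offline(candidates, future, capacity=2))  # -> [1, 5]
```

The "Universal Soup" at 1 covers three future orders in one slot, so it gets picked first; a second slot then mops up what it missed.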

3. The "Next Hit" Approach (RGRVB)

  • The Idea: Instead of looking at all future orders, just look at the very next order that matches this dish.
  • Analogy: You keep the dish that will be needed tomorrow, even if it won't be needed next week. This is good for busy, chaotic kitchens where things change fast.
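As a rough sketch (again with toy 1-D embeddings, and not the paper's exact RGRVB rule), the "next hit" policy evicts the cached entry whose next covered request lies farthest in the future, a semantic analogue of Belady's rule:

```python
def covers(emb, q, tol=1):
    return abs(emb - q) <= tol

def evict_choice(cache_entries, future, tol=1):
    # Evict the entry whose next *covered* request is farthest away.
    def next_hit(emb):
        for i, q in enumerate(future):
            if covers(emb, q, tol):
                return i
        return len(future)  # never needed again: evict first
    return max(cache_entries, key=next_hit)

future = [4, 4, 0, 9]
print(evict_choice([0, 4, 9], future))  # 9 is needed last -> evict 9
```

It only scans until each entry's first match, which is why it is cheaper than the full-coverage approach and better suited to fast-changing workloads.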

The Real-World Winner: "SphereLFU"

Since we can't see the future, we need a strategy that works right now (Online). The authors tested many old strategies (like "Throw out the oldest dish" or "Throw out the least popular dish") and found they were okay, but not great.

They invented a new champion called SphereLFU.

  • How it works: Imagine the kitchen floor is covered in a soft, glowing fog. Every time a customer orders a dish, the fog gets thicker in that spot.
  • The Magic: If a customer orders "Spicy Chicken," the fog doesn't just get thicker on the "Spicy Chicken" pot. It gets thicker on the "Chicken Noodles" pot and the "Spicy Noodles" pot too, because they are nearby in the fog.
  • The Result: The pots in the "thickest" parts of the fog (the most popular semantic areas) stay in the fridge. The pots in the empty, thin fog get thrown out.
  • Why it wins: It understands that popularity isn't just about one exact word; it's about the area of the menu that people are hungry for.
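The fog metaphor above can be sketched as a frequency counter that bleeds into neighbors. This is a minimal toy of the idea as described here, with assumed details (1-D embeddings, a fixed radius, simple lowest-count eviction) rather than the authors' exact SphereLFU implementation:

```python
class SphereLFU:
    def __init__(self, capacity, radius=1.0):
        self.capacity = capacity
        self.radius = radius
        self.counts = {}  # cached 1-D embedding -> frequency in its sphere

    def request(self, q):
        hit = False
        for emb in self.counts:
            if abs(emb - q) <= self.radius:
                self.counts[emb] += 1  # the "fog" thickens on neighbors too
                hit = True
        if not hit:
            if len(self.counts) >= self.capacity:
                # Evict the entry sitting in the thinnest fog.
                coldest = min(self.counts, key=self.counts.get)
                del self.counts[coldest]
            self.counts[q] = 1
        return hit

cache = SphereLFU(capacity=2)
print(cache.request(0.0))  # miss: insert 0.0
print(cache.request(0.5))  # hit: within radius of 0.0, its count rises
print(cache.request(5.0))  # miss: insert 5.0
print(cache.request(9.0))  # miss: fridge full, coldest entry (5.0) evicted
```

Note the key difference from classic LFU: the hit on 0.5 raised the counter of the 0.0 entry even though the queries were not identical, so popularity accrues to a semantic region rather than to one exact query.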

The Takeaway

  1. Old rules don't apply: You can't just use standard "first-in, first-out" or "most frequent" rules for AI caches because "close enough" makes things messy.
  2. Perfect is impossible: Finding the absolute best way to manage these caches is mathematically too hard to do quickly.
  3. The New Best Practice: The SphereLFU method is the current winner. It treats the cache like a living map of popularity, keeping the "center" of popular topics and throwing out the weird, isolated outliers.

Why does this matter?
By using these smarter caching strategies, companies can make AI chatbots faster and cheaper. Instead of paying to "cook" a new answer every time, the system can serve a "close enough" answer from memory, saving massive amounts of money and energy.