Quantum Sketches, Hashing, and Approximate Nearest Neighbors

This paper proves that, despite potential quantum query-time speedups, it is impossible to compress $n$ -point approximate nearest neighbor data structures into $O(\log n)$ qubits within a broad quantum sketch model, as any such scheme requires $\Omega(n)$ qubits due to a reduction to quantum random access codes and Nayak's lower bound.

Original authors: Sajjad Hashemian

Published 2026-02-24

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Dream: The "Magic Quantum Filing Cabinet"

Imagine you have a massive library containing millions of books (your data points). You want to build a super-fast search engine that, when you ask for a book similar to a specific one, instantly finds a good match.

In the classical world, to do this, you need a lot of storage space (hard drives) to keep track of all those books. But quantum computers are famous for being able to store huge amounts of information in very small spaces.

The Dream: One might hope to compress this entire library of millions of books into a tiny, magical "quantum sketch": a state of just a few quantum bits (qubits), roughly the size of a single book's index. By measuring this tiny sketch in different ways, the computer would instantly find the nearest neighbor, just like a magic trick.

The Reality Check: This paper says, "No, you can't do that."

The author proves that no matter how clever your quantum magic is, you cannot compress a general dataset of $n$ items into a tiny quantum sketch (like $O(\log n)$ qubits) and still expect it to answer "nearest neighbor" questions correctly. To store the information needed to answer these questions, you actually need a quantum memory size that grows at least linearly with the number of items ( $\Omega(n)$ ).

The Analogy: The "Bit-Revealing" Game

To understand why this is impossible, let's play a game.

The Setup:
Imagine you have a secret code made of $n$ bits (a string of 0s and 1s). You want to hide this code inside a tiny quantum box (the sketch).
You have a set of $n$ "magic keys" (queries).

If you use Key #1, the box must reveal the 1st bit of your secret code.
If you use Key #2, the box must reveal the 2nd bit.
...and so on for all $n$ keys.

The Problem:
The paper shows that for certain carefully constructed datasets (binary strings in a space of only $\Theta(\log n)$ dimensions), answering "nearest neighbor" queries is exactly the same as playing this game.

If the nearest neighbor to Query #1 is "Book A," it means the 1st bit of your secret code is 0.
If the nearest neighbor is "Book B," it means the 1st bit is 1.

The Conclusion:
If your tiny quantum box can successfully tell you the nearest neighbor for every possible query, it effectively has to reveal every single bit of your secret code.
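But there is a fundamental theorem of quantum information (Nayak's Lower Bound) that says: if each of $n$ independent bits must be recoverable with success probability $p > 1/2$ , the quantum state needs at least $(1 - h(p))n$ qubits, where $h$ is the binary entropy function. For any fixed $p$ , that is a constant fraction of $n$ , far more than $O(\log n)$ .

If you try to squeeze all that information into a tiny box, the box cannot deliver: for at least some keys, the measurement will return the wrong bit too often to be useful.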

The "JL Reduction" Misconception

You might ask: "But wait, isn't there a math trick called Johnson-Lindenstrauss (JL) that shrinks high-dimensional data into very low dimensions?"

Yes, there is. The JL lemma says you can project a giant 1,000-dimensional object onto a 10-dimensional surface and keep the distances roughly the same. This makes people think, "If the data fits in 10 dimensions, maybe it only needs 10 qubits!"

The Paper's Rebuttal:
The author's point: "The dimension isn't the problem; the information is."
Even if the data lives in a tiny 10-dimensional space, the relationships between the points can be so complex that they encode $n$ independent secrets. Compressing the coordinates doesn't compress the answers to the questions. The "bottleneck" isn't how big the data looks; it's how much distinct information you need to remember to answer the questions.

So, Is Quantum Computing Useless for Search?

Absolutely not! The paper is very careful to say this is not a "no quantum advantage" result. It just rules out one specific type of compression.

Here is where quantum computers can still win:

The "Candidate Scanning" Analogy:
Imagine a classical search engine works like this:

It uses a hash function to find a small list of 1,000 "candidate" books that might be the answer.
It then checks all 1,000 books one by one to see which is the best match. This takes 1,000 steps.

The Quantum Upgrade:
If you have a quantum computer that can query those 1,000 candidates in "superposition" (coherent access to the whole list rather than one lookup at a time), it can use Grover's Algorithm.

Instead of checking 1,000 books one by one, the quantum computer can find the best match in roughly $\sqrt{1000}$ steps (about 32 steps).
This is a quadratic speedup. It's a huge improvement, but it's not the "magic compression" the paper debunked.

The Takeaway

The Dream Failed: You cannot shrink a massive, complex dataset into a microscopic quantum state and expect it to work reliably for all search queries. The information content is too high.
The Reason: The data contains too many independent "secrets" (bits) that need to be revealed by different queries. Quantum mechanics forbids hiding that many secrets in a small box.
The Silver Lining: Quantum computers are still great at searching through a list of candidates quickly. If you use classical methods to narrow down the list to a few candidates, a quantum computer can find the winner much faster than a classical one.

In short: Quantum computers can't be a "magic compression card" for your entire database, but they can be a "super-fast flashlight" to find the right item once you've narrowed down the search area.

1. Problem Statement

The paper addresses the fundamental question of whether Approximate Nearest Neighbor (ANN) data structures can be compressed into a logarithmic number of qubits ( $O(\log n)$ ) while retaining worst-case query power.

Context: Classical ANN algorithms often use Locality-Sensitive Hashing (LSH) to reduce high-dimensional data to compact sketches. Separately, the Johnson-Lindenstrauss (JL) lemma shows that $n$ points can be projected into $O(\log n)$ dimensions (for a fixed distortion) while approximately preserving pairwise distances. Combined with amplitude encoding (where a vector is represented by $\log(\text{dimension})$ qubits), this suggests that an entire dataset of $n$ points might be stored as a single short quantum state ( $O(\log n)$ qubits).
The Hypothesis: One might expect that query-dependent measurements on this state could act like hash functions, allowing for efficient ANN retrieval.
The Goal: To determine if a "quantum sketch" model exists where an $n$ -point dataset is encoded into an $m$ -qubit state $\rho_P$ , and queries are answered via measurements on fresh copies of $\rho_P$ , such that $m = O(\log n)$ .
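
The qubit counting behind this hypothesis can be sketched numerically. The snippet below is only an illustration under stated assumptions: it uses a commonly quoted sufficient JL dimension, $4 \ln n / (\varepsilon^2/2 - \varepsilon^3/3)$, and it assumes the whole projected dataset is laid out as one amplitude vector with $n \cdot k$ entries. Neither the constant nor that layout is taken from the paper, and the function names are hypothetical.

```python
import math

def jl_target_dim(n, eps):
    """A commonly quoted sufficient JL dimension (assumed constant, not from the paper)."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

def amplitude_qubits(num_amplitudes):
    """Qubits needed for a state vector with this many amplitudes."""
    return math.ceil(math.log2(num_amplitudes))

n, eps = 1_000_000, 0.5
k = jl_target_dim(n, eps)              # projected dimension, O(log n) for fixed eps
naive_qubits = amplitude_qubits(n * k) # assumed layout: one amplitude per coordinate

print(f"n = {n}, JL dimension k = {k}, naive qubit count = {naive_qubits}")
```

Under these assumptions a million points appear to fit in a few dozen qubits, which is exactly the count the lower bound in this paper shows cannot support worst-case ANN queries.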

2. Methodology

The author employs an information-theoretic lower bound approach, specifically reducing the ANN problem to the Quantum Random Access Code (QRAC) problem.

A. The Quantum Sketch Model

The paper defines a broad model for quantum sketches:

Encoder: Maps a dataset $P = \{p_1, \dots, p_n\}$ to an $m$ -qubit density matrix $\rho_P$ .
Decoder: For a query $q$ , performs an arbitrary quantum measurement (dependent on $q$ ) on a fresh copy of $\rho_P$ and outputs an index.
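Note: Granting the decoder a fresh copy of $\rho_P$ for every query is generous to the scheme, since no query has to cope with a state disturbed by earlier measurements. A lower bound proved in this model therefore also applies to schemes that must reuse a single copy.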

B. Construction of Hard Instances

To prove the lower bound, the author constructs a specific family of datasets in Hamming space $\{0, 1\}^d$ :

Code Construction: Uses a probabilistic method (Lemma 1) to generate $n$ codewords $C(1), \dots, C(n)$ of length $m = \Theta(\log n)$ (here $m$ denotes the codeword length, not the qubit count of the sketch) such that the Hamming distance between any distinct pair is at least $m/4$ .
Dataset Definition: For each bit string $x \in \{0, 1\}^n$ , a dataset $P_x$ is constructed. For each index $i$ , the dataset contains a point $p_i$ which is either $u_i = (C(i), 0)$ or $v_i = (C(i), 1)$ , depending on the bit $x_i$ .
- This creates a "tight pair" for each $i$ where the last bit distinguishes the two points, while the prefix ensures points from different indices are far apart.
Query Design: Queries are defined as $q_i = u_i$ .
Forcing Lemma: It is proven (Lemma 2) that for any approximation factor $c \ge 1$ , the correct ANN answer for query $q_i$ uniquely reveals the bit $x_i$ . If $x_i=0$ , the nearest neighbor is $u_i$ (distance 0). If $x_i=1$ , the nearest neighbor is $v_i$ (distance 1), while all other points are at distance $\ge c+1$ .
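
The construction above can be tried out classically on a small instance. The following is a minimal sketch, not the paper's proof: it draws random codewords and retries until all pairs are at least a quarter of the length apart (a stand-in for the probabilistic argument of Lemma 1), builds $P_x$, and checks that a brute-force nearest neighbor search on query $q_i = u_i$ returns a point whose last coordinate is $x_i$. The parameters `n = 16` and `length = 32` and all function names are illustrative assumptions, and the ANN answer is treated as the returned point.

```python
import random

def hamming(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(s != t for s, t in zip(a, b))

def random_code(n, length, rng):
    """Draw n random codewords, retrying until every pair is >= length/4 apart."""
    while True:
        code = [tuple(rng.randint(0, 1) for _ in range(length)) for _ in range(n)]
        if all(hamming(code[i], code[j]) >= length / 4
               for i in range(n) for j in range(i + 1, n)):
            return code

def build_dataset(code, x):
    """P_x: point i is u_i = (C(i), 0) if x_i == 0, else v_i = (C(i), 1)."""
    return [c + (bit,) for c, bit in zip(code, x)]

def decode_bit(points, code, i):
    """Answer query q_i = u_i by brute force and read x_i off the returned point."""
    q = code[i] + (0,)
    nearest = min(points, key=lambda p: hamming(p, q))
    return nearest[-1]

rng = random.Random(0)
n, length = 16, 32
code = random_code(n, length, rng)
x = [rng.randint(0, 1) for _ in range(n)]
P_x = build_dataset(code, x)
decoded = [decode_bit(P_x, code, i) for i in range(n)]
print(decoded == x)  # True: every bit of x is recovered from nearest neighbor answers
```

Because points from other indices are at distance at least 8 from $q_i$ here while $p_i$ is at distance at most 1, any correct nearest neighbor answer pins down $x_i$, which is the property the reduction relies on.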

C. Reduction to QRAC

The construction implies that if a quantum sketch exists for this ANN problem, it can be used to decode any bit $x_i$ from the state $\rho_{P_x}$ with probability $p > 1/2$ .

This transforms the ANN sketch into an $(n, m, p)$ -QRAC.
By Nayak's Lower Bound (Theorem 1), any QRAC for $n$ bits with success probability $p > 1/2$ requires $m \ge (1 - h(p))n$ qubits, where $h(p)$ is the binary entropy function.
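
To get a feel for the size of this bound, the snippet below evaluates $(1 - h(p))n$ for a few success probabilities and compares it with $\log_2 n$. It is a small numerical illustration only; the choice of $n$ and of the probabilities is arbitrary, and the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """h(p) = -p log2 p - (1 - p) log2 (1 - p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def nayak_min_qubits(n, p):
    """Lower bound on the qubits of an (n, m, p)-QRAC: m >= (1 - h(p)) * n."""
    return (1 - binary_entropy(p)) * n

n = 1_000_000
for p in (0.51, 2 / 3, 0.9, 0.99):
    bound = math.ceil(nayak_min_qubits(n, p))
    print(f"p = {p:.2f}: m >= {bound} qubits  (log2 n is about {math.log2(n):.1f})")
```

Even for a success probability barely above $1/2$, the bound is linear in $n$ and exceeds $\log_2 n$ for a dataset of this size.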

3. Key Contributions and Results

Main Theorem (Theorem 2)

For any approximation factor $c \ge 1$ and success probability $p > 1/2$ , any quantum sketch encoding an $n$ -point dataset into an $m$ -qubit state to solve $c$ -ANN must satisfy:
$m = \Omega(n)$
Implication: It is impossible to compress an arbitrary $n$ -point dataset into $O(\log n)$ qubits while preserving worst-case ANN capabilities. The memory requirement scales linearly with the number of points, not logarithmically.

Capacity Viewpoint (Proposition 1)

The paper generalizes this result using the VC-dimension (or Natarajan dimension for multi-class). If the family of datasets induces a function class that "shatters" $t$ queries, the memory required is $\Omega(t)$ . This highlights that the bottleneck is the combinatorial richness of the query-answer behavior, not the geometric dimension of the data.

Quantum Speedup Limits (Theorems 3 & 4)

While dataset compression is impossible, the paper clarifies where quantum advantages do exist:

Candidate Scanning: If the dataset is stored classically (or via QRAM) and hashing produces a candidate set of size $M$ , Grover's algorithm can reduce the search time from $O(M)$ to $O(\sqrt{M})$ .
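Optimality: The BBBV lower bound (Bennett, Bernstein, Brassard, and Vazirani) shows that unstructured search over $M$ candidates needs $\Omega(\sqrt{M})$ quantum queries, so this quadratic speedup is essentially optimal for unstructured candidate validation.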
Distinction: The lower bound applies to compressing the data into a state, whereas the speedup applies to searching within a candidate set given coherent access.
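
The $O(\sqrt{M})$ candidate scan can be illustrated with a tiny classical simulation of Grover's amplitudes. This is a sketch under simplifying assumptions that are not from the paper: exactly one of the $M$ candidates is the correct match, the oracle marks it perfectly, and the standard iteration count $\lfloor (\pi/4)\sqrt{M} \rfloor$ is used. The simulation itself costs $O(M)$ per iteration; it only shows how few oracle calls a quantum device would need.

```python
import math

def grover_success_probability(M, marked, iterations):
    """Simulate Grover's amplitudes over M items with a single marked index."""
    amp = [1 / math.sqrt(M)] * M            # uniform superposition over candidates
    for _ in range(iterations):
        amp[marked] = -amp[marked]          # oracle call: flip the marked amplitude
        mean = sum(amp) / M
        amp = [2 * mean - a for a in amp]   # diffusion: inversion about the mean
    return amp[marked] ** 2                 # probability of measuring the marked item

M = 1000
k = math.floor(math.pi / 4 * math.sqrt(M))  # standard iteration count for one marked item
prob = grover_success_probability(M, marked=123, iterations=k)
print(f"M = {M}: {k} oracle calls, success probability about {prob:.4f}")
```

With $M = 1000$ the marked candidate is found with high probability after a couple dozen oracle calls, compared with up to 1,000 checks for a linear scan.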

4. Significance and Implications

Refutation of the "Quantum Sketch" Dream: The paper definitively rules out the hope that JL-type dimensionality reduction combined with amplitude encoding can create ultra-compact quantum ANN data structures. The obstruction is informational, not geometric. Even in low dimensions ( $d = \Theta(\log n)$ ), the information content of $n$ distinct configurations requires $\Omega(n)$ qubits to be recoverable.
Clarification of Quantum Advantage: The work delineates the frontier of quantum utility in similarity search.
- No: Compressing arbitrary datasets into logarithmic qubits.
- Yes: Quadratic speedups in the query phase (searching candidates) when the data is accessible via coherent oracles.
Future Directions: The paper suggests that practical quantum ANN algorithms must either:
- Rely on restricted dataset families with low combinatorial dimension (avoiding the QRAC obstruction).
- Exploit specific structural properties (e.g., algebraic structure or promise gaps) within the candidate buckets that allow for speedups beyond standard Grover search.

Conclusion

Sajjad Hashemian's paper establishes a fundamental limit on quantum data compression for nearest neighbor search. It proves that while quantum mechanics offers quadratic speedups for searching through candidates, it cannot bypass the information-theoretic cost of storing arbitrary high-cardinality datasets in a compressed quantum state. The "bottleneck" for quantum ANN is the recoverable classical information, not the coordinate representation size.