Strict Optimality of Frequency Estimation Under Local Differential Privacy

Imagine you are a city planner trying to figure out how many people live in each neighborhood. You want to know the exact numbers to plan roads and schools, but you also have a golden rule: You cannot ask anyone for their address directly. If you did, you'd violate their privacy.

This is the world of Local Differential Privacy (LDP). Instead of collecting raw data, you ask people to send you a "noisy" version of their answer. They might say, "I live in Neighborhood A," or they might lie and say, "I live in Neighborhood B," just to protect themselves. The trick is to recover accurate overall counts from all these noisy answers without ever learning any one person's true answer.

For years, computer scientists have been trying to build the best "lie detector" to figure out the real frequencies from these noisy reports. This paper, by Mingen Pan from Google, solves a major mystery: What is the absolute best possible accuracy we can ever achieve?

Here is the breakdown of the paper's discoveries using simple analogies:

1. The "Perfect Lie" Configuration

Imagine you are playing a game in which you hold a secret number between 1 and 1,000 and must report on it. To protect your privacy, you are allowed to lie, but the rules of the lie are strict:

If the number is your secret, you must tell the truth with a certain probability.
If it's not your secret, you must lie with a specific, calculated probability.

The paper proves that the best possible strategy isn't a complex, messy game. It's a very specific, symmetrical game.

The Analogy: Think of a perfectly balanced spinning wheel. No matter where you start, the wheel looks the same. The paper proves that the most accurate way to estimate frequencies is to use a "spinning wheel" mechanism where every option has the exact same chance of being picked, and the "lie" is perfectly symmetrical.
The Result: They found the exact mathematical formula for this perfect wheel. It turns out, the current best methods (like "Subset Selection") are already using this perfect wheel. They are strictly optimal. You can't do better than this; it's the physical limit of accuracy given the privacy rules.

2. The "Communication Cost" (The Size of the Message)

There's a catch. To get this perfect accuracy, the "perfect wheel" might require a message that is huge.

The Problem: If you have 1,000 neighborhoods, telling the server "I support neighborhoods 1, 2, and 5" might require a very long message. Long messages are expensive to send and slow to process.
The Breakthrough: The paper discovered that you don't need the entire wheel to get the perfect result. You only need a tiny, specific slice of it.
The Analogy: Imagine you need to describe a massive painting. You don't need to send the whole canvas. You only need to send a few specific brushstrokes that, when combined, allow the viewer to reconstruct the whole image perfectly.
The Math: They proved you can shrink the message size down to roughly twice the base-2 logarithm of the number of options. If you have 100 options, you don't need 100 bits of data; you only need about 13 bits. This is a massive reduction in "data traffic."

3. The Three Tools (Which one should you use?)

The paper proposes three different tools to achieve this perfect accuracy, depending on your situation:

Tool A: Subset Selection (The "Gold Standard")
- How it works: It uses the "perfect wheel" directly.
- Pros: It is mathematically perfect.
- Cons: The message size is still a bit large for huge lists (like millions of items).
- Best for: Small to medium-sized lists.
Tool B: Optimized Count-Mean Sketch (The "Smart Shortcut")
- How it works: It's a modified version of a popular, fast method called "Count-Mean Sketch." The paper tweaked it to make it nearly perfect.
- Pros: It sends tiny messages (very efficient) and is super fast.
- Cons: It only gets close to "perfect" when your list of items is large enough (roughly 100 or more items).
- Best for: Huge lists (like millions of web pages or user IDs). The paper shows that for big lists, this shortcut is practically indistinguishable from the perfect method.
Tool C: Weighted Subset Selection (The "Custom Builder")
- How it works: This is a new algorithm the authors built to create the "tiny slice" of the perfect wheel mentioned in point #2.
- Pros: It achieves the perfect accuracy with a much smaller message size than plain Subset Selection.
- Cons: It takes a lot of computer power to design the wheel before you can use it.
- Best for: When you need exactly optimal accuracy with small messages and have time to prepare the system in advance.

4. The Real-World Test

The authors didn't just do math on paper; they ran experiments.

They tested these tools on fake data (like a Zipf distribution, which mimics how words appear in a book) and real data (clicks on a news website).
The Result: The tools worked exactly as the math predicted. The "Optimized Count-Mean Sketch" was so good that for large lists, it was impossible to tell the difference between it and the theoretical "perfect" limit.

The Takeaway

This paper is like finding the speed limit for a car.

We now know the absolute fastest speed (highest accuracy) possible for privacy-preserving data collection.
We know that the current "fastest cars" (Subset Selection) are already hitting that speed limit.
We found a way to build a smaller, lighter car (Optimized Count-Mean Sketch) that can hit that same speed limit if the road is long enough (large dictionary size).

In short: If you are collecting private data, you now have a clear rulebook. For small lists, use the established "Subset Selection." For massive lists, use the new "Optimized Count-Mean Sketch." You can't get more accurate than this without breaking the privacy rules.

Here is a detailed technical summary of the paper "Strict Optimality of Frequency Estimation Under Local Differential Privacy" by Mingen Pan.

1. Problem Statement

The paper addresses the fundamental problem of frequency estimation under Local Differential Privacy (LDP). In LDP, data is perturbed locally on the client side before transmission to a server to prevent privacy breaches. The goal is to estimate the frequency distribution of a dataset over a dictionary of size $d$ with maximum precision (minimum error) given a privacy budget $\epsilon$.
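As a concrete illustration of local perturbation followed by server-side debiasing, here is a minimal Python sketch of generalized (k-ary) randomized response, one of the classic baselines. It is not taken from the paper; the function names, the toy dataset, and the parameter choices are assumptions made for illustration only.

```python
import math
import random
from collections import Counter

def grr_perturb(x, d, eps, rng):
    """Report the true value x with probability p, otherwise a uniform other value."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if rng.random() < p:
        return x
    other = rng.randrange(d - 1)          # index among the d - 1 values different from x
    return other if other < x else other + 1

def grr_estimate(reports, d, eps):
    """Unbiased frequency estimates computed from the perturbed reports."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)   # probability of reporting the truth
    q = 1.0 / (math.exp(eps) + d - 1)             # probability of each specific lie
    counts = Counter(reports)
    return [(counts.get(v, 0) / n - q) / (p - q) for v in range(d)]

rng = random.Random(0)
d, eps = 4, 1.0
# Hypothetical toy dataset: value 3 never occurs.
true_data = [0] * 6000 + [1] * 3000 + [2] * 1000
reports = [grr_perturb(x, d, eps, rng) for x in true_data]
estimates = grr_estimate(reports, d, eps)
print([round(e, 3) for e in estimates])
```

The ratio between the two reporting probabilities is exactly $e^\epsilon$, which is what makes the mechanism satisfy $\epsilon$-LDP. The estimates are noisy but unbiased, and they always sum to 1 for this estimator.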

While numerous algorithms exist (e.g., Randomized Response, Subset Selection, Count-Mean Sketch), it has been an open question whether existing state-of-the-art methods (specifically Subset Selection) are strictly optimal. Previous work established lower bounds for error, but a significant gap remained between these theoretical lower bounds and the actual performance of existing algorithms, particularly in the constant terms of the error metrics.

2. Methodology

The authors employ a rigorous mathematical framework to derive strict lower bounds and construct optimal estimators:

Symmetric and Extremal Configurations:
- The paper proves that any optimal frequency estimator can be transformed into one with an extremal configuration (where output probabilities for any input are either $p_o$ or $e^\epsilon p_o$) and a symmetric configuration (where self-support and cross-support probabilities are constant across all inputs).
- They introduce a Uniformly Random Permutation (URP) technique. By applying a random permutation to the dictionary and the perturbation matrix, they demonstrate that the worst-case error of the "symmetrized" version of any estimator is no larger than the worst-case error of the original, so nothing is lost by restricting attention to symmetric estimators.
Derivation of Strict Lower Bounds:
- The authors formulate the L1 and L2 loss functions as functions of the support size $k$, which is the number of dictionary elements supported by a single response.
- By optimizing the reconstruction matrix and the perturbation matrix under the constraints of symmetry and extremality, they derive the strict lower bounds for L1 and L2 losses.
- They prove that the optimal support size $k$ is approximately $\frac{d}{e^\epsilon + 1}$.
Communication Cost Analysis:
- The paper analyzes the communication cost (bits required to transmit a response). Using Carathéodory's theorem, they prove that an optimal estimator does not need to utilize all $\binom{d}{k}$ possible subsets. Instead, it suffices to use at most $\frac{d(d-1)}{2} + 1$ distinct responses to satisfy the necessary constraints for optimality.
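
The configuration properties and the role of the support size can be checked numerically on a small dictionary. The following is a minimal Python sketch, not code from the paper: it enumerates every size-$k$ subset, builds the Subset Selection perturbation matrix, verifies that the matrix is extremal and symmetric, and then scans $k$ for the smallest L2 loss. The function names are hypothetical, and `ss_l2_loss` assumes a fixed (non-random) dataset and the usual count-based debiasing estimator.

```python
import math
from itertools import combinations

def subset_selection_matrix(d, k, eps):
    """Return (subsets, P) with P[x][j] = Pr[response = subsets[j] | input = x]."""
    subsets = list(combinations(range(d), k))
    e = math.exp(eps)
    # p_o is chosen so that every row of P sums to 1.
    p_o = 1.0 / (e * math.comb(d - 1, k - 1) + math.comb(d - 1, k))
    P = [[e * p_o if x in s else p_o for s in subsets] for x in range(d)]
    return subsets, P

def support_probability(subsets, P, x, target):
    """Pr[response contains `target` | input = x]."""
    return sum(pr for s, pr in zip(subsets, P[x]) if target in s)

def ss_l2_loss(d, k, eps, n=1):
    """L2 loss of Subset Selection with support size k (fixed dataset, n users)."""
    e = math.exp(eps)
    p_self = k * e / (k * e + d - k)
    p_cross = (p_self * (k - 1) + (1 - p_self) * k) / (d - 1)
    var = p_self * (1 - p_self) + (d - 1) * p_cross * (1 - p_cross)
    return var / (n * (p_self - p_cross) ** 2)

d, eps = 8, math.log(3)              # e^eps = 3, so d / (e^eps + 1) = 2
subsets, P = subset_selection_matrix(d, 2, eps)

# Extremal: only two distinct probabilities appear (p_o and e^eps * p_o).
entries = {v for row in P for v in row}
# Symmetric: self-support and cross-support do not depend on the input.
self_vals = [support_probability(subsets, P, x, x) for x in range(d)]
cross_vals = [support_probability(subsets, P, x, t)
              for x in range(d) for t in range(d) if t != x]

best_k = min(range(1, d), key=lambda k: ss_l2_loss(d, k, eps))
print("distinct matrix entries:", len(entries))
print("self-support spread:", max(self_vals) - min(self_vals))
print("cross-support spread:", max(cross_vals) - min(cross_vals))
print("best k:", best_k, "vs d / (e^eps + 1) =", d / (math.exp(eps) + 1))
```

Full enumeration is only feasible for tiny $d$; it is used here purely to make the two properties visible, not as a practical implementation.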

3. Key Contributions

A. Theoretical Strict Optimality

The paper establishes the first strict lower bounds for L1 and L2 losses in LDP frequency estimation.

Result: For $d \ge e^\epsilon + 1$, the minimum L2 loss is:
$\min_{\hat{f}} L_2(\hat{f}) = \frac{(d-1)[4de^\epsilon - (e^\epsilon + 1)^2]}{nd(e^\epsilon - 1)^2}$
Significance: This closes the gap between theoretical lower bounds and existing algorithms. It proves that the Subset Selection (SS) mechanism, previously considered state-of-the-art, is indeed strictly optimal in terms of precision.
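
For readers who want to plug in numbers, the closed form above can be evaluated directly. The Python sketch below is only a calculator for the quoted formula; the function names are hypothetical, and the comparison term $\frac{4de^\epsilon}{n(e^\epsilon - 1)^2}$ is simply the leading-order part of the same expression for large $d$.

```python
import math

def min_l2_loss(d, eps, n):
    """Closed-form minimum L2 loss quoted above (stated for d >= e^eps + 1)."""
    e = math.exp(eps)
    if d < e + 1 - 1e-9:              # small tolerance for floating-point rounding
        raise ValueError("formula is stated only for d >= e^eps + 1")
    return (d - 1) * (4 * d * e - (e + 1) ** 2) / (n * d * (e - 1) ** 2)

def large_d_approximation(d, eps, n):
    """Leading-order term of the same formula: 4 d e^eps / (n (e^eps - 1)^2)."""
    e = math.exp(eps)
    return 4 * d * e / (n * (e - 1) ** 2)

eps = math.log(3)                     # e^eps = 3 keeps the arithmetic easy to follow
for d in (8, 100, 10_000):
    exact = min_l2_loss(d, eps, n=1)
    ratio = exact / large_d_approximation(d, eps, n=1)
    print(d, round(exact, 4), round(ratio, 4))
```

The loss scales as $1/n$, so collecting ten times as many reports cuts the bound by a factor of ten, and the ratio to the leading-order term approaches 1 as $d$ grows.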

B. Communication Cost Reduction

The authors derive that the communication cost for an optimal estimator can be bounded by:
$\log_2\left(\frac{d(d-1)}{2} + 1\right)$
This is $O(\log d)$, roughly $2\log_2 d - 1$ bits for large $d$, which is far below the cost of standard Subset Selection ($O(d)$ or $O(k \log d)$).
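
To get a feel for the sizes involved, the following Python sketch compares three ways of encoding a response. It is an illustrative calculation under stated assumptions, not the paper's accounting: the helper names are hypothetical, and $k$ is simply rounded from $\frac{d}{e^\epsilon + 1}$.

```python
import math

def bits_to_index(num_responses):
    """Whole bits needed to index one of `num_responses` distinct messages."""
    return (num_responses - 1).bit_length()

def communication_costs(d, eps):
    k = max(1, round(d / (math.exp(eps) + 1)))    # support size near d / (e^eps + 1)
    return {
        "bit vector": d,                                       # one bit per item
        "index into all k-subsets": bits_to_index(math.comb(d, k)),
        "reduced response set": bits_to_index(d * (d - 1) // 2 + 1),
    }

for d in (100, 10_000):
    print(d, communication_costs(d, eps=1.0))
```

Using `bit_length` on an integer avoids floating-point rounding when the number of possible subsets is astronomically large.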

C. Proposed Algorithms

The paper proposes and analyzes three algorithms:

Subset Selection (SS): Proven to be strictly optimal in precision but has high communication cost ($O(d)$).
Weighted Subset Selection (WSS): An algorithm constructed to achieve the strict lower bound with the reduced communication cost of $\log_2\left(\frac{d(d-1)}{2} + 1\right)$. It uses Linear Programming (LP) or Non-Negative Least Squares (NNLS) to find a sparse set of responses that satisfy the optimal symmetric configuration.
- Trade-off: High precomputation cost ($O(d^6)$) to generate the support matrix.
Optimized Count-Mean Sketch (OCMS): A modified version of the Count-Mean Sketch.
- Modification: Expands the dictionary size to the next prime number, sets the hash range $m \approx 1 + e^\epsilon$, and uses a specific hash family.
- Result: When the dictionary size $d$ is sufficiently large (e.g., $d \ge 100$ for $\epsilon=1$), OCMS is practically indistinguishable from the theoretical optimum, with error increases of less than 0.1%. It offers logarithmic communication cost and zero precomputation cost.
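
The general shape of a count-mean-sketch-style estimator can be sketched as follows. This is a loose Python illustration under strong assumptions, not OCMS itself: the salted SHA-256 hash is a stand-in for the paper's specific hash family (which is not reproduced here, nor is the prime-sized dictionary expansion), and all function names and the toy dataset are hypothetical. Only the choice of hash range $m \approx 1 + e^\epsilon$ is taken from the description above.

```python
import hashlib
import math
import random

def salted_hash(seed, x, m):
    """Stand-in hash into range(m); NOT the hash family used in the paper."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def client_report(x, m, eps, rng):
    """Hash x with a fresh seed, then apply randomized response over range(m)."""
    seed = rng.getrandbits(32)
    h = salted_hash(seed, x, m)
    p = math.exp(eps) / (math.exp(eps) + m - 1)
    if rng.random() >= p:                  # lie: report a different hashed value
        other = rng.randrange(m - 1)
        h = other if other < h else other + 1
    return seed, h

def server_estimate(reports, d, m, eps):
    """Debias the fraction of reports whose value matches each candidate's hash."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + m - 1)
    estimates = []
    for x in range(d):
        hits = sum(1 for seed, y in reports if salted_hash(seed, x, m) == y)
        # A non-matching item collides with probability about 1/m (assumed uniform hash).
        estimates.append((hits / n - 1 / m) / (p - 1 / m))
    return estimates

rng = random.Random(1)
d, eps = 20, 1.0
m = max(2, round(1 + math.exp(eps)))       # hash range near 1 + e^eps
# Hypothetical toy dataset: two heavy items plus a uniform tail.
data = [0] * 10000 + [1] * 6000 + [rng.randrange(2, d) for _ in range(4000)]
reports = [client_report(x, m, eps, rng) for x in data]
estimates = server_estimate(reports, d, m, eps)
print(m, [round(e, 3) for e in estimates[:3]])
```

Each report here is just a seed plus a value in a range of about $1 + e^\epsilon$, which is why the message size does not grow with the support size the way a subset does.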

4. Experimental Results

The authors conducted experiments on both synthetic (Zipf distribution) and real-world (Kosarak click-stream) datasets:

Precision: All three proposed algorithms (SS, WSS, OCMS) align perfectly with the derived strict theoretical lower bounds for L1 and L2 losses.
OCMS Performance: In the real-world experiment ($d=26,000$), OCMS performed identically to the optimal Subset Selection, validating the theoretical claim that for large $d$, the approximation error is negligible.
Consistency: The empirical results confirm that the theoretical derivations hold across varying privacy budgets ($\epsilon$) and dataset sizes.
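
A scaled-down version of this kind of check can be reproduced in a few lines. The Python sketch below is not the authors' experimental setup: the dictionary size, Zipf exponent, number of users, and number of trials are arbitrary assumptions, and the function names are hypothetical. It runs Subset Selection on synthetic Zipf data and compares the average squared error with the closed-form L2 bound from Section 3.

```python
import math
import random

def ss_report(x, d, k, eps, rng):
    """Subset Selection: include x with probability p_self, fill the rest uniformly."""
    e = math.exp(eps)
    p_self = k * e / (k * e + d - k)
    others = [v for v in range(d) if v != x]
    if rng.random() < p_self:
        return {x, *rng.sample(others, k - 1)}
    return set(rng.sample(others, k))

def ss_estimate(reports, d, k, eps):
    """Count-based unbiased estimator for Subset Selection reports."""
    e = math.exp(eps)
    n = len(reports)
    p_self = k * e / (k * e + d - k)
    p_cross = (p_self * (k - 1) + (1 - p_self) * k) / (d - 1)
    counts = [0] * d
    for subset in reports:
        for v in subset:
            counts[v] += 1
    return [(c / n - p_cross) / (p_self - p_cross) for c in counts]

def zipf_dataset(d, n, s, rng):
    """Draw n items with probability proportional to 1 / rank^s."""
    weights = [1 / (rank ** s) for rank in range(1, d + 1)]
    return rng.choices(range(d), weights=weights, k=n)

def l2_lower_bound(d, eps, n):
    e = math.exp(eps)
    return (d - 1) * (4 * d * e - (e + 1) ** 2) / (n * d * (e - 1) ** 2)

rng = random.Random(7)
d, eps, n, trials = 8, math.log(3), 5000, 40
k = round(d / (math.exp(eps) + 1))         # equals 2 for these parameters
errors = []
for _ in range(trials):
    data = zipf_dataset(d, n, 1.2, rng)
    truth = [data.count(v) / n for v in range(d)]
    est = ss_estimate([ss_report(x, d, k, eps, rng) for x in data], d, k, eps)
    errors.append(sum((a - b) ** 2 for a, b in zip(est, truth)))
mean_error = sum(errors) / trials
print(mean_error, l2_lower_bound(d, eps, n))
```

The parameters are chosen so that $\frac{d}{e^\epsilon + 1}$ is an integer, which is the case where the sampled mechanism should sit on the bound up to simulation noise.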

5. Significance and Practical Guidelines

This paper provides a definitive solution to the frequency estimation problem under LDP, offering clear guidelines for deployment:

Small Dictionaries ($d$ is small): Use Subset Selection (or Weighted Subset Selection if communication bandwidth is the primary constraint and precomputation is feasible).
Large Dictionaries ($d$ is large, e.g., $>100$): Use Optimized Count-Mean Sketch (OCMS). It offers near-optimal precision with significantly lower communication costs ($O(\log d)$) and no heavy precomputation requirements, making it the most practical choice for large-scale systems (like those used by tech giants).

Conclusion: The paper resolves the long-standing question of whether existing LDP frequency estimators are optimal. It proves that Subset Selection is strictly optimal and provides a pathway to achieve this optimality with reduced communication costs via WSS and OCMS, bridging the gap between theoretical limits and practical implementation.