Optimal partition selection with Rényi differential privacy

This paper generalizes optimal partition selection algorithms to Rényi differential privacy, introducing an improved mechanism for ℓ²-bounded weighted partitions that enhances state-of-the-art methods while demonstrating a fundamental performance gap between additive and non-additive noise approaches when frequency data is also released.

Charlie Harrison, Pasin Manurangsi

Published Wed, 11 Ma

Imagine you are a librarian trying to create a "Top 10 Most Popular Books" list for a library, but you have a strict rule: you cannot reveal anything about any single person's reading habits. This is the world of Differential Privacy.

In this scenario, the "books" are data partitions (like keywords, URLs, or product IDs), and the "popularity" is how many people checked them out. The challenge is: How do you pick the most popular books to show the public without accidentally revealing that you read a specific book?

This paper, written by researchers from Google, tackles this problem by inventing a smarter, more efficient way to make these privacy-safe lists. Here is the breakdown using simple analogies.

1. The Problem: The "Noisy" Filter

Traditionally, to protect privacy, librarians would add a little bit of "static" or "noise" to the popularity counts.

  • The Old Way (Gaussian/Laplace Noise): Imagine you have a scale to weigh books. To hide the exact weight of a single book, you shake the scale a bit. If a book is very popular, the shaking won't hide it. If it's barely popular, the shaking might make it look popular (a false alarm) or hide it completely.
  • The Limitation: The old methods were designed for a specific type of "shaking" (mathematically called additive noise). They work okay, but they are a bit clumsy. They often throw away good data just to be safe, or they keep bad data that shouldn't be there.
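The "shake the scale, then check a threshold" recipe above can be sketched in a few lines. Everything here is illustrative: the function name, the use of Gaussian noise, and the `sigma`/`threshold` values are assumptions for demonstration, not values calibrated to a real privacy budget.

```python
import random

def release_partitions(counts, sigma=5.0, threshold=20.0):
    """Classic additive-noise partition selection: add Gaussian noise to
    each partition's user count, then keep only partitions whose noisy
    count clears a threshold. A popular partition almost always survives;
    a rare one almost never does, which protects any single user."""
    released = {}
    for partition, count in counts.items():
        noisy = count + random.gauss(0.0, sigma)
        if noisy >= threshold:
            released[partition] = noisy
    return released
```

Note the clumsiness the authors point out: the threshold must be set high enough that a count of 1 (a single user) essentially never leaks through, which also suppresses genuinely popular partitions near the cutoff.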

2. The Breakthrough: A Smarter "Gatekeeper"

The authors realized that instead of just shaking the scale, they could build a smart gatekeeper that knows exactly how much noise is needed right now to stay safe.

They built on a refined privacy framework called Rényi Differential Privacy (RDP), introduced in earlier work, which tracks privacy loss more precisely than the standard definition.

  • The Analogy: Think of standard privacy as a "one-size-fits-all" raincoat. It keeps you dry, but it's bulky. RDP is like a custom-tailored suit. It fits the specific situation perfectly, allowing you to be more agile (release more useful data) while staying just as dry (private).
  • The Result: They derived the provably optimal algorithm for picking which items to release when each user contributes only one item. It's like finding the exact threshold where you can say, "Yes, this book is popular enough to show," without ever risking a privacy leak.
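For intuition, here is what a non-additive selector looks like: instead of noising the count, it computes a release probability directly from it. The formula below is loosely modeled on the closed-form rule from earlier (ε, δ)-DP work on partition selection; the paper's RDP-optimal rule differs in its exact form, so treat this only as a sketch of the shape.

```python
import math

def release_probability(count, eps=1.0, delta=1e-5):
    """Illustrative non-additive partition selection: the release
    probability grows geometrically with the user count (each extra
    user multiplies it by roughly e^eps, starting near delta for a
    single user) and saturates at 1 once the partition is clearly safe."""
    if count <= 0:
        return 0.0
    p = delta * (math.exp(eps * count) - 1.0) / (math.exp(eps) - 1.0)
    return min(p, 1.0)
```

Because the probability is an explicit function of the count, nothing is wasted hiding information the mechanism never reveals, which is exactly where the non-additive approach gains its edge.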

3. The "SNAPS" Mechanism: Handling Heavy Loaders

What if a user doesn't just check out one book, but a whole stack of 50? The math gets messy.

  • The Solution: They created a new tool called SNAPS (Smooth Norm-Aware Partition Selection).
  • The Analogy: Imagine a bouncer at a club.
    • Old Bouncer: Checks if you have any VIP pass. If you have 50 passes, they get confused and might let you in even if you shouldn't be, or kick you out unfairly.
    • SNAPS Bouncer: Looks at the total weight of your passes. If you have a heavy stack, the bouncer adjusts the "noise" (the security check) smoothly based on how heavy that stack is.
  • Why it matters: They tested this by plugging SNAPS into existing systems (like those used for analyzing Reddit posts or Twitter trends). The result? The systems released 10% to 20% more useful data than before, while staying just as private. It's like getting a bigger, clearer picture of the crowd without compromising anyone's anonymity.
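SNAPS's exact mechanism isn't reproduced here, but the norm-aware ingredient the bouncer analogy describes is standard: bound each user's total contribution by clipping it to a fixed ℓ² norm, so the selection noise can be calibrated to that bound instead of to a worst-case per-item count. A minimal sketch, where the function name and default norm are illustrative assumptions:

```python
import math

def clip_l2(user_weights, max_norm=1.0):
    """Project a user's per-partition contribution vector onto an L2 ball.
    If the user's 'stack of passes' is too heavy (L2 norm > max_norm),
    scale every contribution down proportionally; otherwise leave it
    untouched. After clipping, every user's total influence is bounded."""
    norm = math.sqrt(sum(w * w for w in user_weights.values()))
    if norm <= max_norm:
        return dict(user_weights)
    scale = max_norm / norm
    return {k: w * scale for k, w in user_weights.items()}
```

A user who checks out 50 books doesn't break the math: their 50 small contributions are rescaled so the total weight never exceeds the bound the privacy analysis assumes.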

4. The "Cost of Knowing the Count"

Here is the most surprising finding in the paper.

  • The Scenario: Sometimes, you don't just want to know which books are popular; you also want to publish how many people read them (the count).
  • The Trade-off: The authors proved that if you insist on also releasing those counts, which locks you into the old "additive noise" approach, you pay a utility tax: you have to throw away more data to stay safe.
  • The Metaphor: It's like trying to take a photo of a crowd.
    • If you just want to know who is there (the list), you can take a very sharp, high-definition photo.
    • If you also want to know exactly how many people are in the photo (the count), you have to blur the image significantly to hide individual faces.
  • The Lesson: If you don't need the exact numbers, stop trying to get them. Use the new "non-additive" methods to get a much sharper, more useful list.
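The trade-off can be made concrete through the release curve of an additive-noise selector: the probability that a noised count clears the threshold. The helper below is an illustrative sketch with uncalibrated parameters; the paper's finding is that when the counts must also be published, every curve of this additive kind is fundamentally worse than the best non-additive one.

```python
import math

def gaussian_release_prob(count, sigma, threshold):
    """Probability that count + N(0, sigma^2) clears the threshold,
    i.e. the release curve of a Gaussian additive-noise selector.
    Computed via the Gaussian tail: P(noise >= threshold - count)."""
    z = (threshold - count) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

The same noise that protects the count also flattens this curve, so an additive selector suppresses more borderline-popular partitions than a non-additive one tuned to the same privacy level.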

Summary: Why Should You Care?

This paper is a toolkit for data scientists who want to learn from user data without spying on them.

  1. Optimal Filtering: They derived the provably best way to select which data to release at a fixed privacy budget, so no utility is wasted.
  2. More Data: By using their new "SNAPS" tool, companies can release more accurate insights (like trending topics or popular products) than ever before.
  3. Smarter Choices: They showed that if you don't need exact numbers, you shouldn't use the old, clunky methods. Use the new, sharp tools instead.

In short, they turned a blunt instrument (old privacy filters) into a precision scalpel, allowing us to see more of the world's data without ever seeing the individuals behind it.