Imagine you are a detective trying to figure out the "flavor profile" of a giant, mysterious soup. You can only take a few spoonfuls (a sample) to guess the recipe (the true distribution). Your goal is to write down a new recipe that is as close as possible to the real one.
In the world of statistics, this is called estimating a discrete distribution. The "flavors" are the different classes (like words in a language, or colors in a bag of marbles), and the "recipe" is the probability of each flavor appearing.
The paper by Jaouad Mourtada tackles a very specific, tricky way of measuring how wrong your guess is: Relative Entropy (or Kullback-Leibler divergence).
The Problem: The "Zero" Mistake
Most ways of measuring error (like "Total Variation") are forgiving. If you guess a flavor has a 0% chance of appearing, but it actually appears 1% of the time, the error is small.
But Relative Entropy is a harsh judge. It says: "If you assign a 0% probability to something that actually exists, you are infinitely wrong."
- Analogy: Imagine you are a weather forecaster. If you say there is a 0% chance of rain, but it rains, you are a terrible forecaster. If you say there is a 0% chance of a specific rare bird appearing in your garden, but it does, your prediction is catastrophically bad.
- The Trap: The most obvious way to guess the recipe is to just count what you saw (the "Empirical Distribution"). If you didn't see a flavor in your spoonfuls, you guess it's 0%. But because of the "Zero Mistake" rule, this simple method fails spectacularly in Relative Entropy.
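The "Zero Mistake" is easy to see in code. Here is a minimal sketch (the flavor probabilities and counts are invented for illustration) computing the relative entropy from a true distribution to an empirical estimate that assigned 0% to a rare category:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); infinite if q_i = 0 where p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # by convention, 0 * log(0 / q) = 0
        if qi == 0:
            return math.inf  # the "Zero Mistake": something real got probability 0
        total += pi * math.log(pi / qi)
    return total

true_dist = [0.50, 0.49, 0.01]  # three flavors; the last one is rare
empirical = [0.52, 0.48, 0.00]  # 100 spoonfuls never happened to contain flavor 3

print(kl_divergence(true_dist, empirical))  # inf
```

Under total variation the same guess would be off by only about 0.02; under relative entropy it is infinitely wrong.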
The Classic Fix: Laplace Smoothing (The "Add-One" Rule)
To fix the "Zero Mistake," statisticians use a trick called Laplace Smoothing.
- The Metaphor: Instead of saying "I saw 0 cherries, so there are 0 cherries," you pretend you saw one cherry in every category before you even started counting. You add a "ghost cherry" to every jar.
- The Result: You never guess 0%; every category gets a tiny, non-zero chance. This paper proves that the simple "Add-One" rule is essentially the best you can do if you haven't committed to a specific confidence level in advance.
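The "ghost cherry" trick is a one-liner. A minimal sketch (the counts are made up for illustration):

```python
def laplace_estimate(counts, alpha=1.0):
    """Add-alpha smoothing: pretend every category was seen alpha extra times
    before counting. alpha = 1 is the classic "Add-One" (Laplace) rule."""
    n, d = sum(counts), len(counts)
    return [(c + alpha) / (n + alpha * d) for c in counts]

counts = [52, 48, 0]              # 100 spoonfuls; flavor 3 was never observed
p_hat = laplace_estimate(counts)  # add-one gives [53/103, 49/103, 1/103]
# Every probability is now strictly positive, so the relative entropy
# from the true distribution to p_hat stays finite.
```

The unseen flavor gets probability 1/103 rather than 0, which is exactly what defuses the "Zero Mistake."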
The New Discovery: Confidence Matters
The paper asks a deeper question: Does it matter how "sure" you want to be?
Imagine you are betting on the weather.
- Low Confidence: You just want to be right 90% of the time. The "Add-One" rule works great.
- High Confidence: You want to be right 99.999% of the time. The paper shows that the "Add-One" rule isn't quite good enough here. It leaves a tiny bit of risk.
The Solution: The author introduces "Confidence-Dependent" Smoothing.
- The Analogy: If you only care about being right most of the time, you add 1 ghost cherry. But if you need to be extremely sure (like for a nuclear power plant safety check), you add many more ghost cherries. The more confidence you demand, the more you "smooth out" the data to cover your bases.
- The Catch: This extra safety comes with a tiny cost: a logarithmic factor (a very slow-growing number). It's the price of being ultra-cautious.
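One way to picture confidence-dependent smoothing in code. The specific rule below, alpha growing like log(1/delta) where delta is the allowed failure probability, is an illustrative assumption chosen to match the "logarithmic price of caution" described above, not the paper's exact formula:

```python
import math

def confidence_smoothed_estimate(counts, delta):
    """Add-alpha smoothing where alpha grows as the failure probability
    delta shrinks. alpha = max(1, log(1/delta)) is a hypothetical rule
    for illustration only."""
    alpha = max(1.0, math.log(1.0 / delta))
    n, d = sum(counts), len(counts)
    return [(c + alpha) / (n + alpha * d) for c in counts]

counts = [52, 48, 0]
casual = confidence_smoothed_estimate(counts, delta=0.1)     # right 90% of the time
cautious = confidence_smoothed_estimate(counts, delta=1e-5)  # right 99.999% of the time
# The cautious estimate reserves noticeably more mass for the unseen flavor.
```

Demanding 99.999% confidence only multiplies the smoothing by log(10^5) ≈ 11.5, which is the slow-growing "price of being ultra-cautious."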
The "Sparse" Problem: When the Soup is Mostly Water
In many real-world cases (like language models), the soup is "sparse." Most flavors are rare or non-existent. Imagine a bag of marbles that come in 1,000 possible colors, but only 10 colors ever actually show up.
- The Old Way: The "Add-One" rule treats all 1,000 possible colors equally, wasting smoothing effort on the 990 that never appear.
- The New Way: The paper proposes an Adaptive Estimator.
- The Metaphor: Imagine you are a chef who looks at your spoonfuls. If you see only 5 distinct colors, you realize, "Ah, this soup is simple! I don't need to guess about 1,000 flavors, just these 5."
- The algorithm automatically adjusts how much it "smooths" based on how many unique items it actually saw. If the data is sparse, it becomes very efficient, ignoring the noise of the empty categories.
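The idea above can be sketched as follows. The specific rule, scaling the total "ghost" mass with the number of distinct categories observed rather than with the full alphabet size, is an illustration of the principle, not a reproduction of the paper's estimator:

```python
def adaptive_estimate(counts, alpha=1.0):
    """Smoothing that scales with the number of distinct categories actually
    observed (k), not the full alphabet size (d). Illustrative sketch only."""
    n, d = sum(counts), len(counts)
    k = sum(1 for c in counts if c > 0)  # distinct categories seen
    unseen = d - k
    denom = n + alpha * (k + 1)          # ghost mass grows with k, not d
    return [
        (c + alpha) / denom if c > 0
        else (alpha / max(unseen, 1)) / denom  # one shared ghost for all unseen
        for c in counts
    ]

# 100 samples over a 1,000-category alphabet, only 3 categories observed:
counts = [60, 30, 10] + [0] * 997
p_adaptive = adaptive_estimate(counts)
# Plain add-one would hand the 997 empty categories over 90% of the total
# mass (997/1100); this sketch gives them about 1% combined (1/104).
```

The contrast in the final comment is the whole point: when the data is sparse, smoothing against the full alphabet drowns the signal, while adapting to the observed support keeps it.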
The "Missing Mass" Mystery
A key part of the paper analyzes the "Missing Mass."
- The Concept: This is the total probability of all the flavors you didn't see in your sample.
- The Analogy: You taste 100 scoops of soup and find 5 flavors. What is the total chance that the next scoop contains a flavor you've never seen before?
- The Breakthrough: The paper provides a very sharp, precise formula to predict this "Missing Mass" with high confidence. It tells you exactly how much "unknown territory" you are dealing with, which is crucial for knowing how much you need to "smooth" your guess.
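The classic way to estimate missing mass is the Good-Turing estimator: the chance that the next draw is something never seen before is roughly the fraction of samples that appeared exactly once. The paper's sharp high-confidence bound refines this kind of quantity and is not reproduced here; the sketch below (with invented flavor names) shows only the classic estimate:

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate: the probability of a never-before-seen item
    on the next draw is roughly (# items seen exactly once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

# 100 scoops, 5 flavors, two of which appeared only once:
sample = ["tomato"] * 60 + ["basil"] * 30 + ["garlic"] * 8 + ["cherry", "saffron"]
print(good_turing_missing_mass(sample))  # 0.02
```

Knowing the missing mass is about 2% tells you how much probability your smoothed recipe should set aside for flavors you have never tasted.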
Summary of the Takeaways
- The "Add-One" Rule is Great, but not Perfect: It's the best simple method, but if you need extreme certainty, you need to tweak it.
- Confidence is a Dial: You can tune your estimator based on how much risk you are willing to take. Higher confidence = more smoothing.
- Adapt to the Data: If your data is sparse (few unique items), smart algorithms can ignore the empty categories and perform much better than the old "one-size-fits-all" methods.
- The "Zero" Fear: In Relative Entropy, guessing "zero" is fatal. The paper shows exactly how to avoid this trap while staying as accurate as possible.
In short, this paper takes a classic statistical problem, refines the tools we use to solve it, and gives us a better map for navigating the uncertainty of the unknown.