Here is an explanation of the paper, translated into simple language with creative analogies.
The Big Picture: Estimating a Recipe with Missing Ingredients
Imagine you are a chef trying to figure out the "typical recipe" across thousands of bowls of soup: which mixes of salt, pepper, and herbs are common, and which are rare. In statistics, this is called density estimation: trying to map out the shape of a data distribution.
But here's the catch: your soup is compositional. This means the ingredients must always add up to 100%. If you have more salt, you must have less pepper. In math, this is called the Simplex. It's like a triangle where every point represents a different mix of three ingredients that sum to one.
Now, imagine that while you are tasting the soup, some of your tasters go missing. Maybe they left early, or maybe they forgot to write down their notes. This is Missing Data.
The problem is: Why did they leave?
- If they left completely at random (like a coin flip), it's easy to fix. Statisticians call this Missing Completely at Random (MCAR).
- But in real life, they usually leave for a reason you can see. Maybe the tasters who arrived late, or the ones sitting in the hottest part of the kitchen (variables you can observe), were the ones who slipped out. When the missingness depends only on things you can observe, it's called Missing at Random (MAR), and that is the setting this paper tackles. (If tasters left because the soup itself was too salty, i.e. because of the very thing you wanted to measure, that's the harder Missing Not at Random case.)
If you just ignore the missing tasters and only analyze the ones who stayed, your recipe will be wrong. You might think the soup is perfect because the only people who stayed were the ones who liked it.
The Solution: The "Weighted" Chef
The authors of this paper propose a clever way to fix the recipe without filling in guesses for what the missing tasters would have said (an approach called imputation). Instead, they use a technique called Inverse Probability Weighting (IPW).
Think of it like this:
Imagine you have a list of 100 tasters. 20 of them are missing. You notice that the missing ones were mostly people who arrived late (a variable you can see).
- The paper's method calculates the "probability" that a taster would show up based on their arrival time.
- If a taster who arrived late did show up, the method says, "Hey, you are rare! You represent not just yourself, but also the 4 other late people who didn't show up."
- So, you give that late taster a heavier vote (a weight) in your final recipe calculation.
- If a taster who arrived early shows up, they get a normal vote, because there are plenty of early tasters.
By giving the right people more "weight," you can reconstruct the true flavor of the soup even though some people are missing.
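The weighted voting above can be sketched in a few lines. This is purely illustrative, made-up tasters with a made-up "arrival time" covariate, not the paper's code; the key line is the `1 / P(show up)` weight.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
arrival = rng.uniform(0, 1, n)           # 0 = early, 1 = late (observed covariate)
# MAR mechanism: late arrivals are less likely to be observed.
p_show = 1.0 - 0.6 * arrival             # probability of showing up
observed = rng.uniform(0, 1, n) < p_show
ratings = 5.0 + 2.0 * arrival + rng.normal(0, 0.5, n)   # each taster's verdict

# Naive mean over observed tasters only: biased, because the observed
# sample under-represents late arrivals.
naive = ratings[observed].mean()

# IPW mean: each observed taster is up-weighted by 1 / P(show up),
# so a rare late taster also speaks for the similar tasters who left.
w = 1.0 / p_show[observed]
ipw = np.sum(w * ratings[observed]) / np.sum(w)

print(naive, ipw, ratings.mean())
```

The weight is always at least 1: everyone speaks for themselves, and harder-to-observe tasters speak for a few absent lookalikes as well.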
The Special Tool: The Dirichlet Kernel
Now, how do you actually calculate the recipe? Standard smoothing tools (ordinary kernel density estimators, built for unconstrained numbers like height or weight) don't work well for "recipes," because they don't respect the rule that ingredients must sum to 100%. They might accidentally suggest a soup with 110% ingredients or negative salt.
The authors use a special tool called the Dirichlet Kernel.
- Analogy: Imagine a standard ruler that can measure negative lengths. It's great for a road, but terrible for a recipe where you can't have negative sugar.
- The Dirichlet Kernel is like a "smart ruler" that is shaped exactly like the triangle of possible recipes. It naturally fits inside the triangle. It knows that if you are near the edge (e.g., almost 100% salt), the shape of the data changes, and it adjusts its measurement to stay accurate. It ensures the final recipe always makes sense (non-negative and sums to 1).
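A minimal sketch of the "smart ruler" idea, using the common Dirichlet-kernel construction (kernel parameters `x / b + 1`, where `b` plays the role of a bandwidth). The data and the evaluation points here are synthetic, and the function names are my own, not the paper's.

```python
import numpy as np
from scipy.stats import dirichlet

def dirichlet_kde(x, data, b):
    """Density estimate at composition x (a point in the triangle),
    averaging a Dirichlet kernel centered near x over the data."""
    alpha = np.asarray(x) / b + 1.0          # kernel shaped to the simplex
    return np.mean([dirichlet.pdf(row, alpha) for row in data])

rng = np.random.default_rng(1)
# 200 synthetic "soups": three ingredients that each sum to 1.
data = rng.dirichlet([6.0, 3.0, 1.0], size=200)

# The estimate is high where the data cluster and tiny far away,
# and it is only ever evaluated inside the triangle, so it can
# never propose negative salt or a 110% recipe.
dense_point = np.array([0.6, 0.3, 0.1])      # near the cloud of data
sparse_point = np.array([0.1, 0.1, 0.8])     # far from the data
print(dirichlet_kde(dense_point, data, b=0.05),
      dirichlet_kde(sparse_point, data, b=0.05))
```

Because the kernel is itself a distribution on the triangle, no probability mass ever leaks outside the set of valid recipes, which is exactly what goes wrong for ordinary kernels near the edges.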
The Two-Step Dance
The paper describes a two-step process to get the best result:
Step 1: Guess the "Likelihood of Showing Up."
Since we don't know exactly why people are missing, we have to estimate it. The authors use a statistical "guessing game" (Nadaraya-Watson regression) to look at the people who did show up and figure out the pattern: "Oh, people with high BMI are less likely to have their blood test results recorded." This gives us the weights.
Step 2: The Weighted Recipe.
We take our special "Smart Ruler" (Dirichlet Kernel) and apply it to the data, but we multiply every data point by its "weight" from Step 1. This corrects the bias caused by the missing people.
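The two-step dance, end to end, can be sketched as follows. Everything is synthetic and the names are invented (the paper's actual estimator has more refinements): Step 1 estimates each observed person's probability of being observed from a visible covariate via Nadaraya-Watson smoothing, and Step 2 plugs the inverse of that probability into a weighted Dirichlet-kernel density estimate.

```python
import numpy as np
from scipy.stats import dirichlet, norm

def nw_propensity(z_eval, z, observed, h):
    """Nadaraya-Watson estimate of P(observed | Z = z_eval):
    a kernel-weighted average of the 0/1 'showed up' indicators."""
    k = norm.pdf((z_eval[:, None] - z[None, :]) / h)
    return (k * observed).sum(axis=1) / k.sum(axis=1)

def weighted_dirichlet_kde(x, data, weights, b):
    """IPW-weighted Dirichlet-kernel density estimate at composition x."""
    alpha = np.asarray(x) / b + 1.0
    pdf_vals = np.array([dirichlet.pdf(row, alpha) for row in data])
    return np.sum(weights * pdf_vals) / np.sum(weights)

rng = np.random.default_rng(2)
n = 300
z = rng.normal(size=n)                         # visible covariate (think BMI)
comps = rng.dirichlet([6.0, 3.0, 1.0], size=n) # true compositions
p_true = 1.0 / (1.0 + np.exp(-(1.0 - z)))      # MAR: missingness depends on z
obs = rng.uniform(size=n) < p_true

# Step 1: estimated show-up probabilities for the observed cases.
p_hat = nw_propensity(z[obs], z, obs.astype(float), h=0.3)

# Step 2: weighted recipe, built only from the observed compositions.
f_hat = weighted_dirichlet_kde([0.6, 0.3, 0.1], comps[obs], 1.0 / p_hat, b=0.05)
print(f_hat)
```

Note that the weights come from Step 1 rather than from the (unknown) true probabilities; the paper's point is that this plug-in still corrects the bias under MAR.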
What Did They Find?
The authors ran thousands of computer simulations to test this method.
- The Result: Their method worked better than the old ways of handling this data. The old ways tried to stretch the "recipe triangle" into a flat sheet of paper (using log-ratios) to use standard tools. But this stretching distorts the data near the edges.
- The Winner: Their method kept the data in its natural "triangle" shape and used the weighted voting system. It was more accurate, especially when there was a lot of missing data.
Real-World Example: The Blood Test
To prove it works in the real world, they used data from the NHANES survey (a massive US health study).
- The Data: They looked at white blood cell counts (Neutrophils, Lymphocytes, and Others). These are percentages that must add up to 100%.
- The Problem: Sometimes, the lab loses a sample, so the whole blood count is missing.
- The Application: They used their method to find the "average" immune profile of the population, even though some people's data was missing.
- The Discovery: They found a "mode" (the most common profile): roughly 57% Neutrophils, 32% Lymphocytes, and 11% Others. This tells doctors what a "healthy, typical" immune balance looks like in the general population, even with the missing data.
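Finding a "mode" on the triangle can be done with a simple grid search over valid compositions. The sketch below is illustrative only: it builds a synthetic density peaked near the reported 57/32/11 profile (this is not the authors' data or code) and then locates its highest point on a grid.

```python
import numpy as np
from scipy.stats import dirichlet

def grid_simplex(step=0.01):
    """All grid points (a, b, 1-a-b) strictly inside the triangle."""
    pts = []
    for a in np.arange(step, 1.0, step):
        for b in np.arange(step, 1.0 - a, step):
            c = 1.0 - a - b
            if c > step / 2:
                pts.append((a, b, c))
    return np.array(pts)

# Illustrative smooth density, sharply peaked near the reported profile.
target = np.array([0.57, 0.32, 0.11])
alpha = target * 200 + 1        # Dirichlet whose mode sits exactly at 'target'

grid = grid_simplex(0.01)
vals = np.array([dirichlet.pdf(p, alpha) for p in grid])
mode = grid[np.argmax(vals)]
print(mode)                     # close to (0.57, 0.32, 0.11)
```

In practice one would run the same search over the estimated (weighted) density surface instead of a known Dirichlet, but the mechanics of "scan the triangle, keep the highest point" are the same.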
Summary
In short, this paper teaches us how to:
- Respect the shape of the data (keeping it in the "triangle" of percentages).
- Fix missing data by giving the right people more "votes" based on why they were likely to be missing.
- Get a more accurate picture of the population than older methods that try to force square pegs into round holes.
It's a new, smarter way to listen to the whole choir, even when some singers have left the room.