Shape-constrained density estimation with Wasserstein projection

Here is an explanation of the paper "Shape-Constrained Density Estimation with Wasserstein Projection," translated into simple, everyday language using analogies.

The Big Picture: Fitting a Shape to a Cloud of Dots

Imagine you have a bag of marbles scattered on the floor. These marbles represent data points collected from the real world (like the heights of people, the prices of houses, or the time it takes to commute).

Your goal is to draw a smooth, continuous curve (a "density") that best describes where these marbles are likely to be found. However, you have a rule: The curve must follow a specific shape.

Rule A (Monotone): The curve must never go up as you move to the right (like a slide). It can go down or stay flat, but it can't climb.
Rule B (Log-concave): The curve must be "hill-shaped" (like a bell curve or a mountain). It can't have two separate peaks (like a camel's back).

The paper compares two different ways of drawing this curve to fit your marbles.

The Two Competitors: The "Local" vs. The "Global"

1. The Old Way: Maximum Likelihood (The "Local" Approach)

Think of this as the Grenander Estimator (for Rule A) or the Log-Concave MLE (for Rule B).

How it works: This method looks at your marbles and asks, "Which allowed curve makes the marbles I actually saw as likely as possible?" It tries to make the curve as tall as it can at the spots where the marbles sit, while still obeying the shape rule.
The Metaphor: Imagine you are a tailor trying to fit a suit to a mannequin made of marbles. You stitch the fabric so it hugs every single marble tightly.
The Result: The resulting shape is very "jagged." It changes direction exactly where the marbles are. If you have two marbles at positions -1 and 1, the tailor might draw a flat line strictly between them, ignoring the space outside. It's very precise but can be too rigid.

2. The New Way: Wasserstein Projection (The "Global" Approach)

This is the method the authors are proposing.

How it works: Instead of just counting marbles, this method asks, "How much effort would it take to move my marbles to fit this new shape?" It uses a concept called Optimal Transport (or the Wasserstein distance).
The Metaphor: Imagine your marbles are piles of sand. You want to reshape the sand into a perfect "slide" (monotone) or a "hill" (log-concave).
- The Old Way just says, "Make the pile as tall as possible at the spots where the marbles landed."
- The New Way says, "I want to move the sand into the shape of a hill, but I want to do it with the least amount of physical work." If a pile of sand is far away, moving it costs a lot of energy. The method finds the shape that requires the least "muscle" to transform your messy pile of sand into a perfect hill.
The Key Difference: Because it cares about the distance the sand has to move, it doesn't just hug the marbles. It might spread the sand out a little wider to make the shape smoother and more natural, even if that means the curve doesn't pass directly through every single marble.

What Did the Authors Discover?

The paper proves some cool mathematical facts about this "Global" method:

It's Smoother and Simpler:
- When fitting a Monotone shape (a slide), the new method creates a curve that is made of flat, straight steps (like a staircase).
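- When fitting a Log-Concave shape (a hill), the new method creates a curve that is made of a few simple curved segments joined end to end. (Technically, each segment is a piece of an exponential curve, so the logarithm of the curve is made of straight lines.)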
- Crucially: The "steps" or "arches" don't necessarily start and stop exactly where your marbles are. They might be in between. This makes the shape look more natural and less "pixelated."
It Can Be Wider:
- In a simple example from the paper, if you have marbles at -1 and 1, the Old Way draws a flat-topped hill from -1 to 1.
- The New Way (Wasserstein) draws a flat-topped hill from -1.5 to 1.5.
- Why? Each marble's sand has to be spread over its own half of the hill. A slightly wider hill puts the middle of each half closer to its marble, so the total moving effort is smaller, even though the hill now extends past the marbles themselves.
It's a Convex Problem:
- Mathematically, finding this shape is like rolling a ball down a bowl. No matter where you start, the ball will always roll to the same bottom point (the best solution). This means computers can solve it very reliably.

Why Should You Care?

In the real world, data is often messy.

The Old Way is great if you believe the data is perfect and you want to capture every tiny detail.
The New Way is better if you believe the data is a bit noisy and you want a shape that represents the underlying truth rather than just the specific spots where the data happened to land.

The authors built computer programs (in the R language) to do this. They tested the method on simulated data and showed that the new method tends to produce a "wider" curve than the old one, and that the two methods make different trade-offs when the data doesn't perfectly fit the rules (which happens often in real life).

Summary Analogy

The Data: A pile of sand.
The Goal: Turn the sand into a perfect slide.
Maximum Likelihood (Old): "I will pile the sand exactly where it is, even if it looks bumpy."
Wasserstein Projection (New): "I will move the sand just enough to make a perfect, smooth slide, using the least amount of effort possible."

The paper shows that this "least effort" approach creates beautiful, simple shapes that are mathematically guaranteed to exist and can be calculated easily.

Here is a detailed technical summary of the paper "Shape-Constrained Density Estimation with Wasserstein Projection" by Takeru Matsuda and Ting-Kam Leonard Wong.

1. Problem Statement

The paper addresses the problem of nonparametric shape-constrained density estimation. Given independent samples $X_1, \dots, X_n$ from an unknown distribution $\mu^*$, the goal is to estimate $\mu^*$ using a statistical model $\mathcal{F}$ that encodes specific shape constraints (e.g., monotonicity or log-concavity).

Unlike traditional approaches that rely on Maximum Likelihood Estimation (MLE), which minimizes the Kullback-Leibler (KL) divergence, this paper proposes an alternative estimator based on Optimal Transport (OT). The authors investigate the properties of the Wasserstein projection estimator, defined as the distribution in $\mathcal{F}$ closest to the empirical distribution $\mu_n$ under the $p$-Wasserstein distance ($W_p$).

The study focuses on the univariate setting ($d=1$) with a specific emphasis on the quadratic case ($p=2$), where the geometry of the Wasserstein space simplifies significantly.

2. Methodology

2.1. Wasserstein Projection Framework

The estimator $\hat{\mu}_n$ is defined as the solution to the optimization problem:
$\hat{\mu}_n := \arg\min_{\nu \in \mathcal{F}} W_p(\nu, \mu_n)$
where $\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ is the empirical measure.

To ensure the existence and uniqueness of this projection, the authors impose three conditions on the model $\mathcal{F}$:

Finite Moments: $\mathcal{F} \subset \mathcal{P}_p(\mathbb{R})$ (distributions with finite $p$-th moments).
Closedness: $\mathcal{F}$ is closed with respect to the $W_p$ metric.
Displacement Convexity: $\mathcal{F}$ is a displacement convex subset of the Wasserstein space.

In the univariate case, the $p$-Wasserstein distance coincides with the $L_p$ distance on $(0,1)$ between quantile functions ($Q_\mu$). Specifically, $W_p(\mu, \nu) = \|Q_\mu - Q_\nu\|_{L_p}$. Consequently, the problem of finding the Wasserstein projection reduces to projecting the empirical quantile function $Q_{\mu_n}$ onto the set of quantile functions $\mathcal{Q}_\mathcal{F}$ in the $L_p$ space.
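To make the quantile identity concrete, here is a minimal Python sketch for the special case of two empirical measures with the same number of atoms. In that case both quantile functions are step functions on the same partition of $(0,1)$, so the $L_p$ norm reduces to an average over matched order statistics. The function name `wasserstein_p` is illustrative and is not taken from the paper or its R code.

```python
def wasserstein_p(xs, ys, p=2.0):
    """W_p between two empirical measures on the real line with equally many atoms.

    Uses W_p(mu, nu) = ||Q_mu - Q_nu||_{L_p}: for equal sample sizes the
    quantile functions are constant on the intervals ((i-1)/n, i/n], so the
    integral is an average over sorted pairs.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    n = len(xs)
    # Sorting both samples evaluates the two quantile functions on a common grid.
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (total / n) ** (1.0 / p)


# Shifting every point by 1 costs exactly 1 for any p.
print(wasserstein_p([0.0, 1.0], [1.0, 2.0]))  # 1.0
```

Samples of unequal size would need the quantile functions evaluated on the merged partition of $(0,1)$, which this sketch deliberately leaves out.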

2.2. Focus on $p=2$

The authors restrict their analysis to $p=2$ for two main reasons:

Uniqueness: For $p > 1$, the projection is unique.
Lipschitz Property: When $p=2$, the projection map is 1-Lipschitz with respect to $W_2$. This property is crucial for establishing consistency and convergence rates, as it allows the error of the estimator to be bounded by the distance between the empirical distribution and the true distribution.
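The Lipschitz property can be spelled out in one line. The following is a sketch of the standard argument, assuming (as in the univariate setting above) that the set of quantile functions $\mathcal{Q}_\mathcal{F}$ is a closed convex subset of $L_2$; the symbol $P$ for the metric projection onto $\mathcal{Q}_\mathcal{F}$ is notation introduced here for illustration. Since projection onto a closed convex set in a Hilbert space is non-expansive,

$W_2(\text{proj}_{\mathcal{F}}\mu, \text{proj}_{\mathcal{F}}\nu) = \|P(Q_\mu) - P(Q_\nu)\|_{L_2} \leq \|Q_\mu - Q_\nu\|_{L_2} = W_2(\mu, \nu)$

Taking $\mu = \mu_n$ and $\nu = \mu^*$ gives

$W_2(\hat{\mu}_n, \text{proj}_{\mathcal{F}}\mu^*) \leq W_2(\mu_n, \mu^*)$

and when the model is well specified ($\mu^* \in \mathcal{F}$), the left-hand side is simply $W_2(\hat{\mu}_n, \mu^*)$.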

2.3. Specific Shape Constraints

The paper analyzes two fundamental cases:

Monotone Densities: Densities that are non-increasing on $\mathbb{R}_+ = [0, \infty)$.
Log-Concave Densities: Densities $f$ on $\mathbb{R}$ such that $\log f$ is concave.

3. Key Contributions and Theoretical Results

3.1. Structural Properties of the Estimator

The most significant theoretical contribution is the characterization of the structural form of the estimated density.

Monotone Case (Theorem 3.6):
If the data consists of points in $(0, \infty)$, the Wasserstein projection estimator $\hat{\mu}_n$ has a density that is:
- Compactly supported.
- Piecewise constant with finitely many pieces.
- Contrast with MLE: Unlike Grenander's estimator (the MLE for monotone densities), the support of the Wasserstein estimator is generally not the convex hull of the data, and the "break points" (where the density changes value) do not necessarily coincide with data points.
Log-Concave Case (Theorem 4.7):
If the data is not a single point mass, the estimator $\hat{\mu}_n$ has a density that is:
- Compactly supported.
- Piecewise log-affine (i.e., $\log f$ is piecewise linear) with finitely many pieces.
- Contrast with MLE: Similar to the monotone case, the support is generally wider than the convex hull of the data, and the knots (break points) do not necessarily coincide with the data points.
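For comparison with the contrasts drawn above, Grenander's estimator can be computed as the slopes of the least concave majorant of the empirical distribution function. The following is a minimal Python sketch of that classical construction, not the authors' R code; the function name `grenander` and the output format are illustrative choices. It makes visible that the MLE's break points are data points and that its support ends at the largest observation.

```python
def grenander(sample):
    """Grenander estimator of a non-increasing density on [0, inf).

    Returns a list of (left, right, height) pieces: the density equals
    `height` on the interval (left, right]. Assumes positive observations.
    """
    xs = sorted(sample)
    if not xs or xs[0] <= 0:
        raise ValueError("this sketch assumes a non-empty sample of positive values")
    n = len(xs)
    # Jump points of the empirical distribution function, starting at the origin.
    points = [(0.0, 0.0)] + [(x, (i + 1) / n) for i, x in enumerate(xs)]

    hull = []  # vertices of the least concave majorant, left to right
    for px, py in points:
        while len(hull) >= 2:
            (ax, ay), (bx, by) = hull[-2], hull[-1]
            # Cross product of (b - a) and (p - a); a non-negative value means
            # b lies on or below the chord from a to p, so b is not a vertex.
            cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append((px, py))

    # Slopes between consecutive vertices are the density values.
    return [(ax, bx, (by - ay) / (bx - ax))
            for (ax, ay), (bx, by) in zip(hull, hull[1:])]


for left, right, height in grenander([4.0, 1.0, 3.0]):
    print(left, right, height)
```

In this small example the estimate has two pieces, with a break point at the observation 1 and support ending at the largest observation 4.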

3.2. Consistency and Convergence

Consistency: The estimator is consistent with respect to the $W_2$ distance. Specifically, $W_2(\hat{\mu}_n, \text{proj}_{\mathcal{F}}\mu^*) \to 0$ almost surely as $n \to \infty$.
Convergence Rates:
- For general distributions with finite moments, the rate depends on the tail behavior of $\mu^*$.
- For log-concave true distributions, the paper establishes a parametric convergence rate (up to a logarithmic factor): $E[W_2^2(\hat{\mu}_n, \mu^*)] \leq C \frac{\log n}{n}$. If the distribution is compactly supported, the rate improves to $O(1/n)$.

3.3. Geometric Differences from MLE

The paper highlights that while MLE corresponds to a projection under the KL divergence (information geometry), the Wasserstein projection incorporates the Euclidean geometry of the state space.

Example: For data $\mu_n = \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_1$, the log-concave MLE yields the uniform distribution on $[-1, 1]$. However, the 2-Wasserstein projection yields the uniform distribution on $[-1.5, 1.5]$. The Wasserstein estimator "spreads" the mass further to minimize the transport cost, resulting in a wider support.
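The numbers in this example can be checked with a short Python sketch. It restricts attention to the one-parameter family of uniform distributions on $[-a, a]$, which is an assumption made here for illustration: the code does not search the full log-concave class, it only confirms that $a = 1.5$ is the best half-width within that family. The quantile function of the uniform distribution on $[-a, a]$ is $a(2u - 1)$, and integrating the squared difference from the data quantile function gives $W_2^2 = a^2/3 - a + 1$, which is minimized at $a = 3/2$. The function name and the grid sizes are hypothetical.

```python
def w2_squared_to_two_point(a, m=2000):
    """Midpoint-rule approximation of W_2^2 between Uniform[-a, a]
    and the two-point measure (delta_{-1} + delta_{1}) / 2,
    computed as the squared L_2 distance between quantile functions."""
    total = 0.0
    for k in range(m):
        u = (k + 0.5) / m                      # midpoint of the k-th cell of (0, 1)
        q_uniform = a * (2.0 * u - 1.0)        # quantile function of Uniform[-a, a]
        q_data = -1.0 if u < 0.5 else 1.0      # quantile function of the data
        total += (q_uniform - q_data) ** 2
    return total / m


# Grid search over half-widths a in [1, 2] with step 0.01.
candidates = [k / 100 for k in range(100, 201)]
best_a = min(candidates, key=w2_squared_to_two_point)

print(best_a)                            # 1.5
print(w2_squared_to_two_point(1.0))      # close to 1/3 (the MLE's support)
print(w2_squared_to_two_point(1.5))      # close to 1/4
```

The squared transport cost drops from about $1/3$ for the uniform distribution on $[-1, 1]$ to about $1/4$ on $[-1.5, 1.5]$, which is the sense in which the wider support is cheaper.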

4. Implementation and Numerical Experiments

The authors implement the estimators using discretization of the quantile functions on a grid.

Monotone Case: The problem is formulated as a Quadratic Program (QP) over the values of the quantile function, subject to linear constraints ensuring convexity (for the quantile) and non-negativity.
Log-Concave Case: The problem is formulated as a convex optimization over the values of $h = 1/Q'$, where $h$ must be positive and concave. The objective function involves integrals of the quantile function, which are computed numerically.
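The two formulations above can be written out as follows. This is a sketch of one possible discretization, not necessarily the one used by the authors; the grid $0 = u_0 < u_1 < \dots < u_m < 1$, the unknowns $q_k \approx Q(u_k)$, and the quadrature weights $w_k > 0$ are all assumptions introduced here. For the monotone case, a non-increasing density on $[0, \infty)$ corresponds to a convex quantile function starting at zero, so the slopes between grid points must be non-negative and non-decreasing:

$\min_{q_1, \dots, q_m} \sum_{k=1}^m w_k \left(q_k - Q_{\mu_n}(u_k)\right)^2 \quad \text{subject to} \quad q_0 = 0, \quad 0 \leq \frac{q_1 - q_0}{u_1 - u_0} \leq \frac{q_2 - q_1}{u_2 - u_1} \leq \dots \leq \frac{q_m - q_{m-1}}{u_m - u_{m-1}}$

The objective is quadratic and every constraint is linear in $(q_1, \dots, q_m)$, which is what makes this a QP.

For the log-concave case, differentiating $F(Q(u)) = u$ for a distribution with density $f$, distribution function $F$, and quantile function $Q$ gives $f(Q(u))\, Q'(u) = 1$, so

$h(u) = \frac{1}{Q'(u)} = f(Q(u)), \qquad Q(u) = Q(u_0) + \int_{u_0}^{u} \frac{ds}{h(s)}$

The second identity shows why an objective written in terms of $h$ involves integrals: the quantile function has to be recovered from $h$ by integrating $1/h$.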

Experimental Findings:

The authors compare the Wasserstein projection estimator against the MLE (Grenander's estimator for monotone, standard log-concave MLE for log-concave) using simulated data.
Support: The Wasserstein estimator consistently produces estimates with wider support than the MLE.
Fit: The Wasserstein estimator provides a better fit to the empirical quantile function in the $L_2$ sense, whereas the MLE fits the empirical distribution function (or its convex minorant) more closely.
Misspecification: In misspecified settings (e.g., bimodal true distributions), the Wasserstein estimator offers a different trade-off, often smoothing the density differently than the MLE.

5. Significance and Future Directions

Significance:

New Perspective: This work establishes a rigorous framework for shape-constrained estimation using Optimal Transport, offering a distinct alternative to likelihood-based methods.
Structural Insights: It proves that despite the complexity of the Wasserstein metric, the resulting estimators in 1D retain simple, piecewise structures (piecewise constant or log-affine), making them computationally tractable.
Geometric Interpretation: It clarifies how the underlying geometry (Euclidean vs. Information) affects the shape of the estimator, particularly regarding support and knot locations.

Future Directions (as noted by authors):

Multivariate Extension: Extending these results to $d \geq 2$ is challenging because the space of log-concave distributions is not displacement convex in higher dimensions.
Break Point Analysis: Further investigation into the exact number and location of break points to improve algorithmic efficiency.
Interpolation: Exploring interpolations between Wasserstein and KL projections (e.g., using entropic regularization/Sinkhorn divergence).
Density Properties: Investigating the convergence rates of the estimated density function itself, rather than just the distribution in Wasserstein distance.

In summary, the paper successfully adapts the powerful tools of Optimal Transport to nonparametric density estimation, proving that Wasserstein projection yields well-behaved, structurally simple estimators with unique geometric properties distinct from classical Maximum Likelihood Estimation.