Imagine you are trying to describe a complex 3D object, like a sculpture, to someone who can only see it through a specific, narrow window.
In traditional statistics, scientists usually describe this object by listing the coordinates of the artist's hands while they were sculpting it (the parameters). But here's the problem: sometimes, the artist can move their hands in completely different ways and still end up with the exact same sculpture. If you only look at the hand movements, you get confused. You think the object is changing, but it's actually the same. This is what statisticians call a "singular model": a situation where many different parameter settings produce exactly the same model, so you can't tell what's really going on just by looking at the parameters.
This paper, written by Sean Plummer, proposes a radical new way to look at these confusing models. Instead of watching the artist's hands, let's just look at the sculpture itself.
Here is the breakdown of the paper's ideas using simple analogies:
1. The Old Way: Watching the Hands (Parameter Space)
Traditionally, statisticians analyze models by studying the "parameter space." Think of this as a map of all the possible hand movements an artist could make.
- The Problem: In many modern models (like neural networks or mixture models), the map is messy. You can move your hand left, right, up, or down, and the sculpture doesn't change at all. The map has "dead zones" where movement doesn't matter.
- The Result: When you try to predict how the model learns or behaves, the old math breaks down because it assumes every hand movement changes the result. It's like trying to navigate a city using a map that has extra, fake streets that don't actually exist.
2. The New Way: Looking at the Sculpture (Observable Charts)
Plummer suggests we stop looking at the hands and start looking at the observable features of the sculpture.
- The Analogy: Imagine you can't see the artist, but you can measure the sculpture's height, weight, and the color of its paint. These are observables.
- The "Chart": The paper introduces "Observable Charts." Think of these as a set of measuring tools. If you have enough tools (measuring height, weight, texture, etc.), you can describe the sculpture perfectly without ever knowing how the artist moved their hands.
- The Benefit: This view is "invariant." It doesn't matter if the artist used their left hand, right hand, or a robot arm. If the sculpture looks the same, the measurements are the same. This cuts through the confusion of the "dead zones."
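To make the invariance concrete, here is a tiny toy sketch (my own illustration, not a construction from the paper): a "model" whose observable depends on the parameter only through its square, so two different parameter settings yield one and the same sculpture.

```python
# Toy illustration (not from the paper): the observable depends on the
# parameter w only through w**2. The parameter-space view sees two
# different "hand movements"; the observable view sees one sculpture.

def observable(w):
    """Map from parameter to an observable quantity."""
    return w ** 2

# w = 0.5 and w = -0.5 are different hand movements...
# ...but they produce exactly the same observable value, 0.25.
same_sculpture = observable(0.5) == observable(-0.5)
```

Any measurement built from `observable` alone is automatically blind to the sign of `w`, which is exactly the invariance the "sculpture view" buys you.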
3. The "Invisible" Directions (Singularities)
In these tricky models, some changes are invisible at first glance.
- The Analogy: Imagine a balloon. If you squeeze it gently, it changes shape immediately (this is a regular change). But imagine a balloon that is stuck to a table. If you push it sideways, it doesn't move at all. You have to push harder or push in a specific way before it finally starts to budge.
- The Paper's Insight: In singular models, some directions are like that stuck balloon. If you make a tiny change to the model, the "observable" (the measurement) doesn't change at all. It looks like nothing happened.
- The Solution: The paper introduces a concept called "Observable Order." This is like a sensitivity dial.
- Order 1: You push gently, and the balloon moves. (Standard statistics).
- Order 2: A gentle push does nothing at first; the effect only shows up at second order (the second derivative).
- Order 3: The effect is buried even deeper, appearing only at third order.
- The paper shows that by looking at these higher-order "pushes," we can finally see the hidden structure that the old math missed.
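A numerical caricature of this "sensitivity dial" (my own sketch, not the paper's formal definition): probe a one-parameter family of observables with finite differences and report the first Taylor order at which it actually responds to a push.

```python
import math

def observable_order(f, eps=1e-3, max_order=4, tol=1e-4):
    """Estimate the first order k at which f(t) responds to a push at t = 0,
    i.e. the order of the first nonvanishing derivative, using central
    finite differences. Illustrative only; the tolerances are ad hoc."""
    for k in range(1, max_order + 1):
        # k-th central difference: sum over i of (-1)^(k-i) * C(k, i) * f((i - k/2) * eps)
        diff = sum(
            (-1) ** (k - i) * math.comb(k, i) * f((i - k / 2) * eps)
            for i in range(k + 1)
        )
        if abs(diff / eps ** k) > tol:
            return k
    return None
```

A "regular" direction like `math.sin` responds at order 1, while `t ** 2` and `t ** 3` only budge at orders 2 and 3, mirroring the balloon that is stuck to the table.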
4. The Big Discovery: How Fast Does the Model Learn?
The most important result of the paper is a connection between these "pushes" and how fast a model learns (measured by the Kullback-Leibler divergence, a standard way of quantifying how different two probability distributions are).
- The Rule: The paper proves that the "Observable Order" sets a speed limit.
- If a change is visible immediately (Order 1), the model learns fast (the error drops quickly).
- If a change is hidden and only visible at Order 2, the model learns much more slowly.
- If it's Order 3, it's slower still.
This explains why some complex AI models learn slowly or get stuck. It's not a bug; it's a geometric feature of the sculpture itself. The "stuck" directions take longer to reveal themselves.
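A closed-form toy (my own, using unit-variance Gaussians rather than the paper's general setting) makes the speed limit tangible: when a parameter enters the observable only through its square, an order-2 direction, the KL divergence of a small perturbation drops from the scale of eps**2 to eps**4.

```python
def kl_gauss(mu0, mu1):
    """KL( N(mu0, 1) || N(mu1, 1) ) for unit-variance Gaussians: (mu0 - mu1)^2 / 2."""
    return (mu0 - mu1) ** 2 / 2

eps = 0.1
# Order-1 direction: the mean moves linearly with the parameter.
kl_order1 = kl_gauss(eps, 0.0)       # eps**2 / 2 = 0.005
# Order-2 direction: the mean moves only quadratically (e.g. mean = theta**2),
# so the same size of parameter push is far harder to see in the data.
kl_order2 = kl_gauss(eps ** 2, 0.0)  # eps**4 / 2 = 0.00005
```

The ratio `kl_order2 / kl_order1` is `eps ** 2`: the higher the order, the flatter the landscape around the truth, and the longer learning takes.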
5. Real-World Examples
The paper tests this idea on two common scenarios:
- Gaussian Mixtures (Clustering): Imagine trying to find two groups of people in a crowd. If the groups are identical, you can't tell them apart. The paper shows that you need to look at the "skewness" (the tilt) of the crowd to tell them apart, not just the average position.
- Neural Networks: In a neural network, sometimes a neuron is "dead" (it outputs zero). The paper shows that a tweak to that dead neuron's settings can't be detected just by looking at the output once. You have to look at how the output changes when you tweak it slightly, and then tweak it again, to see the hidden structure.
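For the clustering example, here is a small moment calculation (a standard textbook computation, not code from the paper) showing why skewness is the first statistic that separates a slightly-split, asymmetric mixture from a single Gaussian matched on mean and variance.

```python
def mixture_moments(weights, means):
    """Mean, variance, and third central moment of a mixture of
    unit-variance Gaussians with the given weights and means."""
    mu = sum(w * m for w, m in zip(weights, means))
    var = sum(w * ((m - mu) ** 2 + 1) for w, m in zip(weights, means))
    m3 = sum(w * ((m - mu) ** 3 + 3 * (m - mu)) for w, m in zip(weights, means))
    return mu, var, m3

a = 0.1  # small split between the two groups
mu, var, m3 = mixture_moments([0.75, 0.25], [a, -3 * a])
# mean is exactly 0 and variance is 1 + 3*a**2, so a single Gaussian
# N(0, 1 + 3*a**2) matches both; but its third central moment is 0,
# while the mixture's is -6*a**3 -- the tilt is what gives the split away.
```

The distinguishing signal, `-6 * a ** 3`, is cubic in the split, which is the order-3 behavior the paper's "observable order" is designed to capture.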
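And for the dead-neuron example, a minimal one-neuron sketch (my own toy, not the paper's network) of how parameter tweaks vanish behind a ReLU:

```python
def relu(z):
    return max(0.0, z)

def net(w, v, x):
    """One-hidden-neuron network: output = v * relu(w * x)."""
    return v * relu(w * x)

x = 1.0
w, v = -0.5, 2.0             # pre-activation w * x < 0: the neuron is "dead"
base = net(w, v, x)          # 0.0
# Tweaking v, or nudging w while w * x stays negative, changes nothing:
tweaked = net(w - 0.01, v + 0.3, x)   # still 0.0 -- the tweak is invisible
# Only a push large enough to flip the sign of w * x shows up in the output:
awake = net(0.2, v, x)                # 0.4
```

In the flat region every first-order probe of the output returns zero; the structure only reveals itself once the push crosses the kink, which is the "stuck balloon" in miniature.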
Summary: Why This Matters
This paper is like giving statisticians a new pair of glasses.
- Old Glasses: Focused on the inputs (the parameters). They got blurry when the inputs were redundant.
- New Glasses: Focus on the outputs (the observables). They remain sharp even when the inputs are messy.
By focusing on what we can actually see and measure (the data distribution) rather than how the model is built (the parameters), we get a clearer, more honest picture of how complex models behave. It unifies the math for simple models and the confusing, "singular" models used in modern AI, showing that they are all just different levels of the same geometric landscape.
In a nutshell: Don't ask "How did the artist move their hand?" Ask "What does the sculpture look like?" and measure how hard you have to push to see it change. That tells you everything you need to know about how the model learns.