Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing

This paper establishes identifiability results for causal representation learning when potentially degenerate Gaussian mixture latent variables are observed through a piecewise affine mixing function, and proposes a two-stage method that leverages sparsity and Gaussianity to recover the ground-truth latent variables.

Danru Xu, Sébastien Lachapelle, Sara Magliacane

Published 2026-04-16

Imagine you are a detective trying to solve a mystery, but you don't have the suspects in front of you. Instead, you only have a pile of blurry, distorted photographs. Your goal is to figure out exactly who the people in the photos are, what they look like, and how they are related to each other.

This is the core problem of Causal Representation Learning: trying to find the hidden "real world" causes (the latent variables) behind the messy data we see (the observations).

This paper tackles a very specific, tricky version of this mystery. Here is the breakdown in simple terms:

1. The Mystery: The "Broken" Clues

Usually, scientists assume that hidden variables are like smooth, round balloons (Gaussian distributions). But in the real world, things are often "flat" or "broken."

  • The Analogy: Imagine trying to describe a 3D object, but sometimes the object is just a flat sheet of paper, or even a single line. In math, these are called degenerate distributions. They are "broken" because they don't have a standard volume; they are squashed.
  • The Problem: Most detective tools (mathematical formulas) break when they encounter these flat, squashed objects. They rely on the object having a "thickness" (a probability density), which these flat objects lack.
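To make the "flat balloon" concrete, here is a small NumPy sketch (an illustration, not code from the paper) of a degenerate Gaussian: samples in 2D that all lie on a single line, so the covariance matrix is singular and the usual density formula, which divides by its determinant, breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "degenerate" Gaussian in 2D: all the mass lies on a 1D line,
# so the 2x2 covariance matrix is singular and no density exists.
A = np.array([[1.0], [2.0]])    # rank-1 map: 1 latent dim -> 2 observed dims
mu = np.array([0.5, -0.5])

z = rng.standard_normal((10_000, 1))   # 1D standard Gaussian source
x = z @ A.T + mu                       # 2D samples squashed onto a line

cov = np.cov(x, rowvar=False)
print(np.linalg.matrix_rank(cov))      # rank 1, not 2: the covariance is singular
print(np.linalg.det(cov))              # ~0: the density formula would divide by this
```

This is exactly the "thickness" problem: any method that needs a well-defined probability density in 2D has nothing to work with here.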

2. The Twist: The "Piecewise" Mirror

The paper also assumes the camera distorting the photos isn't just a simple blur. It's a Piecewise Affine Mixing Function.

  • The Analogy: Imagine looking at the world through a funhouse mirror made of many different flat glass panels. Some panels stretch the image, some shrink it, some flip it. But once you cross the line from one panel to another, the distortion changes completely. It's not a smooth curve; it's a jagged, step-by-step transformation.
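A standard way to build a piecewise affine map is a neural network with leaky-ReLU activations. The sketch below (a generic stand-in, not the paper's exact mixing function) shows the "glass panel" behavior: inside one activation region the map has a constant Jacobian, i.e. it is exactly one affine map there.

```python
import numpy as np

def leaky_relu(u, slope=0.2):
    return np.where(u > 0, u, slope * u)

# A tiny leaky-ReLU network: each activation pattern picks one "glass
# panel", i.e. one affine map that holds exactly inside that region.
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def mix(z):
    return leaky_relu(z @ W1.T + b1) @ W2.T + b2

def jacobian(z, eps=1e-6):
    # Finite-difference Jacobian; exact inside one affine region.
    return np.stack([(mix(z + eps * e) - mix(z)) / eps for e in np.eye(2)], axis=1)

# Two nearby points in the same panel see the SAME affine map,
# so their Jacobians agree; the map only jumps at panel boundaries.
z0 = np.array([0.3, 0.4])
print(np.allclose(jacobian(z0), jacobian(z0 + 1e-5)))
```

The jagged boundaries between panels are precisely what rules out the smooth-function tools used by earlier identifiability proofs.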

3. The Solution: The "Sparse" Detective

The authors ask: Can we still solve the mystery if the objects are flat and the mirror is jagged?
Their answer is Yes, but they need a special trick: Sparsity.

  • The Analogy: Think of a "sparse" object as a skeleton. It has bones (active parts) but no flesh (inactive parts). In many real-world scenarios (like language or images), only a few things are "active" at any given time.
  • The Trick: The authors realized that if they force their AI to find a solution that is "sparse" (keeping the skeleton simple and ignoring the noise), they can mathematically prove that they have found the only correct solution.
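The sparsity trick can be seen in miniature: a rotation of sparse latents explains the same data equally well under a linear model, but it smears the zeros across coordinates and raises the L1 norm, so preferring the sparsest solution picks out the original axes. A hypothetical NumPy illustration (the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Ground-truth latents: a "sparse skeleton" -- most entries are exactly 0.
z = rng.standard_normal((1000, 5)) * (rng.random((1000, 5)) < 0.2)

# A random rotation of z is an equally valid linear explanation of the
# data, but it destroys the zeros and inflates the average L1 norm.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal matrix
z_rot = z @ Q

print(np.abs(z).mean(), np.abs(z_rot).mean())  # sparse solution is smaller
```

This is the intuition behind the proof: among all candidate solutions, only the ground-truth one keeps the skeleton this simple.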

4. The Three-Step Investigation

The paper proves that they can identify the hidden variables in three stages of increasing clarity:

  • Stage 1: The "Local" Map (Affine within components)

    • What it means: They can figure out the shape of the flat objects, but they might be rotated or stretched differently in different parts of the picture.
    • Analogy: They know the suspects are there, but in one part of the photo, Suspect A looks like a tall, thin giant, and in another part, they look like a short, wide dwarf. They know it's the same person, but the "rules" change depending on where you look.
  • Stage 2: The "Global" Map (Affine everywhere)

    • What it means: They prove that if all the flat objects share a common "skeleton structure" (a common basis), the distortion rules are actually the same everywhere.
    • Analogy: They realize the funhouse mirror isn't actually changing the rules randomly. It's just one consistent set of rules applied to the whole room. Now they can map the whole picture consistently.
  • Stage 3: The "Perfect" Map (Permutation and Scaling)

    • What it means: This is the holy grail. They prove that if the "skeleton" is sparse enough, they can identify exactly which variable is which, up to swapping their labels (permutation) and rescaling them (scaling).
    • Analogy: They finally put on their glasses and say, "That's definitely the guy in the red hat, and that's the girl with the blue scarf." They have perfectly disentangled the mess.
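What "identifiable up to permutation and scaling" means can be checked numerically: if the recovered latents are a reordered, rescaled copy of the true ones, the matrix of absolute correlations between them is a permutation matrix. A small sketch (the permutation and scalings are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
z_true = rng.standard_normal((2000, 3))

# "Up to permutation and scaling": the recovered latents may be reordered
# and rescaled (sign flips included), but each matches exactly one truth.
P = np.eye(3)[[2, 0, 1]]          # a permutation of the three variables
D = np.diag([0.5, -2.0, 3.0])     # per-coordinate scalings
z_hat = z_true @ (P @ D)

# Absolute correlation between each recovered and each true latent:
C = np.abs(np.corrcoef(z_hat.T, z_true.T)[:3, 3:])
print(C.round(2))  # one entry near 1.0 per row/column: a scaled permutation
```

In disentanglement benchmarks, scores like the mean correlation coefficient are built from exactly this kind of matrix.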

5. The Experiment: From Math to Reality

The authors didn't just do math on paper; they built a two-stage AI system to test this.

  1. Stage 1: They taught an AI to compress the blurry photos into a hidden code and reconstruct them from it, forcing it to learn the "flat" structures.
  2. Stage 2: They added a "sparsity penalty" (a rule that says "keep it simple") to force the AI to untangle the variables.
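The two-stage recipe can be caricatured with a linear autoencoder in NumPy: first train purely for reconstruction, then switch on an L1 "keep it simple" penalty on the inferred latents. This is a toy sketch under strong simplifying assumptions (linear maps, hand-coded gradient steps), not the paper's actual neural architecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: sparse 2D latents passed through an (unknown) linear mixing.
z = rng.standard_normal((500, 2)) * (rng.random((500, 2)) < 0.3)
A_true = np.array([[1.0, 0.6], [0.2, 1.0]])
x = z @ A_true.T

# Encoder W and decoder A, trained by plain gradient descent.
W = rng.standard_normal((2, 2)) * 0.1
A = rng.standard_normal((2, 2)) * 0.1
lam, lr = 0.0, 0.05
for step in range(4000):
    if step == 2000:
        lam = 0.01                     # stage 2: turn on the sparsity penalty
    z_hat = x @ W.T                    # inferred latents
    x_hat = z_hat @ A.T                # reconstruction
    err = x_hat - x
    grad_z = err @ A + lam * np.sign(z_hat)   # reconstruction + L1 gradient
    W -= lr * (grad_z.T @ x) / len(x)
    A -= lr * (err.T @ z_hat) / len(x)

mse = np.mean((x @ W.T @ A.T - x) ** 2)
print(mse)  # reconstruction stays good despite the sparsity pressure
```

The design mirrors the paper's logic: reconstruction alone pins the solution down only up to an invertible transformation; the sparsity penalty is what breaks that remaining ambiguity.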

The Results:

  • Synthetic Data: They created fake data with flat, broken shapes and jagged mirrors. Their method recovered the hidden variables far more accurately than previous methods.
  • Image Data: They used a dataset of moving balls. Sometimes a ball would stop moving (becoming "flat" or degenerate). Their method successfully figured out the position of every ball, even when some were frozen in place.

Why This Matters

In the real world, data is rarely perfect. Objects often have hidden structures that are "flat" (like a 2D pattern on a 3D surface) or "sparse" (only a few features matter).

  • Previous methods failed when data was "broken" or "flat."
  • This paper provides a mathematical guarantee that even with broken, flat data and jagged distortions, we can still find the truth—if we assume the truth is sparse.

In a nutshell: The authors found a way to solve a puzzle that everyone thought was broken, by realizing that the "broken" pieces actually fit together perfectly if you look for the empty spaces (sparsity) between them.
