Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

This paper demonstrates that standard metrics for evaluating representation identifiability are often misspecified due to hidden assumptions about data-generating processes and encoder geometry, leading to systematic errors, and proposes a taxonomy and evaluation suite to address these limitations.

Shruti Joshi, Théo Saulus, Wieland Brendel, Philippe Brouillard, Dhanya Sridhar, Patrik Reizinger

Published 2026-03-02

Imagine you are a detective trying to solve a mystery. You have a set of clues (the data) and you've built a theory about what caused them (the "learned representation"). To prove your theory is good, you need a way to measure how well your theory matches the real truth.

In the world of AI, this is called identifiability. Researchers build AI models to find the hidden "factors" behind data (like identifying that a picture of a cat is made of "fur," "ears," and "tail" rather than just a blob of pixels).

For years, scientists have used specific rulers (metrics) to measure how well these AI models are doing. This paper asks a very important question: "Who guards the guardians?" In other words: Are these rulers actually measuring what they claim to measure, or are they broken?

The authors, Shruti Joshi and her team, discovered that most of these rulers are broken. They don't just measure the quality of the AI; they accidentally measure other things too, like the shape of the data or how many samples you have.

Here is the breakdown using simple analogies:

1. The Broken Rulers (The Metrics)

The paper tests three popular "rulers" used by AI researchers: MCC (the Mean Correlation Coefficient), DCI (Disentanglement, Completeness, Informativeness), and R². The authors find that each one fails in specific, predictable ways.

  • The "Confused Correlation" Ruler (MCC):

    • The Analogy: Imagine you are trying to see if a student learned the material. But instead of testing them, you just look at how much they and their best friend talk to each other. If the two friends talk a lot (high correlation), you assume the student is smart, even if they learned nothing!
    • The Problem: This ruler gets tricked when the hidden factors in the data are related to each other. If the "temperature" and "humidity" in your data are naturally linked, this ruler thinks the AI did a great job, even if the AI is completely confused. It gives a False Positive (saying "Good job!" when the work is bad).
  • The "Over-Critical" Ruler (DCI):

    • The Analogy: Imagine a teacher who is so strict that if a student uses two pens to write one sentence, they get a zero. But the student actually wrote the sentence perfectly!
    • The Problem: This ruler hates it when information is spread out. If an AI encodes a single concept using two different numbers (which is actually a smart way to do it), this ruler panics and says, "You failed!" It gives a False Negative (saying "You failed!" when the work is actually good).
  • The "Sample Size" Trap (All Rulers, but especially MCC):

    • The Analogy: Imagine flipping a coin. If you flip it 5 times, you might get 5 heads by pure luck. If you flip it 1,000 times, you'll get close to 50/50.
    • The Problem: In modern AI, we often have huge models (many "coins") but relatively few data points (few "flips"). The paper shows that if your model is too big compared to your data, these rulers will find "patterns" that don't exist. It's like the ruler is hallucinating a perfect score just because you didn't give it enough data to prove it wrong.
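The sample-size trap above is easy to see in code. Here is a minimal sketch of an MCC-style score (not the paper's exact implementation; the `mcc` helper, dimensions, and seed are illustrative choices): a completely useless random encoder scores high when samples are scarce, and near zero when they are plentiful.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(z, h):
    """MCC-style score: correlate each learned dimension with each true
    factor, then average |corr| over the best one-to-one matching."""
    d = z.shape[1]
    # |correlation| between true factors (cols of z) and learned dims (cols of h)
    corr = np.abs(np.corrcoef(z, h, rowvar=False)[:d, d:])
    rows, cols = linear_sum_assignment(-corr)  # maximize total |corr|
    return corr[rows, cols].mean()

rng = np.random.default_rng(0)
d = 20  # number of hidden factors / learned dimensions (illustrative)

# A "useless" encoder: its output is pure noise, independent of the factors.
z_small = rng.standard_normal((30, d));     h_small = rng.standard_normal((30, d))
z_large = rng.standard_normal((30_000, d)); h_large = rng.standard_normal((30_000, d))

mcc_small = mcc(z_small, h_small)  # few samples:  spuriously high
mcc_large = mcc(z_large, h_large)  # many samples: near zero, as it should be
print(f"MCC with n=30:     {mcc_small:.2f}")
print(f"MCC with n=30000:  {mcc_large:.2f}")
```

With only 30 samples and 20 dimensions, the best-matching step finds strong "correlations" by pure chance; with 30,000 samples the same random encoder scores close to zero.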

2. The Two Axes of Failure

The authors created a map to explain why these rulers fail. They say you have to look at two things:

  1. The Data's Personality (DGP, the data-generating process): Is the data messy and correlated? Are some factors just copies of others (redundant)?
  2. The AI's Shape (Encoder): Did the AI output the right number of answers? Did it mix the answers together?

The Big Revelation:
Every ruler is only accurate in a tiny, specific corner of this map.

  • If your data is messy, use Ruler A.
  • If your AI is huge, use Ruler B.
  • If you use the wrong ruler for the situation, you get a lie.
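The "messy, correlated data" failure on the first axis can be sketched in a few lines (a hypothetical setup, not an experiment from the paper): when two true factors are near-copies of each other, an encoder that has collapsed them into a single signal still correlates strongly with both, so any best-match score looks excellent.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Two true factors that are almost copies of each other (correlation ~0.9):
shared = rng.standard_normal(n)
z = np.column_stack([shared + 0.3 * rng.standard_normal(n),
                     shared + 0.3 * rng.standard_normal(n)])

# A *confused* encoder that collapsed both factors into one shared signal:
h = np.column_stack([shared, shared + 1e-3 * rng.standard_normal(n)])

# |correlation| between every true factor and every learned dimension:
corr = np.abs(np.corrcoef(z, h, rowvar=False)[:2, 2:])
print(np.round(corr, 2))  # every entry is high, so matching looks "perfect"
```

Every entry of the correlation matrix is large, so a metric that picks the best match per factor reports success even though the encoder lost a whole dimension.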

3. The "Circuit" Example

To make this concrete, the authors used an example of an electrical circuit.

  • The Truth: Voltage, Current, and Resistance are linked by Ohm's Law (V = I × R). You can't change one without affecting the others.
  • The Trap: If an AI learns to predict Voltage, Current, and Resistance separately, it might look like it "disentangled" them. But because they are mathematically linked, the AI might just be memorizing the formula.
  • The Failure: Some rulers think the AI is brilliant because it predicted the numbers right. Others think it failed because it didn't treat them as separate, independent things. The paper argues that no current ruler can tell the difference between a smart AI and a lucky guess in these complex scenarios.
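A toy version of this trap (illustrative, not the paper's experiment): generate current and resistance freely, let Ohm's law fix the voltage, and the R² for "predicting" voltage is perfect, even though the predictor only ever uses the other two factors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Only two factors are free; the third is fixed by Ohm's law V = I * R.
I = rng.uniform(0.1, 2.0, n)     # current (A)
R = rng.uniform(1.0, 100.0, n)   # resistance (ohm)
V = I * R                        # voltage (V), fully determined by I and R

# "Predict" log V from [log I, log R] with ordinary least squares:
X = np.column_stack([np.log(I), np.log(R), np.ones(n)])
y = np.log(V)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 for voltage: {r2:.4f}")  # perfect, via the constraint alone
```

The regression recovers log V = log I + log R exactly, so an R²-style check cannot tell whether a model learned "voltage" as a factor or simply exploited the formula linking the other two.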

4. What Should You Do? (The Takeaway)

The paper doesn't just complain; it gives a "Practitioner's Checklist" for anyone building AI:

  1. Don't trust a single number. Never look at just one score (like MCC) and say, "Great, my AI is identifiable!"
  2. Check your ratios. If your model is huge but your data is small, your scores are likely fake. You need way more data than you think.
  3. Run a "Null Test." Before you trust your results, run your ruler on a completely random, useless AI. If the ruler gives the random AI a high score, your ruler is broken and you can't trust it for your real AI.
  4. Know your data. If your data has factors that are naturally linked (correlated), you must pick a ruler that handles that, or you will get fooled.
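Checklist item 3, the "Null Test," can be wrapped into a small harness (a sketch; `naive_score` is a hypothetical stand-in metric, not one from the paper): feed the metric encoders that output pure noise and check that it reports near-chance scores before trusting it on your real model.

```python
import numpy as np

def null_test(metric, z, rep_dim, n_trials=20, seed=0):
    """Score `metric` against encoders that output pure noise. If these
    null scores are high, the metric cannot be trusted on a real encoder."""
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    scores = [metric(z, rng.standard_normal((n, rep_dim)))
              for _ in range(n_trials)]
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical stand-in metric: mean |correlation| between paired dimensions.
def naive_score(z, h):
    d = min(z.shape[1], h.shape[1])
    c = np.corrcoef(z, h, rowvar=False)[:d, z.shape[1]:z.shape[1] + d]
    return float(np.abs(np.diag(c)).mean())

z = np.random.default_rng(42).standard_normal((100, 5))  # true factors
mean, std = null_test(naive_score, z, rep_dim=5)
print(f"null score: {mean:.2f} +/- {std:.2f}")  # should sit near zero
```

If the null scores come back far from zero, the metric is "hallucinating" structure, and any score it gives your trained model is suspect.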

Summary

The paper is a wake-up call. It tells us that the tools we use to judge AI "understanding" are often flawed. They are like thermometers that give the wrong temperature if the room is windy or if you hold them too close to your hand.

The bottom line: We need to stop blindly trusting these scores. We need to understand the "structural conditions" (the shape of the data and the model) before we can trust that the AI has actually learned the truth, rather than just memorizing a trick.
