Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations

This paper demonstrates that standard metrics for evaluating representation identifiability are often misspecified due to hidden assumptions about data-generating processes and encoder geometry, leading to systematic errors, and proposes a taxonomy and evaluation suite to address these limitations.

Shruti Joshi, Théo Saulus, Wieland Brendel, Philippe Brouillard, Dhanya Sridhar, Patrik Reizinger

Published 2026-03-02

Imagine you are a detective trying to solve a mystery. You have a set of clues (the data) and you've built a theory about what caused them (the "learned representation"). To prove your theory is good, you need a way to measure how well your theory matches the real truth.

In the world of AI, this is called identifiability. Researchers build AI models to find the hidden "factors" behind data (like identifying that a picture of a cat is made of "fur," "ears," and "tail" rather than just a blob of pixels).

For years, scientists have used specific rulers (metrics) to measure how well these AI models are doing. This paper asks a very important question: "Who guards the guardians?" In other words: Are these rulers actually measuring what they claim to measure, or are they broken?

The authors, Shruti Joshi and her team, discovered that most of these rulers are broken. They don't just measure the quality of the AI; they accidentally measure other things too, like the shape of the data or how many samples you have.

Here is the breakdown using simple analogies:

1. The Broken Rulers (The Metrics)

The paper tests three popular "rulers" used by AI researchers: MCC (the Mean Correlation Coefficient), DCI (Disentanglement, Completeness, Informativeness), and R². The authors find that each one fails in specific, predictable ways.

  • The "Confused Correlation" Ruler (MCC):

    • The Analogy: Imagine you are trying to see if a student learned the material. But instead of testing them, you just look at how much they and their best friend talk to each other. If the two friends talk a lot (high correlation), you assume the student is smart, even if they learned nothing!
    • The Problem: This ruler gets tricked when the hidden factors in the data are related to each other. If the "temperature" and "humidity" in your data are naturally linked, this ruler thinks the AI did a great job, even if the AI is completely confused. It gives a False Positive (saying "Good job!" when the work is bad).
  • The "Over-Critical" Ruler (DCI):

    • The Analogy: Imagine a teacher who is so strict that if a student uses two pens to write one sentence, they get a zero. But the student actually wrote the sentence perfectly!
    • The Problem: This ruler hates it when information is spread out. If an AI encodes a single concept using two different numbers (which is actually a smart way to do it), this ruler panics and says, "You failed!" It gives a False Negative (saying "You failed!" when the work is actually good).
  • The "Sample Size" Trap (All Rulers, but especially MCC):

    • The Analogy: Imagine flipping a coin. If you flip it 5 times, you might get 5 heads by pure luck. If you flip it 1,000 times, you'll get close to 50/50.
    • The Problem: In modern AI, we often have huge models (many "coins") but relatively few data points (few "flips"). The paper shows that if your model is too big compared to your data, these rulers will find "patterns" that don't exist. It's like the ruler is hallucinating a perfect score just because you didn't give it enough data to prove it wrong.
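The sample-size trap above is easy to see in code. Here is a minimal sketch of an MCC-style score (not the paper's exact implementation; the `mcc` helper, dimensions, and seed are illustrative choices): a completely useless random encoder scores high when samples are scarce, and near zero when they are plentiful.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(z, h):
    """MCC-style score: correlate each learned dimension with each true
    factor, then average |corr| over the best one-to-one matching."""
    d = z.shape[1]
    # |correlation| between true factors (cols of z) and learned dims (cols of h)
    corr = np.abs(np.corrcoef(z, h, rowvar=False)[:d, d:])
    rows, cols = linear_sum_assignment(-corr)  # maximize total |corr|
    return corr[rows, cols].mean()

rng = np.random.default_rng(0)
d = 20  # number of hidden factors / learned dimensions (illustrative)

# A "useless" encoder: its output is pure noise, independent of the factors.
z_small = rng.standard_normal((30, d));     h_small = rng.standard_normal((30, d))
z_large = rng.standard_normal((30_000, d)); h_large = rng.standard_normal((30_000, d))

mcc_small = mcc(z_small, h_small)  # few samples:  spuriously high
mcc_large = mcc(z_large, h_large)  # many samples: near zero, as it should be
print(f"MCC with n=30:     {mcc_small:.2f}")
print(f"MCC with n=30000:  {mcc_large:.2f}")
```

With only 30 samples and 20 dimensions, the best-matching step finds strong "correlations" by pure chance; with 30,000 samples the same random encoder scores close to zero.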

2. The Two Axes of Failure

The authors created a map to explain why these rulers fail. They say you have to look at two things:

  1. The Data's Personality (DGP, the data-generating process): Is the data messy and correlated? Are some factors just copies of others (redundant)?
  2. The AI's Shape (Encoder): Did the AI output the right number of answers? Did it mix the answers together?

The Big Revelation:
Every ruler is only accurate in a tiny, specific corner of this map.

  • If your data is messy, use Ruler A.
  • If your AI is huge, use Ruler B.
  • If you use the wrong ruler for the situation, you get a lie.
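The "messy, correlated data" failure on the first axis can be sketched in a few lines (a hypothetical setup, not an experiment from the paper): when two true factors are near-copies of each other, an encoder that has collapsed them into a single signal still correlates strongly with both, so any best-match score looks excellent.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Two true factors that are almost copies of each other (correlation ~0.9):
shared = rng.standard_normal(n)
z = np.column_stack([shared + 0.3 * rng.standard_normal(n),
                     shared + 0.3 * rng.standard_normal(n)])

# A *confused* encoder that collapsed both factors into one shared signal:
h = np.column_stack([shared, shared + 1e-3 * rng.standard_normal(n)])

# |correlation| between every true factor and every learned dimension:
corr = np.abs(np.corrcoef(z, h, rowvar=False)[:2, 2:])
print(np.round(corr, 2))  # every entry is high, so matching looks "perfect"
```

Every entry of the correlation matrix is large, so a metric that picks the best match per factor reports success even though the encoder lost a whole dimension.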

3. The "Circuit" Example

To make this concrete, the authors used an example of an electrical circuit.

  • The Truth: Voltage, Current, and Resistance are linked by Ohm's Law (V = I × R). You can't change one without affecting the others.
  • The Trap: If an AI learns to predict Voltage, Current, and Resistance separately, it might look like it "disentangled" them. But because they are mathematically linked, the AI might just be memorizing the formula.
  • The Failure: Some rulers think the AI is brilliant because it predicted the numbers right. Others think it failed because it didn't treat them as separate, independent things. The paper argues that no current ruler can tell the difference between a smart AI and a lucky guess in these complex scenarios.
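A toy version of this trap (illustrative, not the paper's experiment): generate current and resistance freely, let Ohm's law fix the voltage, and the R² for "predicting" voltage is perfect, even though the predictor only ever uses the other two factors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Only two factors are free; the third is fixed by Ohm's law V = I * R.
I = rng.uniform(0.1, 2.0, n)     # current (A)
R = rng.uniform(1.0, 100.0, n)   # resistance (ohm)
V = I * R                        # voltage (V), fully determined by I and R

# "Predict" log V from [log I, log R] with ordinary least squares:
X = np.column_stack([np.log(I), np.log(R), np.ones(n)])
y = np.log(V)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 for voltage: {r2:.4f}")  # perfect, via the constraint alone
```

The regression recovers log V = log I + log R exactly, so an R²-style check cannot tell whether a model learned "voltage" as a factor or simply exploited the formula linking the other two.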

4. What Should You Do? (The Takeaway)

The paper doesn't just complain; it gives a "Practitioner's Checklist" for anyone building AI:

  1. Don't trust a single number. Never look at just one score (like MCC) and say, "Great, my AI is identifiable!"
  2. Check your ratios. If your model is huge but your data is small, your scores are likely fake. You need way more data than you think.
  3. Run a "Null Test." Before you trust your results, run your ruler on a completely random, useless AI. If the ruler gives the random AI a high score, your ruler is broken and you can't trust it for your real AI.
  4. Know your data. If your data has factors that are naturally linked (correlated), you must pick a ruler that handles that, or you will get fooled.
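Checklist item 3, the "Null Test," can be wrapped into a small harness (a sketch; `naive_score` is a hypothetical stand-in metric, not one from the paper): feed the metric encoders that output pure noise and check that it reports near-chance scores before trusting it on your real model.

```python
import numpy as np

def null_test(metric, z, rep_dim, n_trials=20, seed=0):
    """Score `metric` against encoders that output pure noise. If these
    null scores are high, the metric cannot be trusted on a real encoder."""
    rng = np.random.default_rng(seed)
    n = z.shape[0]
    scores = [metric(z, rng.standard_normal((n, rep_dim)))
              for _ in range(n_trials)]
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical stand-in metric: mean |correlation| between paired dimensions.
def naive_score(z, h):
    d = min(z.shape[1], h.shape[1])
    c = np.corrcoef(z, h, rowvar=False)[:d, z.shape[1]:z.shape[1] + d]
    return float(np.abs(np.diag(c)).mean())

z = np.random.default_rng(42).standard_normal((100, 5))  # true factors
mean, std = null_test(naive_score, z, rep_dim=5)
print(f"null score: {mean:.2f} +/- {std:.2f}")  # should sit near zero
```

If the null scores come back far from zero, the metric is "hallucinating" structure, and any score it gives your trained model is suspect.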

Summary

The paper is a wake-up call. It tells us that the tools we use to judge AI "understanding" are often flawed. They are like thermometers that give the wrong temperature if the room is windy or if you hold them too close to your hand.

The bottom line: We need to stop blindly trusting these scores. We need to understand the "structural conditions" (the shape of the data and the model) before we can trust that the AI has actually learned the truth, rather than just memorizing a trick.
