Certainty-Validity: A Diagnostic Framework for Discrete Commitment Systems

This paper introduces the Certainty-Validity (CVS) Framework, a diagnostic tool for discrete commitment systems. By distinguishing appropriate uncertainty from harmful confident hallucinations, it exposes a critical blind spot in standard accuracy metrics and argues that effective training for reasoning systems should maximize the CVS score, preventing models from overcommitting to ambiguous data.

Datorien L. Anderson

Published 2026-03-03
📖 6 min read · 🧠 Deep dive

Imagine you are hiring a new employee to sort mail. You have two candidates: Candidate A and Candidate B.

  • Candidate A is a speed demon. They sort 100 letters in a minute. They get 83 of them right. But for the 17 letters that are confusing or torn, they guess wildly, shout "I'm sure!" and put them in the wrong bin.
  • Candidate B is a bit slower. They also sort 100 letters. They get 83 right. But for the 17 confusing letters, they pause, look at them, and say, "I'm not sure what this is," and set them aside in a "Needs Review" pile.

In the old world of machine learning, both candidates are rated exactly the same. The standard score (Accuracy) only counts how many you got right. It doesn't care how you got them right. It treats a confident mistake the same as a humble mistake.

This paper argues that for "discrete commitment systems" (AI models that make firm decisions like "Yes," "No," or "I don't know"), this old way of scoring is broken. It introduces a new way to judge them called Certainty-Validity (CVS).
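The mail-sorting story can be made concrete with a toy calculation (the candidate names and bin labels are invented for illustration, not from the paper): plain accuracy literally cannot tell the two candidates apart.

```python
# Toy illustration: standard accuracy ignores *how* answers were produced.
# Both candidates sort the same 100 letters and get the same 83 right.
# Candidate A guesses confidently on the 17 hard ones; B sets them aside.

def accuracy(predictions, labels):
    """Standard accuracy: fraction of exact matches, nothing else."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

labels      = ["bin_1"] * 83 + ["bin_2"] * 17   # 17 genuinely hard letters
candidate_a = ["bin_1"] * 83 + ["bin_3"] * 17   # confident wrong guesses
candidate_b = ["bin_1"] * 83 + ["REVIEW"] * 17  # honest "needs review"

# Accuracy rates them identically: a confident mistake and a humble
# abstention both count as "not a match", full stop.
assert accuracy(candidate_a, labels) == accuracy(candidate_b, labels) == 0.83
```

The metric returns 0.83 for both; nothing in its definition even has a slot for confidence.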

Here is the breakdown of the paper's big ideas, using simple analogies:

1. The "83% Ceiling" Mystery

The researchers noticed something weird. No matter how much they trained their AI on standard tests (like recognizing clothes, handwriting, or movie reviews), the AI would hit a wall at 83% accuracy. It couldn't go higher.

  • The Old Theory: "The AI is too dumb. It needs more brainpower or better math to get past 83%."
  • The New Theory (The Paper's Discovery): "The AI isn't dumb; the test is messy."

Think of the test like a bag of fruit. 83% of the fruit are clearly apples, oranges, and bananas. The other 17% are weird hybrids (like a tomato that looks like a strawberry).

  • When the AI sees a clear apple, it says "Apple!" and gets it right.
  • When it sees the weird hybrid, it should say, "I don't know."
  • But standard training forces the AI to guess anyway. If it guesses, it's wrong. If it says "I don't know," it's also "wrong" according to the test because the test demands a specific answer.

The paper proves that if you remove the "weird hybrids" (the ambiguous data) from the test, the AI suddenly jumps to 97% or 99% accuracy. The "ceiling" wasn't the AI's limit; it was the limit of the messy data.
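The arithmetic behind the ceiling can be sketched directly. The split (83% clean, 17% ambiguous) follows the article; the per-subset accuracies below are assumed round numbers, not figures from the paper.

```python
# Sketch of the "ceiling" arithmetic: overall accuracy is a weighted
# average of accuracy on clean items and accuracy on ambiguous items.

def overall_accuracy(clean_frac, acc_on_clean, acc_on_ambiguous):
    ambiguous_frac = 1.0 - clean_frac
    return clean_frac * acc_on_clean + ambiguous_frac * acc_on_ambiguous

# Near-perfect on the clean 83%, chance-level guessing on the rest:
full = overall_accuracy(0.83, 0.99, 0.10)
print(round(full, 3))  # → 0.839 — stuck near the "83% ceiling"

# Score only the clean subset (the ambiguous hybrids removed):
filtered = overall_accuracy(1.0, 0.99, 0.0)
print(filtered)        # → 0.99 — the ceiling vanishes
```

No amount of extra "brainpower" moves the first number much: as long as the 17% is unanswerable, the weighted average is capped.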

2. The Four Quadrants: The "Confidence vs. Correctness" Grid

The authors created a new scoreboard with four boxes to see what the AI is actually doing:

|                                  | Correct                                                                                             | Incorrect                                                                                      |
|----------------------------------|-----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| High Confidence ("I'm sure!")    | 🌟 Confident-Correct (The Gold Standard: "I know this is an apple.")                                 | ☠️ Confident-Incorrect (The Danger Zone: "I'm 100% sure this is an apple," but it's a tomato.)  |
| Low Confidence ("I'm not sure.") | 🤔 Uncertain-Correct (Lucky guess or honest hesitation: "I think it's an apple, but I'm not sure.") | 🛡️ Uncertain-Incorrect (The Healthy State: "I don't know what this is, so I won't guess.")      |

The Big Insight:

  • Confident-Incorrect (CI) is the real failure. This is "hallucination." The AI is lying to you with confidence.
  • Uncertain-Incorrect (UI) is actually a feature, not a bug. It means the AI knows when it's out of its depth. It's like a doctor saying, "I don't know what this is, let's get a specialist," rather than prescribing the wrong medicine.
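The four boxes reduce to a tiny classifier over two signals: confidence and correctness. The quadrant names follow the article; the paper's actual confidence measure isn't reproduced here, so a simple cutoff threshold is assumed for illustration.

```python
# Minimal sketch of the four-quadrant scoreboard (threshold assumed).

def quadrant(confidence, is_correct, threshold=0.5):
    sure = confidence >= threshold
    if sure and is_correct:
        return "confident-correct"    # 🌟 the gold standard
    if sure and not is_correct:
        return "confident-incorrect"  # ☠️ the danger zone: hallucination
    if is_correct:
        return "uncertain-correct"    # 🤔 honest hesitation (or luck)
    return "uncertain-incorrect"      # 🛡️ the healthy "I don't know"

assert quadrant(0.95, True)  == "confident-correct"
assert quadrant(0.95, False) == "confident-incorrect"
assert quadrant(0.30, False) == "uncertain-incorrect"
```

Note that the two "failure" cells get opposite treatment: confident-incorrect is the outcome to drive to zero, while uncertain-incorrect is tolerated as safe abstention.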

3. The "Benign Overfitting" Trap

Here is the scary part the paper uncovers. As you train an AI longer and longer to get that perfect score:

  1. It starts out honest. It sees the weird data and says, "I don't know." (High Uncertainty).
  2. As you force it to keep training, it stops saying "I don't know."
  3. Instead, it starts guessing confidently on the weird data to please the teacher.

The paper calls this Benign Overfitting.

  • Old View: "Look! The accuracy went up slightly! The model is getting better!"
  • New View: "Look! The model stopped admitting it was confused and started confidently guessing wrong. It traded honesty for a slightly higher score."

It's like a student who stops studying the hard concepts and just memorizes the answer key. They get the right score, but they don't actually understand the material. If you ask them a slightly different question, they will confidently give the wrong answer.
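The trade described above can be shown with invented numbers (these are illustrative, not the paper's measurements): the headline metric barely moves while the failure mode quietly flips from safe abstention to confident hallucination.

```python
# Illustrative sketch of the benign-overfitting trade: as training runs
# longer, "I don't know" answers (UI) on ambiguous items turn into
# confident wrong guesses (CI). All numbers below are made up.

epochs = [
    # (epoch, accuracy, uncertain_incorrect, confident_incorrect)
    (1,  0.82, 15,  2),
    (10, 0.83, 10,  7),
    (50, 0.84,  1, 15),
]

for epoch, acc, ui, ci in epochs:
    print(f"epoch {epoch:>2}: acc={acc:.2f}  honest-IDK={ui:>2}  hallucinations={ci:>2}")

# Accuracy creeps from 0.82 to 0.84 — a "win" by the old scoreboard —
# while confident-incorrect errors climb from 2 to 15.
```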

4. The "Platonic Spike"

At the very beginning of training (Epoch 1), the AI often shows a "Platonic Spike."

  • What it is: The AI gets more right on the test than on the training data.
  • Why it happens: It has discovered the "soul" or "structure" of the problem (e.g., "Shoes have laces, pants have legs") before it starts memorizing the specific details.
  • The Lesson: The best AI isn't the one trained the longest. It's often the one at the "Platonic Spike" moment, where it understands the rules but hasn't yet started memorizing the noise.
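One way to act on that lesson is checkpoint selection: instead of keeping the last (most-trained) checkpoint, keep the one where held-out performance peaks relative to training performance. The numbers below are illustrative, not the paper's.

```python
# Sketch: pick the checkpoint at the "spike", where the model generalizes
# best relative to how much it has memorized (test_acc - train_acc).

checkpoints = [
    # (epoch, train_acc, test_acc) — invented values
    (1,  0.70, 0.78),  # the spike: rules learned, noise not yet memorized
    (5,  0.90, 0.83),
    (50, 0.99, 0.81),  # longest-trained, but memorizing the noise
]

best = max(checkpoints, key=lambda c: c[2] - c[1])
print(best[0])  # → 1
```

This is essentially early stopping, but with the gap (rather than test accuracy alone) as the selection signal.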

5. Real-World Analogy: Video Game Reviews

The authors apply this to video games. Imagine a game developer trying to understand player feedback:

  • Confident-Correct: Fans who loved the game and got exactly what they expected. (Great!)
  • Confident-Incorrect: Fans who bought the game expecting a horror game, but got a farming sim, and left a 1-star review. This is the disaster. The marketing lied, or the onboarding failed.
  • Uncertain-Incorrect: People who tried a weird new genre, didn't like it, but knew they were taking a risk. This is fine. They knew the risk.

The goal isn't to get 100% of people to love the game. The goal is to ensure that the people who don't love it knew they might not like it before they bought it.

Summary: What Should We Do?

The paper concludes that we need to stop judging AI (and humans) solely on "How many did you get right?"

Instead, we should ask: "Did you know when you were wrong?"

  • Good Training: Maximizes the Certainty-Validity Score. This means the AI is confident when it's right, and admits uncertainty when it's wrong.
  • Bad Training: Forces the AI to guess on everything, turning honest uncertainty into dangerous hallucinations.
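The article does not give the CVS formula, so the weights below are one plausible shape (an assumption, not the paper's definition): reward confident correctness, penalize confident errors hardest, and treat honest abstention as neutral.

```python
# Assumed scoring sketch — NOT the paper's actual CVS formula.
WEIGHTS = {
    "confident-correct":    1.0,   # the gold standard
    "uncertain-correct":    0.5,   # right, but hedged
    "uncertain-incorrect":  0.0,   # safe abstention: no penalty
    "confident-incorrect": -1.0,   # hallucination: heavily penalized
}

def cvs_score(outcomes):
    """Mean weighted outcome over a list of quadrant labels."""
    return sum(WEIGHTS[o] for o in outcomes) / len(outcomes)

# Same 83% accuracy, very different behavior on the hard 17%:
honest  = ["confident-correct"] * 83 + ["uncertain-incorrect"] * 17
guesser = ["confident-correct"] * 83 + ["confident-incorrect"] * 17

print(cvs_score(honest))   # → 0.83
print(cvs_score(guesser))  # → 0.66
```

Under these assumed weights, Candidate B from the opening analogy finally outscores Candidate A, which is exactly the behavior the paper wants the metric to reward.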

The Takeaway: A model that says "I don't know" is often smarter and safer than a model that confidently guesses "I know" when it doesn't. The 83% limit isn't a failure; it's the AI politely telling us, "The rest of this data is too messy for me to make a promise about."