The Big Question: Do AI and Humans Make Mistakes the Same Way?
Imagine you are taking a difficult driving test. You and a self-driving car both get stuck in a heavy fog.
- The Human might slow down, squint, and guess the road is to the left because they remember the smell of the grass there.
- The Car might stop completely because its sensors can't see the white lines.
Both of you "failed" to drive through the fog, but you failed for different reasons.
For a long time, scientists have only asked: "Who got the right answer?" If both you and the car got 90% of the questions right on a clear day, we assumed they were the same. But this paper argues that getting the right answer isn't enough. To truly trust an AI, we need to know: When it gets things wrong, does it get them wrong the same way a human does?
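One standard way to put a number on "failing the same way" is error consistency: Cohen's kappa computed over per-image right/wrong decisions. The sketch below is an illustrative formulation, not necessarily the exact metric this paper uses.

```python
def error_consistency(correct_a, correct_b):
    """Cohen's kappa over per-image correctness (1 = correct, 0 = wrong).

     1.0 -> the two observers are right and wrong on exactly the same images
     0.0 -> they overlap no more than their accuracies alone would predict
    <0.0 -> they disagree more often than chance would predict
    """
    n = len(correct_a)
    # How often the two observers agree, image by image.
    observed = sum(bool(x) == bool(y) for x, y in zip(correct_a, correct_b)) / n
    # Agreement you'd expect by chance from their accuracies alone.
    p_a = sum(map(bool, correct_a)) / n
    p_b = sum(map(bool, correct_b)) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:  # degenerate case: both always right (or always wrong)
        return 1.0
    return (observed - expected) / (1 - expected)

# Two observers with IDENTICAL 75% accuracy but different error patterns:
human = [1, 1, 1, 0]   # wrong on the last image
model = [1, 0, 1, 1]   # same score, but wrong on a different image
print(error_consistency(human, human))  # 1.0   (perfectly human-like errors)
print(error_consistency(human, model))  # -0.33 (same accuracy, alien errors)
```

This is exactly why accuracy alone can't answer the question: both observers score 75%, yet their kappa is negative, meaning their mistakes overlap less than chance.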
The Problem: The "Volume Knob" Confusion
The researchers tried to test this by showing humans and AI pictures that were "broken" or distorted (like blurry, noisy, or upside-down images). They wanted to see how the errors changed as the pictures got worse.
But they hit a snag. It was like trying to compare two different volume knobs:
- Knob A (Low-pass filter) turns the music into a muffled hum.
- Knob B (High-pass filter) turns the music into a harsh screech.
If you turn Knob A to "5" and Knob B to "5," are they equally loud? No. One might be barely audible, while the other is deafening.
Previous studies just turned the knobs to the same number and compared the results. This is unfair, because the "difficulty" for a human brain isn't actually the same at the same setting: one might be easy, and the other impossible.
The Solution: The "Human Difficulty Scale"
To fix this, the authors created a Human-Centred OOD Spectrum. (OOD stands for "Out-Of-Distribution": inputs unlike anything the model saw during training.)
Think of it like a thermometer for confusion. Instead of measuring how much they twisted the image knobs, they measured how confused the humans got.
- Reference: A clear, sunny day (Easy).
- Near-OOD: A light drizzle (Moderately hard).
- Far-OOD: A heavy storm (Very hard).
- Extreme-OOD: A blizzard where you can't see your hand (Impossible).
They mapped every distorted image onto this scale based on human accuracy. If humans got 50% right on a blurry image, that image belongs in the "Moderately Hard" zone, regardless of what kind of filter created it. This allowed them to compare apples to apples.
The Four "Regimes" of Failure
Once they sorted the images by how hard they were for humans, they found four distinct zones where AI behaves differently:
- The Reference Zone (Easy): Everyone gets it right. Boring.
- Near-OOD (The "Tricky" Zone): Things get a little messy.
- Far-OOD (The "Stormy" Zone): Things get really broken.
- Extreme-OOD (The "Black Hole" Zone): The image is so broken that even humans are just guessing. The researchers excluded this zone, because once humans are reduced to guessing, there is no meaningful human behaviour left to compare the AI against.
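Sorting images into these regimes amounts to thresholding on measured human accuracy. The cut-off values below are made-up placeholders for illustration; the paper defines its own boundaries.

```python
def regime(human_accuracy, chance_level=0.1):
    """Assign a distorted image to a difficulty regime from human accuracy.

    NOTE: these thresholds are illustrative placeholders, not the paper's
    published boundaries. chance_level ~ 1/num_classes (0.1 for 10 classes).
    """
    if human_accuracy >= 0.90:
        return "Reference"    # clear day: everyone gets it right
    if human_accuracy >= 0.50:
        return "Near-OOD"     # light drizzle: moderately hard
    if human_accuracy > chance_level + 0.05:
        return "Far-OOD"      # heavy storm: very hard
    return "Extreme-OOD"      # blizzard: humans are guessing, so excluded

# The key point: the regime depends only on human accuracy, never on which
# filter produced the distortion or how far its "knob" was turned.
print(regime(0.95))  # Reference
print(regime(0.50))  # Near-OOD
print(regime(0.30))  # Far-OOD
print(regime(0.10))  # Extreme-OOD
```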
The Big Discoveries: Who is the Most "Human"?
The researchers tested three types of AI "architectures" (different brain structures):
- CNNs (Convolutional Neural Networks): The old-school, reliable workers (like a classic car).
- ViTs (Vision Transformers): The modern, high-tech processors (like a sports car).
- VLMs (Vision-Language Models): The "multitaskers" that can see and read (like a librarian who can also drive).
Here is what they found:
1. The "Near-OOD" Surprise
When the images were just a little bit distorted (Near-OOD):
- CNNs acted very much like humans. They made similar mistakes.
- ViTs (the modern ones) were actually less like humans. They got the right answers more often, but when they failed, they failed in weird, non-human ways.
- VLMs were also very human-like here.
Analogy: In a light drizzle, the classic car (CNN) drives much like a human driver. The sports car (ViT) drives faster but takes corners in a way a human wouldn't.
2. The "Far-OOD" Flip
When the images were heavily distorted (Far-OOD):
- CNNs crashed. They stopped making sense and their errors became random.
- ViTs suddenly became very human-like! They handled the chaos much better than the old-school models.
- VLMs remained the most consistent. They stayed human-like in both the light drizzle and the heavy storm.
Analogy: In a heavy storm, the classic car (CNN) spins out and stops. The sports car (ViT) suddenly finds a way to drive that feels surprisingly human. The librarian-driver (VLM) keeps driving steadily the whole time.
Why Does This Matter?
The paper concludes with a crucial insight: High accuracy is a trap.
If you only look at who got the most points, you might think the ViT is the best. But if you look at how they fail, you see that:
- CNNs are fragile; they break when things get really weird.
- ViTs are good at handling weirdness, but they have a different "personality" than humans when things are normal.
- VLMs (Vision-Language Models) are the most "human" overall. Because they learned from text and images together, they seem to have a "common sense" that helps them fail gracefully, just like we do.
The Takeaway
We shouldn't just ask, "Is the AI smart?" We should ask, "Does the AI break like a human?"
If an AI makes mistakes that look like human mistakes, it is more predictable and trustworthy. If it makes weird, alien mistakes, it might be dangerous in the real world. This new "Human Difficulty Scale" is a tool to help us build AI that doesn't just get the right answers, but understands the world the way we do.