When Visual Evidence is Ambiguous: Pareidolia as a Diagnostic Probe for Vision Models

This paper introduces a diagnostic framework that uses face pareidolia to reveal how vision models behave under visual ambiguity. The findings suggest that this behavior is primarily governed by representational architecture: vision-language models exhibit semantic overactivation, pure vision classifiers fall back on uncertainty-based abstention, and detection models rely on conservative priors to suppress false positives.

Qianpu Chen, Derya Soydaner, Rob Saunders

Published 2026-03-05

Imagine you are walking down a street and you see a strange shape on a wall. To you, it looks exactly like a grumpy face with furrowed brows. To your friend, it's just a weirdly shaped electrical outlet.

This phenomenon is called pareidolia—the human brain's tendency to see faces in random patterns (like clouds, toast, or car grilles).

This paper asks a simple but profound question: If we show these "fake faces" to AI, what will they see? Will they act like your friend (ignoring it), or will they act like you (seeing a face)? And more importantly, how do they decide?

The researchers didn't just ask "Did the AI see a face?" They built a diagnostic tool to understand the personality and decision-making style of different AI models when they are confused.

Here is the breakdown of their findings using simple analogies:

1. The Three Types of AI "Personalities"

The researchers tested six different AI models. They found that the models fall into three distinct groups based on how they handle ambiguity:

Group A: The Over-Enthusiastic Storyteller (Vision-Language Models)

  • The Models: CLIP and LLaVA.
  • The Analogy: Imagine a detective who is so obsessed with solving "The Case of the Missing Face" that they start seeing suspects everywhere. Even if the evidence is weak (like a cloud that looks sort of like a nose), this detective says, "I'm 99% sure that's a person!"
  • The Finding: These models are confident but wrong. They have a strong bias toward seeing "Humans" in everything. The more advanced the model (LLaVA), the more confident it is in its hallucinations. They don't hesitate; they just over-interpret.
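To see how a model can sound so sure about a near-tie, here is a minimal sketch of CLIP-style zero-shot classification. The labels and similarity scores are hypothetical, not from the paper; the point is that scaling near-equal similarity scores by a large learned temperature (as CLIP-style models do) turns genuine ambiguity into a confident-looking probability.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution that sums to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical image-text similarity scores for an ambiguous, face-like object.
# The three candidates are nearly tied: the image is genuinely ambiguous.
labels = [
    "a photo of a human face",
    "a photo of an electrical outlet",
    "a photo of a wall",
]
raw_similarities = [0.28, 0.24, 0.21]

# CLIP-style models multiply similarities by a learned logit scale (on the
# order of 100) before the softmax, which sharpens tiny gaps dramatically.
temperature = 100.0
probs = softmax([s * temperature for s in raw_similarities])

for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
# The "human face" label ends up with ~98% probability despite the near-tie.
```

The model is forced to distribute 100% of its belief across the labels it was given, so "confidently wrong" is a structural outcome, not a glitch.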

Group B: The Cautious Skeptic (Pure Vision Classifiers)

  • The Model: ViT.
  • The Analogy: Imagine a scientist looking at the same cloud. They squint, tilt their head, and say, "Hmm, it could be a face, or it could be a rock, or a bird. I'm not sure."
  • The Finding: This model is uncertain but fair. Instead of forcing a guess, it spreads its confidence out. It says, "I don't know," which actually prevents it from making a big mistake. It avoids bias by admitting it's confused.
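The "spreading confidence out" behavior can be made concrete with Shannon entropy, a standard measure of how spread out a probability distribution is. The numbers below are illustrative, not the paper's measurements: a hedged prediction has high entropy, a forced guess has low entropy, and a simple entropy threshold lets a system abstain instead of committing.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: high = spread out (uncertain), low = peaked (confident)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative predictions over four classes, e.g. [face, rock, bird, other].
hedged = [0.35, 0.30, 0.20, 0.15]      # "could be a face, or a rock, or a bird"
confident = [0.97, 0.01, 0.01, 0.01]   # a forced, peaked guess

print(f"hedged prediction:    {entropy(hedged):.2f} bits")
print(f"confident prediction: {entropy(confident):.2f} bits")

# A simple abstention rule: refuse to answer when entropy exceeds a threshold.
THRESHOLD = 1.0
abstain = entropy(hedged) > THRESHOLD
print("abstain on the hedged input:", abstain)
```

Abstaining on high-entropy inputs is exactly how "I don't know" prevents a big mistake: the uncertain prediction never gets acted on.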

Group C: The Strict Security Guard (Object Detectors)

  • The Models: YOLOv8 and RetinaFace.
  • The Analogy: Imagine a bouncer at a club with a very strict list. If the person at the door doesn't look exactly like a real human face, the bouncer doesn't even let them in the building to check. They just say, "Nope, not a face," and move on.
  • The Finding: These models are conservative. They are trained to find real faces, so they ignore "fake" faces entirely. They rarely make mistakes because they refuse to guess.
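The "strict list at the door" has a direct software analogue: detectors like YOLOv8 and RetinaFace apply a confidence threshold and silently discard any candidate below it. The candidate scores and the threshold value below are hypothetical, chosen only to illustrate the filtering step.

```python
# Hypothetical raw candidates from a face detector: (description, confidence).
candidates = [
    ("real human face", 0.92),
    ("face-like wall outlet", 0.18),
    ("face-like cloud", 0.07),
]

# Detectors report only candidates above a confidence threshold; the exact
# value is a tunable setting, with defaults commonly around 0.25-0.5.
CONF_THRESHOLD = 0.5

detections = [(desc, score) for desc, score in candidates if score >= CONF_THRESHOLD]
print(detections)  # only the real face survives the gate
```

The pareidolic candidates never surface at all, which is why these models look conservative: their mistakes are suppressed before anyone sees them.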

2. The Big Surprise: Confidence ≠ Safety

The most important discovery in this paper is that being confident does not mean you are right.

  • The Trap: We often think that if an AI says, "I am 99% sure this is a face," it must be safe to trust.
  • The Reality: The "Over-Enthusiastic Storyteller" (LLaVA) was the most confident model, yet it was the most likely to see a face where there wasn't one.
  • The Lesson: Low uncertainty (confidence) can mean two opposite things:
    1. Safe: The "Strict Security Guard" is confident because it knows the rules and is suppressing false alarms.
    2. Dangerous: The "Storyteller" is confident because it is hallucinating a pattern that isn't there.

3. The "Emotion" Twist

The researchers also tested how the AI reacted to the mood of the fake faces.

  • The Finding: When the fake faces looked "scared" or "angry," the "Storyteller" models (VLMs) were even more likely to say, "That's a human!"
  • Why? It seems these models treat emotional cues as extra proof that a face is real. If a cloud looks like a sad face, the AI thinks, "Aha! It's definitely a human!" This is a dangerous flaw for safety systems (like medical imaging or surveillance), where you don't want the AI to get excited by a scary-looking cloud.

4. Why Does This Matter?

We are building AI to do critical jobs: spotting tumors in X-rays, monitoring security cameras, or filtering bad content.

  • If we only test AI on clear, obvious pictures, we miss these hidden flaws.
  • Pareidolia is a stress test. It's like shaking a building to see if the foundation holds.
  • The paper shows that we can't just "tweak the settings" to fix these errors. The problem is built into how the AI is designed (its "brain architecture").
    • If you want an AI that doesn't hallucinate faces, you can't just tell it to be "less confident." You have to change how it connects language and images.

Summary

This paper teaches us that when AI gets confused, it doesn't just "fail." It fails in specific, predictable ways based on its personality:

  • Some are too eager (seeing faces everywhere).
  • Some are too unsure (admitting they don't know).
  • Some are too strict (ignoring everything).

The key takeaway is: Don't trust an AI just because it sounds confident. In the world of ambiguous images, confidence can be a mask for a very strong bias.