Shape vs. Context: Examining Human--AI Gaps in Ambiguous Japanese Character Recognition

This paper investigates the behavioral gap between humans and Vision-Language Models in recognizing ambiguous Japanese characters, demonstrating that their decision boundaries differ in shape-only tasks, though contextual information can partially align VLM behavior with human judgments.

Daichi Haraguchi

Published 2026-03-02
📖 5 min read · 🧠 Deep dive

The Big Idea: Are AI Brains Like Human Brains?

Imagine you have a super-smart robot that can read almost anything perfectly. It gets 99% of the answers right on a test. But does it "think" the way you do when it's confused?

This paper asks a simple question: When the robot sees something blurry or ambiguous, does it guess the same way a human would?

The author, Daichi Haraguchi, decided to test this using two very similar Japanese characters that look almost identical:

  • ソ (So): a short tick plus a long stroke that sweeps down from the top right.
  • ン (n): a short tick plus a long stroke that sweeps up from the bottom left.

The difference between them is tiny—just the angle of one little line. To humans, this is a classic "is it a duck or a rabbit?" optical illusion.

The Experiment: Blurring the Lines

To test this, the researcher didn't just use clear pictures. He used a special AI tool (a β-VAE) to create a smooth gradient of images.

The Analogy: The Color Mixer
Imagine you have a bucket of blue paint (the character "So") and a bucket of red paint (the character "n").

  • Step 1: You pour 100% blue. It's clearly blue.
  • Step 2: You pour 100% red. It's clearly red.
  • Step 3: You mix them. You get purple. Then darker purple. Then lighter purple.

The researcher created 15 versions of these characters, ranging from "100% So" to "100% n," with every tiny shade of "maybe-So-maybe-n" in between.
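Under the hood, this paint mixing is just a straight line in the β-VAE's latent space. The paper's code isn't reproduced here, so the sketch below is only an illustration of the idea: `encode` and `decode` are placeholder stand-ins for a trained β-VAE's encoder and decoder, and linear interpolation between the two latent codes is an assumption about the exact morphing procedure.

```python
import numpy as np

# Placeholder stand-ins for a trained beta-VAE's encoder/decoder.
# In the study these would map character images to/from a learned
# latent space; here they are dummies so the interpolation runs.
def encode(image: np.ndarray) -> np.ndarray:
    """Map a 32x32 image to a latent vector (placeholder)."""
    return image.reshape(-1)[:16].astype(np.float32)

def decode(z: np.ndarray) -> np.ndarray:
    """Map a latent vector back to a 32x32 image (placeholder)."""
    return np.tile(z, 64)[:1024].reshape(32, 32)

# Encode a clean "So" and a clean "n", then walk the straight line
# between their latent codes in 15 steps: the paint-mixing idea.
img_so = np.zeros((32, 32), dtype=np.float32)  # stand-in for a clean ソ
img_n = np.ones((32, 32), dtype=np.float32)    # stand-in for a clean ン
z_so, z_n = encode(img_so), encode(img_n)

morphs = []
for alpha in np.linspace(0.0, 1.0, 15):        # 0.0 = pure So, 1.0 = pure n
    z_mix = (1 - alpha) * z_so + alpha * z_n   # linear mix in latent space
    morphs.append(decode(z_mix))
```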

He then asked two groups to look at these blurry images and guess what they were:

  1. Humans: Real people taking a survey.
  2. AI Models: Two famous AI chatbots (GPT and Gemini).

Part 1: The "Shape-Only" Test (Looking at the Blurry Blob)

The Setup: The AI and humans were shown only the single, blurry character. No other words, no context. Just the blob.
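In practice, "no context" means the prompt itself gives nothing away: one image, one forced choice. The paper's exact prompts aren't published here, so the sketch below is a hypothetical version; `ask_vlm` stands in for whichever VLM API (GPT, Gemini, ...) you actually call.

```python
import base64

def build_prompt() -> str:
    # Forced choice, zero context: the model sees one character and
    # must pick between the two readings.
    return (
        "The image shows a single Japanese katakana character. "
        "Answer with exactly one character: ソ or ン."
    )

def image_to_data_url(path: str) -> str:
    # Encode a morph image so it can be attached to a chat-style
    # VLM request as a data URL.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/png;base64,{b64}"

def classify_all(morph_paths, ask_vlm):
    # `ask_vlm(prompt, image_url) -> str` is a hypothetical wrapper
    # around whichever VLM you query.
    votes = []
    for path in morph_paths:
        answer = ask_vlm(build_prompt(), image_to_data_url(path))
        votes.append(1 if "ン" in answer else 0)  # 1 = voted "n"
    return votes
```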

The Result:

  • Humans: As the image got more like "n," humans smoothly switched their votes from "So" to "n." It was a clean, logical line.
  • The AI: The AI was weird.
    • One AI (GPT) kept insisting it was "So" even when the image was almost 100% "n." It was stubborn.
    • The other AI (Gemini) was confused and didn't switch its vote as smoothly as humans did.

The Takeaway: Even when the picture is clear enough for a human to be 100% sure, the AI may still hesitate or guess wrong based on its own internal biases. The models don't see the world the same way we do.
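One way to make "decision boundary" concrete is to fit a psychometric (logistic) curve to each group's votes across the 15 morph steps: where the curve crosses 50% is the boundary, and its steepness is how cleanly the votes flip. The numbers below are synthetic, chosen only to mimic the qualitative pattern described above (humans flip near the middle, the stubborn model flips late); they are not the paper's data.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Psychometric curve: P(vote 'n') at morph level x.

    x0 is the decision boundary (50% point); k is the steepness."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

x = np.linspace(0.0, 1.0, 15)        # morph level: 0 = pure So, 1 = pure n
# Synthetic vote fractions that only mimic the described pattern:
human_p = logistic(x, 0.50, 12.0)    # humans flip cleanly near the middle
model_p = logistic(x, 0.85, 6.0)     # the "stubborn" model flips late

(h_x0, h_k), _ = curve_fit(logistic, x, human_p, p0=[0.5, 10.0])
(m_x0, m_k), _ = curve_fit(logistic, x, model_p, p0=[0.5, 10.0])

# A gap in x0 (where the flip happens) or in k (how sharply it
# happens) is exactly the kind of misalignment the paper measures.
print(f"human boundary: {h_x0:.2f}  model boundary: {m_x0:.2f}")
```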

Part 2: The "Context" Test (The Sentence Puzzle)

The Setup: Now, the researcher put that same blurry character inside a word.

  • Example: He took the word "Dance" (ダンス) and replaced its middle character with the blurry blob.
  • Scenario A: The word is "Dance" (ダンス). The context strongly suggests the blob is "n" (ン).
  • Scenario B: The word is "So-so" (ソソ). The context suggests the blob is "So" (ソ).

The Question: Does putting the blurry blob in a sentence help the AI guess correctly, just like it helps humans?
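Building such a stimulus is mostly image plumbing: render the clean neighboring characters and drop the morphed blob in between. A minimal Pillow sketch follows; the font path is an assumption (point it at any font on your system that covers katakana), and `morph_img` would be one of the 15 morphed images from earlier.

```python
from PIL import Image, ImageDraw, ImageFont

def make_context_stimulus(morph: Image.Image, left: str, right: str,
                          font_path: str = "NotoSansJP-Regular.otf") -> Image.Image:
    # Place the ambiguous blob between two clean katakana characters.
    # font_path is an assumption; use any katakana-capable font.
    cell = morph.size[0]
    font = ImageFont.truetype(font_path, cell)
    canvas = Image.new("L", (cell * 3, cell), color=255)
    draw = ImageDraw.Draw(canvas)
    draw.text((0, 0), left, font=font, fill=0)          # clean left character
    canvas.paste(morph, (cell, 0))                      # ambiguous blob
    draw.text((cell * 2, 0), right, font=font, fill=0)  # clean right character
    return canvas

# Scenario A: the blob between ダ and ス reads as "ダンス" if it is ン.
# stimulus = make_context_stimulus(morph_img, "ダ", "ス")
```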

The Result:

  • Humans: We are great at using context. If we see "D_nce," we instantly know it's "Dance," even if the 'a' is scribbled out.
  • The AI: It got better, but not perfectly.
    • When the word had other clear clues (like another "n" elsewhere in the word), the AI started acting more like a human.
    • However, the AI still had its own "personality." Sometimes it ignored the context and stuck to its weird shape-based biases from the first test.

The Analogy: The Detective

  • Humans are like a detective who looks at the crime scene (the shape) but also checks the alibi (the context). If the alibi is strong, they ignore the blurry fingerprint.
  • The AI is like a detective who is obsessed with the fingerprint. Even if the alibi says "It's definitely the butler," the AI might still say, "But the fingerprint looks a little like the gardener!"

Why Does This Matter?

You might think, "Well, if the AI gets the right answer 95% of the time, who cares how it thinks?"

The Author's Point:
It matters because accuracy isn't everything.

  • If an AI makes a mistake, we want to know why.
  • If an AI is confident but wrong because it ignores context, that's dangerous in real life (like in medical diagnosis or self-driving cars).
  • This study shows that we can't just test AI by giving them clear pictures. We have to test them when things are blurry and confusing to see if they think like us.

The Conclusion

The paper concludes that AI and humans are not aligned in how they handle ambiguity.

  • AI has its own "decision boundaries" that are different from ours.
  • Context helps, but it doesn't fix the AI's weird brain completely.

The Final Lesson: To truly understand if AI is "safe" or "aligned" with humans, we shouldn't just ask, "Did it get the answer right?" We need to ask, "Did it figure it out the way a human would?" And right now, the answer is: Not quite.
