Imagine you have a giant, 15-by-15 checkerboard. Some squares are painted black, and some are left white. Your job is to look at this picture and tell a computer exactly which squares are black.
You might think, "No problem! AI is great at seeing pictures." But this paper reveals a funny, frustrating secret: These AI models are actually terrible at seeing "pure" pictures, but they are amazing at reading "pictures that look like text."
Here is the story of the experiment, explained simply.
The Two Versions of the Same Puzzle
The researchers created the exact same checkerboard patterns in two different ways:
- The "Typewriter" Version: They drew the grid using text characters. Empty squares were dots (.) and filled squares were hash signs (#).
  - Example: .#.#.
- The "Solid Block" Version: They drew the grid using solid black and white squares, with no grid lines and no text. Just a blob of black pixels.
- Example: A solid black square next to a white one.
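The two renderings above can be sketched in a few lines of Python. This is purely illustrative: the paper's actual rendering details (grid size, fill rate, pixel dimensions) are assumptions, and the "solid block" image is represented here as a plain grayscale pixel matrix rather than an image file.

```python
import random

def make_grid(n=15, fill=0.3, seed=42):
    """Randomly paint some cells black (True) on an n-by-n board.
    (n, fill, and seed are illustrative choices, not the paper's.)"""
    rng = random.Random(seed)
    return [[rng.random() < fill for _ in range(n)] for _ in range(n)]

def typewriter(grid):
    """'Typewriter' version: '#' for black cells, '.' for white ones."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)

def solid_blocks(grid, cell_px=16):
    """'Solid block' version: a 2D grayscale pixel array (0 = black,
    255 = white), each cell blown up to cell_px-by-cell_px pixels
    with no grid lines and no text."""
    pixels = []
    for row in grid:
        pixel_row = [0 if cell else 255 for cell in row for _ in range(cell_px)]
        pixels.extend([pixel_row[:] for _ in range(cell_px)])
    return pixels

grid = make_grid(5)           # small demo board
print(typewriter(grid))       # a 5x5 pattern of '.' and '#'
img = solid_blocks(grid)      # an 80x80 pixel matrix of the same pattern
```

Both functions encode exactly the same underlying pattern; only the surface form differs, which is the whole point of the experiment.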
The Twist: To the computer's "eye" (the visual encoder), both of these are just images. Neither is "real" text that the computer can read like a book. They are both just pixels.
The Results: A Tale of Two Brains
When the researchers asked three top-tier AI models (Claude, ChatGPT, and Gemini) to transcribe these grids, the results were shocking:
- When the grid looked like text (.#.#): The AIs were nearly perfect. They got about 90% of the squares right. They could tell you exactly where every black square was.
- When the grid looked like solid blocks: The AIs crashed and burned. Their accuracy dropped to 60–70%, and their ability to find the exact black squares plummeted to 30–40%.
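Those two headline numbers measure different things: overall cell accuracy versus how well the model finds the black squares specifically. A sketch of how such scores might be computed (the paper's exact metric definitions are assumptions here; a black-square F1 score stands in for "ability to find the exact black squares"):

```python
def cell_accuracy(truth, pred):
    """Fraction of all cells transcribed correctly."""
    cells = [(t, p) for tr, pr in zip(truth, pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in cells) / len(cells)

def black_f1(truth, pred):
    """F1 on the black cells only. On a sparse grid, overall accuracy
    can look decent even when most black squares are missed."""
    tp = sum(t and p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    fp = sum((not t) and p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    fn = sum(t and (not p) for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = [[1, 0, 1], [0, 0, 0], [1, 1, 0]]
pred  = [[1, 0, 0], [0, 0, 0], [1, 0, 0]]   # model missed two black squares
print(cell_accuracy(truth, pred))  # ~0.78 overall...
print(black_f1(truth, pred))       # ...but only ~0.67 on the black squares
```

The gap between the two scores in the toy example mirrors the paper's finding: the block-image accuracy (60–70%) overstates how well the models actually localize the black squares (30–40%).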
The Analogy:
Imagine you are trying to find a specific person in a crowd.
- Scenario A (Text): Everyone is wearing a nametag with their name written in big letters. You can read the names and find the person instantly.
- Scenario B (Blocks): Everyone is wearing a solid black mask. You can see a crowd of black shapes, but you can't distinguish one person from another or find the specific one you need.
The AI models are like people who are super-fast at reading nametags but are terrible at spotting faces in a crowd of masks.
Why Does This Happen?
The paper suggests that these AIs have a "cheat code" they use when they see text.
- The "OCR" Superpower: When the AI sees the # symbol, it doesn't just see a black shape. It recognizes it as a "hash sign." Because it knows this is a character, it uses its reading brain (which is very good at knowing where letters are on a page) to figure out the location. It's essentially doing an internal "Optical Character Recognition" (reading the image as text) to solve a spatial puzzle.
- The "Visual" Weakness: When the AI sees a solid black square, it can't use its reading brain. It has to rely on its seeing brain (pure visual processing). The paper shows that the "seeing brain" of current AIs is actually quite fuzzy when it comes to pinpointing exact locations. It can tell you "there's a black blob over there," but it can't tell you "that is exactly the square at row 2, column 4."
The "Ghost" in the Machine
The researchers found that each AI failed in its own unique, weird way when looking at the solid blocks:
- Claude was an under-counter. It saw the black area but thought, "That's too many squares," so it skipped some.
- ChatGPT was an over-counter. It got excited and hallucinated extra black squares that didn't exist, making the blobs look bigger than they were.
- Gemini was a pattern faker. When the grid got too crowded, it just gave up and drew a random "plus sign" or "L-shape" pattern that looked nothing like the real grid. It was essentially guessing based on what it thought a grid should look like.
The "Magic Label" Experiment
To prove their theory, the researchers tried a middle ground. They took the solid black squares and wrote a tiny "1" inside the black ones and a "0" inside the white ones.
- For Claude and Gemini: This worked like magic! The moment they saw the numbers, their performance skyrocketed back to near-perfect. The text "anchored" their vision.
- For ChatGPT: It got confused. The tiny text inside the black squares actually made it worse. It seems ChatGPT's "reading brain" and "seeing brain" got tangled up and tripped over each other.
The Big Takeaway
This paper teaches us a humbling lesson about modern AI:
These models aren't "seeing" the world the way humans do. They are mostly "reading" the world. If you give them a picture that looks like a document or a spreadsheet, they are geniuses. But if you give them a picture that is just shapes, colors, and blobs without any text, their spatial reasoning falls apart.
It's like having a librarian who can find any book on a shelf if the spines have titles, but who gets completely lost if the books are all wrapped in plain brown paper. Until we fix this, we can't fully trust these AIs to navigate the visual world unless we give them text labels to hold onto.