Imagine you have a giant, 15-by-15 checkerboard. Some squares are painted black, and some are left white. Your job is to look at this picture and tell a computer exactly which squares are black.
You might think, "No problem! AI is great at seeing pictures." But this paper reveals a funny, frustrating secret: These AI models are actually terrible at seeing "pure" pictures, but they are amazing at reading "pictures that look like text."
Here is the story of the experiment, explained simply.
The Two Versions of the Same Puzzle
The researchers created the exact same checkerboard patterns in two different ways:
- The "Typewriter" Version: They drew the grid using text characters. Empty squares were dots (.) and filled squares were hash signs (#).
  - Example: .#.#.
- The "Solid Block" Version: They drew the grid using solid black and white squares, with no grid lines and no text. Just a blob of black pixels.
- Example: A solid black square next to a white one.
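The two renderings above can be sketched in a few lines of Python. This is purely illustrative: the paper's actual rendering details (grid size, fill rate, pixel dimensions) are assumptions, and the "solid block" image is represented here as a plain grayscale pixel matrix rather than an image file.

```python
import random

def make_grid(n=15, fill=0.3, seed=42):
    """Randomly paint some cells black (True) on an n-by-n board.
    (n, fill, and seed are illustrative choices, not the paper's.)"""
    rng = random.Random(seed)
    return [[rng.random() < fill for _ in range(n)] for _ in range(n)]

def typewriter(grid):
    """'Typewriter' version: '#' for black cells, '.' for white ones."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)

def solid_blocks(grid, cell_px=16):
    """'Solid block' version: a 2D grayscale pixel array (0 = black,
    255 = white), each cell blown up to cell_px-by-cell_px pixels
    with no grid lines and no text."""
    pixels = []
    for row in grid:
        pixel_row = [0 if cell else 255 for cell in row for _ in range(cell_px)]
        pixels.extend([pixel_row[:] for _ in range(cell_px)])
    return pixels

grid = make_grid(5)           # small demo board
print(typewriter(grid))       # a 5x5 pattern of '.' and '#'
img = solid_blocks(grid)      # an 80x80 pixel matrix of the same pattern
```

Both functions encode exactly the same underlying pattern; only the surface form differs, which is the whole point of the experiment.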
The Twist: To the computer's "eye" (the visual encoder), both of these are just images. Neither is "real" text that the computer can read like a book. They are both just pixels.
The Results: A Tale of Two Brains
When the researchers asked three top-tier AI models (Claude, ChatGPT, and Gemini) to transcribe these grids, the results were shocking:
- When the grid looked like text (.#.#): The AIs were nearly perfect. They got about 90% of the squares right. They could tell you exactly where every black square was.
- When the grid looked like solid blocks: The AIs crashed and burned. Their accuracy dropped to 60–70%, and their ability to find the exact black squares plummeted to 30–40%.
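Those two headline numbers measure different things: overall cell accuracy versus how well the model finds the black squares specifically. A sketch of how such scores might be computed (the paper's exact metric definitions are assumptions here; a black-square F1 score stands in for "ability to find the exact black squares"):

```python
def cell_accuracy(truth, pred):
    """Fraction of all cells transcribed correctly."""
    cells = [(t, p) for tr, pr in zip(truth, pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in cells) / len(cells)

def black_f1(truth, pred):
    """F1 on the black cells only. On a sparse grid, overall accuracy
    can look decent even when most black squares are missed."""
    tp = sum(t and p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    fp = sum((not t) and p for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    fn = sum(t and (not p) for tr, pr in zip(truth, pred) for t, p in zip(tr, pr))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = [[1, 0, 1], [0, 0, 0], [1, 1, 0]]
pred  = [[1, 0, 0], [0, 0, 0], [1, 0, 0]]   # model missed two black squares
print(cell_accuracy(truth, pred))  # ~0.78 overall...
print(black_f1(truth, pred))       # ...but only ~0.67 on the black squares
```

The gap between the two scores in the toy example mirrors the paper's finding: the block-image accuracy (60–70%) overstates how well the models actually localize the black squares (30–40%).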
The Analogy:
Imagine you are trying to find a specific person in a crowd.
- Scenario A (Text): Everyone is wearing a nametag with their name written in big letters. You can read the names and find the person instantly.
- Scenario B (Blocks): Everyone is wearing a solid black mask. You can see a crowd of black shapes, but you can't distinguish one person from another or find the specific one you need.
The AI models are like people who are super-fast at reading nametags but are terrible at spotting faces in a crowd of masks.
Why Does This Happen?
The paper suggests that these AIs have a "cheat code" they use when they see text.
- The "OCR" Superpower: When the AI sees the # symbol, it doesn't just see a black shape. It recognizes it as a "hash sign." Because it knows this is a character, it uses its reading brain (which is very good at knowing where letters are on a page) to figure out the location. It's essentially doing an internal "Optical Character Recognition" (reading the image as text) to solve a spatial puzzle.
- The "Visual" Weakness: When the AI sees a solid black square, it can't use its reading brain. It has to rely on its seeing brain (pure visual processing). The paper shows that the "seeing brain" of current AIs is actually quite fuzzy when it comes to pinpointing exact locations. It can tell you "there's a black blob over there," but it can't tell you "that is exactly the square at row 2, column 4."
The "Ghost" in the Machine
The researchers found that each AI failed in its own unique, weird way when looking at the solid blocks:
- Claude was an under-counter. It saw the black area but thought, "That's too many squares," so it skipped some.
- ChatGPT was an over-counter. It got excited and hallucinated extra black squares that didn't exist, making the blobs look bigger than they were.
- Gemini was a pattern faker. When the grid got too crowded, it just gave up and drew a random "plus sign" or "L-shape" pattern that looked nothing like the real grid. It was essentially guessing based on what it thought a grid should look like.
The "Magic Label" Experiment
To prove their theory, the researchers tried a middle ground. They took the solid black squares and wrote a tiny "1" inside the black ones and a "0" inside the white ones.
- For Claude and Gemini: This worked like magic! The moment they saw the numbers, their performance skyrocketed back to near-perfect. The text "anchored" their vision.
- For ChatGPT: It got confused. The tiny text inside the black squares actually made it worse. It seems ChatGPT's "reading brain" and "seeing brain" got tangled up and tripped over each other.
The Big Takeaway
This paper teaches us a humbling lesson about modern AI:
These models aren't "seeing" the world the way humans do. They are mostly "reading" the world. If you give them a picture that looks like a document or a spreadsheet, they are geniuses. But if you give them a picture that is just shapes, colors, and blobs without any text, their spatial reasoning falls apart.
It's like having a librarian who can find any book on a shelf if the spines have titles, but who gets completely lost if the books are all wrapped in plain brown paper. Until we fix this, we can't fully trust these AIs to navigate the visual world unless we give them text labels to hold onto.