TopoBench: Benchmarking LLMs on Hard Topological Reasoning

This paper introduces TopoBench, a benchmark for evaluating large language models on topological grid puzzles, revealing that their poor performance stems primarily from difficulties in extracting and maintaining spatial constraints rather than inherent reasoning limitations.

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O'Connor, Fergal Reid

Published 2026-03-13

Imagine you have a brilliant new student, let's call him "AI," who has read every book in the library. He can solve complex math problems, write poetry, and debate philosophy. But when you hand him a simple grid puzzle—like a maze where he has to draw a single loop without crossing lines, or connect islands with bridges without making a loop—he gets stuck. He stares at the grid, writes a long, confident essay about how he thinks he's solving it, and then draws a solution that breaks the rules.

This paper, TopoBench, is like a specialized gym designed to test exactly how good AI is at these "spatial logic" puzzles. The researchers wanted to find out: Is the AI bad at thinking, or is it just bad at reading the map?

Here is the breakdown of their findings, using some everyday analogies.

1. The Test: TopoBench

The researchers created a new test called TopoBench. Think of it as a "driver's license exam" for AI, but instead of driving a car, the AI has to navigate six different types of grid puzzles (like Flow Free, Bridges, and Loopy).

  • The Goal: The AI must maintain "global rules." For example, in a bridge puzzle, if you connect two islands, you can't accidentally create a closed loop. The AI has to remember the entire board state while making one small move at a time.
  • The Result: Even the strongest models (like GPT-5 and DeepSeek) failed badly on the hard levels, solving fewer than 25% of the hardest puzzles. It's like a genius who can write a novel but can't tie their own shoelaces.
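To see why a rule like "don't close a loop" is hard, note that it is a global constraint: checking it needs the whole board history, not just the latest move. As a rough sketch (this is a standard union-find cycle check, not the paper's code), a solver tracking the bridge rule might look like this:

```python
class UnionFind:
    """Track which islands are already connected.

    Adding a bridge between two islands that are already in the same
    connected component would close a loop, which the puzzle forbids.
    """

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_bridge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already connected: this bridge would close a loop
        self.parent[ra] = rb
        return True

uf = UnionFind(4)
print(uf.add_bridge(0, 1))  # True
print(uf.add_bridge(1, 2))  # True
print(uf.add_bridge(2, 0))  # False: 0 and 2 are already connected via 1
```

A human solver does this bookkeeping almost unconsciously; the point of the benchmark is that a language model has to carry the equivalent of this data structure in plain text, one token at a time.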

2. The Diagnosis: Why did they fail?

The researchers didn't just look at the wrong answers; they looked at the AI's "thinking process" (called Chain of Thought). They acted like detectives, annotating 750 failed attempts to see where the AI went wrong.

They found four main suspects, but only two of them, premature commitment and constraint forgetting, turned out to be the real culprits:

  • The "Overconfident Beginner" (Premature Commitment): The AI picks a wrong path early on (like taking a left turn when it should have gone right) and then stubbornly keeps walking down that dead end for 20 steps, refusing to admit it's wrong.
  • The "Forgetful Architect" (Constraint Forgetting): The AI makes a move that breaks the rules (like building a bridge over water where it's not allowed) but doesn't realize it. It keeps building its house on a foundation that doesn't exist.
  • The "Stuck Record" (Repeated Reasoning): The AI gets stuck in a loop, saying the same thing over and over. The researchers found this was actually a symptom of the AI being lost, not the cause of the failure.
  • The "Map Reader" (State Tracking): The AI loses track of where it is on the board.

The Big Surprise: The most common mistake (getting stuck in a loop) wasn't the most dangerous. The rarest mistake (forgetting the rules) was the most deadly. It's like a driver who talks to themselves a lot (harmless) vs. a driver who forgets to look at the stop sign (catastrophic).

3. The Cure: How to fix the AI

The researchers tried three different "therapies" to help the AI.

Therapy A: Change the Language (Input Format)

The puzzles were originally written as ASCII art (text characters like | and -). This is like trying to read a blueprint where the lines are drawn with messy handwriting.

  • The Fix: They switched to a clean, structured format (like a spreadsheet or JSON list).
  • The Result: This helped a lot! It's like giving the AI a clear, digital map instead of a scribbled napkin. The AI suddenly understood the grid much better.
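As an illustration of the difference (the field names below are invented for this sketch, not the benchmark's actual schema), here is the same tiny Bridges board written both ways:

```python
import json

# The same tiny Bridges board, two ways.
# ASCII art: the model must parse characters and infer positions.
ascii_board = (
    "2 - 3\n"
    "    |\n"
    "    1\n"
)

# Structured format: positions and constraints are explicit.
structured_board = {
    "islands": [
        {"id": "A", "row": 0, "col": 0, "required_bridges": 2},
        {"id": "B", "row": 0, "col": 2, "required_bridges": 3},
        {"id": "C", "row": 2, "col": 2, "required_bridges": 1},
    ],
    "bridges": [
        {"between": ["A", "B"], "count": 1},
        {"between": ["B", "C"], "count": 1},
    ],
}

print(json.dumps(structured_board, indent=2))
```

In the ASCII version, "which island does this `|` belong to?" is something the model must work out from character positions; in the structured version, every constraint is already spelled out as data.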

Therapy B: The "Cheat Sheet" (Tool Augmentation)

This was the most interesting part. They gave the AI a "tool" that acted like an external calculator.

  • The Setup: Instead of the AI trying to remember the whole board in its head, they let the AI ask a tool: "Hey, how many bridges does this island need right now?" or "Is this move legal?"
  • The Twist: They tested two types of tools:
    1. The Visual Tool: The tool showed the AI a picture of the board (ASCII art).
    2. The Data Tool: The tool gave the AI a simple list of numbers (e.g., "Island A needs 2 bridges").
  • The Result: The Visual Tool actually made the AI worse. The AI got distracted by the picture. The Data Tool made the AI much better.
  • The Lesson: The AI doesn't need to "see" the puzzle to solve it; it needs to understand the rules. The bottleneck isn't that the AI can't reason; it's that the AI is terrible at translating a picture into a set of rules. Once the rules are handed to it in plain English (or numbers), the AI is a genius.
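A rough sketch of what such a "data tool" interface could look like (the function names and state layout here are hypothetical, not the paper's API): instead of re-deriving the board from text, the model asks small factual questions and gets numbers back.

```python
# Hypothetical board state: bridges built so far vs. each island's target.
state = {
    "required": {"A": 2, "B": 3, "C": 1},
    "built": {"A": 1, "B": 2, "C": 1},
}

def remaining_bridges(state, island):
    """Answer 'how many bridges does this island still need?' as a plain number."""
    return state["required"][island] - state["built"][island]

def is_move_legal(state, a, b):
    """A move is legal only if both endpoints can still take another bridge."""
    return remaining_bridges(state, a) > 0 and remaining_bridges(state, b) > 0

print(remaining_bridges(state, "B"))   # 1
print(is_move_legal(state, "A", "B"))  # True
print(is_move_legal(state, "B", "C"))  # False: island C is already full
```

The design point is that the tool returns facts, not pictures: a number like `1` leaves nothing to misread, whereas an ASCII rendering of the same state reintroduces exactly the parsing problem the tool was meant to remove.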

Therapy C: Better Instructions (Prompts)

They tried telling the AI, "Please plan ahead!" or "Don't commit too early!"

  • The Result: It didn't work. The AI ignored the advice. It's like telling a toddler, "Don't eat the cookie," while holding a cookie in front of them. The AI's internal "thinking engine" overpowers the simple instructions.

The Final Verdict

The paper concludes that Large Language Models are not bad at logic; they are bad at reading the map.

Imagine a brilliant detective who can solve a murder mystery if you give them a list of facts. But if you give them a messy crime scene photo and ask them to "figure it out," they get confused by the visual noise.

The AI's problem isn't that it can't do the math or the logic. The problem is that it struggles to extract the rules from the visual grid. Once you translate the grid into a clean list of constraints (like "Island A needs 2 bridges"), the AI solves the puzzle far more reliably.

In short: We don't need smarter AIs; we need better ways to translate pictures into instructions for them.