TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?

The paper introduces TaoBench, a benchmark of problems drawn from Terence Tao's *Analysis I* that evaluates automated theorem provers on bespoke mathematical constructions. Provers show a 26-percentage-point performance drop relative to equivalent MathLib problems, suggesting that current systems' primary limitation is an inability to generalize across definitional frameworks rather than the inherent difficulty of the tasks.

Alexander K Taylor, Junyi Zhang, Ethan Ji, Vigyan Sahai, Haikang Deng, Yuanzhou Chen, Yifan Yuan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Amit Sahai, Wei Wang

Published 2026-03-16

The Big Idea: The "Specialized Chef" Problem

Imagine you have trained a world-class chef (an AI) to cook amazing meals, but you've only ever taught them using one specific cookbook (called MathLib). This cookbook has very specific rules: it calls a tomato a "red fruit," uses a specific type of knife, and organizes recipes by color.

Because the chef has practiced with this book for so long, they can cook a perfect tomato salad in seconds. They are a genius at following this book.

The Problem:
Now, imagine you ask this chef to cook the exact same salad, but this time you give them a different cookbook (written by the famous mathematician Terence Tao). In this new book:

  • A tomato is called a "red berry."
  • The knife is held differently.
  • The recipes are organized by the time of day.

Even though the ingredients and the final dish are exactly the same, the chef freezes. They get confused by the new names and the new organization. They can't figure out how to start because they are so used to the old rules.

The Paper's Discovery:
The researchers built a test called TaoBench to demonstrate this. They took 150 math problems from Terence Tao's textbook, which uses a "from-scratch" way of defining math (building concepts like numbers and sets from the ground up).

They then created a "translation" of these same problems into the standard MathLib language that the AI chefs are used to.
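To make the "dialect" difference concrete, here is an illustrative Lean sketch (invented for this summary, not taken from the benchmark). The same kind of fact can be stated against Lean's built-in naturals, which MathLib-trained provers know intimately, or against a from-scratch, Tao-style definition that is mathematically equivalent but definitionally unfamiliar:

```lean
-- Standard dialect: ℕ and its lemmas are already defined and well known.
theorem std_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- Tao-style dialect: the naturals are rebuilt from the ground up,
-- so even elementary facts live in an unfamiliar framework.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

def MyNat.add : MyNat → MyNat → MyNat
  | m, .zero   => m
  | m, .succ n => .succ (m.add n)

-- True by definition here, but a prover must work from these
-- bespoke definitions rather than from memorized library lemmas.
theorem MyNat.add_zero (m : MyNat) : m.add .zero = m := rfl
```

The point is that `std_add_comm` and `MyNat.add_zero` express the same elementary arithmetic, yet a prover that has only ever seen the standard names and lemmas has no foothold in the second framework.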

The Result:

  • On the standard MathLib problems: The AI chefs solved about 70% of them. They were in their element.
  • On the Tao problems (the new language): The same AI chefs solved only about 44% of them.

The Takeaway:
The AI isn't bad at math. It's just bad at adapting. It has memorized the "dialect" of the standard library so well that it can't speak the "dialect" of a new, but mathematically identical, framework.


How They Did It: The "Agentic Pipeline"

Building this test was hard. You can't just copy-paste a math problem from a textbook into a computer; the computer needs a whole "kitchen" (definitions, tools, and rules) to understand it.

The researchers built a robot team (an Agentic Pipeline) to do the heavy lifting:

  1. The Librarian: It went into the massive textbook and found exactly which definitions a specific problem needed, ignoring everything else.
  2. The Translator: It tried to rewrite the "Tao" version of the problem into the "MathLib" version.
  3. The Editor: It checked if the translation was actually correct mathematically. If the robot changed the meaning of the problem while translating, the Editor threw it out and tried again.

This ensured that when the AI failed on the Tao version, it wasn't because the problem was harder, but purely because the language was different.


Why This Matters: The "Real World" Gap

Why should we care if an AI can't switch cookbooks?

In the real world, mathematics is exploratory. When mathematicians discover something new, they often have to invent their own definitions and rules because the standard "cookbooks" don't have them yet.

  • Current AI: Like a chef who can only cook if you give them the exact same cookbook they trained on. If you ask them to invent a new recipe or use a new ingredient, they fail.
  • The Goal: We want AI that can be a research partner. We want an AI that can look at a new, weird definition, understand it, and help prove theorems, even if it's never seen that specific "dialect" before.

The Conclusion:
The paper shows that current "State-of-the-Art" AI theorem provers are actually quite fragile. They are over-specialized. They are great at solving puzzles in a specific room (MathLib), but if you move the furniture slightly (change the definitions), they get lost.

TaoBench is a new gym for these AIs. It forces them to learn how to be flexible, so they can eventually help humans do real, cutting-edge research where the rules haven't been written down yet.
