TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?

The paper introduces TaoBench, a benchmark of problems drawn from Terence Tao's *Analysis I* that evaluates automated theorem provers on bespoke mathematical constructions. Provers show a 26-percentage-point performance drop relative to equivalent MathLib problems, suggesting that current systems' primary limitation is an inability to generalize across definitional frameworks rather than the inherent difficulty of the tasks.

Alexander K Taylor, Junyi Zhang, Ethan Ji, Vigyan Sahai, Haikang Deng, Yuanzhou Chen, Yifan Yuan, Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng, Amit Sahai, Wei Wang

Published 2026-03-16

The Big Idea: The "Specialized Chef" Problem

Imagine you have trained a world-class chef (an AI) to cook amazing meals, but you've only ever taught them using one specific cookbook (called MathLib). This cookbook has very specific rules: it calls a tomato a "red fruit," uses a specific type of knife, and organizes recipes by color.

Because the chef has practiced with this book for so long, they can cook a perfect tomato salad in seconds. They are a genius at following this book.

The Problem:
Now, imagine you ask this chef to cook the exact same salad, but this time you give them a different cookbook (written by the famous mathematician Terence Tao). In this new book:

  • A tomato is called a "red berry."
  • The knife is held differently.
  • The recipes are organized by the time of day.

Even though the ingredients and the final dish are exactly the same, the chef freezes. They get confused by the new names and the new organization. They can't figure out how to start because they are so used to the old rules.

The Paper's Discovery:
The researchers built a test called TaoBench to demonstrate this. They took 150 math problems from Terence Tao's textbook, which uses a "from-scratch" way of defining math (building concepts like numbers and sets from the ground up).

They then created a "translation" of these same problems into the standard MathLib language that the AI chefs are used to.
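To make the "dialect" difference concrete, here is an illustrative Lean sketch (invented for this summary, not taken from the benchmark). The same kind of fact can be stated against Lean's built-in naturals, which MathLib-trained provers know intimately, or against a from-scratch, Tao-style definition that is mathematically equivalent but definitionally unfamiliar:

```lean
-- Standard dialect: ℕ and its lemmas are already defined and well known.
theorem std_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- Tao-style dialect: the naturals are rebuilt from the ground up,
-- so even elementary facts live in an unfamiliar framework.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

def MyNat.add : MyNat → MyNat → MyNat
  | m, .zero   => m
  | m, .succ n => .succ (m.add n)

-- True by definition here, but a prover must work from these
-- bespoke definitions rather than from memorized library lemmas.
theorem MyNat.add_zero (m : MyNat) : m.add .zero = m := rfl
```

The point is that `std_add_comm` and `MyNat.add_zero` express the same elementary arithmetic, yet a prover that has only ever seen the standard names and lemmas has no foothold in the second framework.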

The Result:

  • On the standard MathLib problems: The AI chefs solved about 70% of them. They were in their element.
  • On the Tao problems (the new language): The same AI chefs solved only about 44% of them.

The Takeaway:
The AI isn't bad at math. It's just bad at adapting. It has memorized the "dialect" of the standard library so well that it can't speak the "dialect" of a new, but mathematically identical, framework.


How They Did It: The "Agentic Pipeline"

Building this test was hard. You can't just copy-paste a math problem from a textbook into a computer; the computer needs a whole "kitchen" (definitions, tools, and rules) to understand it.

The researchers built a robot team (an Agentic Pipeline) to do the heavy lifting:

  1. The Librarian: It went into the massive textbook and found exactly which definitions a specific problem needed, ignoring everything else.
  2. The Translator: It tried to rewrite the "Tao" version of the problem into the "MathLib" version.
  3. The Editor: It checked if the translation was actually correct mathematically. If the robot changed the meaning of the problem while translating, the Editor threw it out and tried again.

This ensured that when the AI failed on the Tao version, it wasn't because the problem was harder, but purely because the language was different.


Why This Matters: The "Real World" Gap

Why should we care if an AI can't switch cookbooks?

In the real world, mathematics is exploratory. When mathematicians discover something new, they often have to invent their own definitions and rules because the standard "cookbooks" don't have them yet.

  • Current AI: Like a chef who can only cook if you give them the exact same cookbook they trained on. If you ask them to invent a new recipe or use a new ingredient, they fail.
  • The Goal: We want AI that can be a research partner. We want an AI that can look at a new, weird definition, understand it, and help prove theorems, even if it's never seen that specific "dialect" before.

The Conclusion:
The paper shows that current "State-of-the-Art" AI theorem provers are actually quite fragile. They are over-specialized. They are great at solving puzzles in a specific room (MathLib), but if you move the furniture slightly (change the definitions), they get lost.

TaoBench is a new gym for these AIs. It forces them to learn how to be flexible, so they can eventually help humans do real, cutting-edge research where the rules haven't been written down yet.
