A Minimal Agent for Automated Theorem Proving

Imagine you are trying to solve an incredibly difficult math puzzle, but instead of a human collaborator, you have an AI assistant. The goal is to get this AI to write a complete, step-by-step proof that a computer can verify as 100% correct.

This paper introduces a new, surprisingly simple way to build that AI assistant, which the authors call AxProverBase.

Here is the breakdown using a simple analogy: The "Architect, Inspector, and Librarian" Team.

The Problem: The "One-Shot" Trap

Most advanced AI theorem provers are like massive, over-engineered factories. They use complex reinforcement learning, huge databases, and thousands of attempts to solve a single problem. They are expensive, hard to update, and often break when the rules of the game (the programming language) change slightly.

The authors asked: Do we really need a factory? Or can we just build a smart, efficient workshop?

The Solution: A Simple Three-Step Loop

The authors built a "minimal agent" that works like a small, highly effective team of three people working in a loop.

1. The Architect (The Proposer)

This is the AI that tries to write the proof.

How it works: It looks at the math problem and tries to write the code to solve it.
The Twist: It doesn't just guess once. It's allowed to try, fail, learn, and try again.

2. The Inspector (The Reviewer & Compiler)

This is the computer system that checks the Architect's work.

The Compiler: It tries to "build" the code. If the code has a typo or a logic error, the compiler says, "This doesn't work. Here is the specific error message."
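The Reviewer: This is a second AI, not a person. Sometimes code compiles but is still cheating (for example, by leaving a placeholder that amounts to "I'll fix this later" instead of actually solving the problem). The Reviewer checks that the proof is honest and complete, and that the problem statement itself was not quietly changed.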
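
To make the placeholder case concrete, here is a small Lean 4 illustration. The theorem names are made up for this example and do not come from the paper; `sorry` and `Nat.add_zero` are standard Lean 4.

```lean
-- Lean accepts this declaration, but only with a warning that it uses `sorry`.
-- The goal is left unproven, so this must not count as a finished proof.
theorem add_zero_placeholder (n : Nat) : n + 0 = n := by
  sorry

-- A complete proof of the same statement, using the core lemma `Nat.add_zero`.
theorem add_zero_done (n : Nat) : n + 0 = n :=
  Nat.add_zero n
```

The first version is the kind of output the Inspector has to reject, even though it compiles.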

3. The Librarian (The Memory & Tools)

This is the part that keeps the team from going in circles. In the past, if an AI failed, it would just forget and try again, often making the same mistake.

The Notebook (Memory): The team keeps a "lab notebook." If the Architect fails, the Librarian writes down why it failed and what was learned. When the Architect tries again, it reads the notebook first. It says, "Oh, I tried that before, and it didn't work because I assumed the multiplication was commutative. I need to try a different approach."
The Search Tools: If the Architect is stuck, the Librarian can quickly search a massive library of known math facts (called Mathlib) or even the internet to find a clue.

The "Aha!" Moments from the Research

The paper tested this simple team against the most complex, expensive AI systems currently in existence. Here is what they found:

Iterative Refinement is King: The biggest factor in success wasn't having a "smarter" AI model; it was the ability to try, fail, learn, and try again. It's like the difference between a student who takes a test once and fails, versus a student who takes the test, gets the answers back, studies the mistakes, and takes it again until they get an A.
Memory Prevents Spinning Wheels: Without the "Lab Notebook" (memory), the AI would get stuck in a loop, making the same mistake over and over. The memory system stopped this, saving time and money.
Tools are Nice, but Not Magic: Having a search engine (to look up math facts) helped, but it wasn't as important as the ability to iterate and remember past mistakes.
Simplicity Wins: This simple, open-source system was competitive with (and on some benchmarks better than) the massive, complex systems, at a tiny fraction of the cost. It's like driving a reliable, fuel-efficient sedan that gets you to most destinations about as fast as a limousine while costing far less to run.

Why Does This Matter?

Currently, using AI to prove math theorems is like trying to launch a rocket: it requires a huge team, millions of dollars, and specialized infrastructure.

This paper shows that you can build a reliable, affordable, and easy-to-use system that anyone can run. Because the system is so simple, it can easily adapt when the math software updates. It also means that as AI models get smarter in the future, this simple "team" will automatically get better without needing to be rebuilt.

In short: The authors showed that you don't need a super-complex system to solve hard math problems. You just need a smart AI that is allowed to make mistakes, learn from them, and keep a good notebook.

1. Problem Statement

Automated Theorem Proving (ATP) using Large Language Models (LLMs) has seen rapid progress, particularly in formalizing mathematics in the Lean 4 proof assistant. However, current state-of-the-art (SOTA) systems face significant barriers to practical adoption:

Complexity: Many systems rely on intricate architectures involving massive synthetic datasets, reinforcement learning (RL), complex tree-search algorithms, or multi-stage decomposition pipelines.
Fragility: These systems are often tightly coupled with specific versions of Lean and its library (Mathlib), requiring extensive retraining or fine-tuning when the underlying tools update.
Cost & Efficiency: High-performance provers often require massive computational budgets (e.g., thousands of passes per theorem) or expensive proprietary models.
Evaluation Ambiguity: It is difficult to distinguish whether performance gains come from architectural innovations or simply from using a more capable underlying LLM.

The authors propose a minimal agentic baseline to systematically evaluate the core components driving ATP success, aiming to create a simple, cost-effective, and adaptable prover that competes with complex SOTA systems.

2. Methodology: AxProverBase

The authors introduce AxProverBase, a modular, minimal agent architecture designed to isolate and test the three primary drivers of success in modern ATP: Iterative Refinement, Memory, and Tool Use.

Architecture Components

Proposer Agent:
- An LLM tasked with writing Lean 4 code to prove a target theorem.
- It operates in a ReAct (Reasoning + Acting) style, capable of making parallel tool calls before generating a proof.
- Tools:
  - Library Search: A custom deployment of LeanSearch to query Mathlib for relevant lemmas and tactics.
  - Web Search: Integration with Tavily to find proof strategies (used to address the challenge of writing compilable code rather than just informal reasoning).
Review System:
- Compiler: Programmatically compiles the proposed Lean code. If it fails, it returns specific error messages. If it compiles but contains sorry (placeholders), it extracts the remaining goals.
- Reviewer Agent: An LLM that verifies the proof integrity. It checks that the theorem statement was not altered and ensures no "tricks" (like incomplete proofs that compile due to apply? tactics) are used.
Memory System:
- Provides context from previous failed attempts to prevent the agent from repeating mistakes (entering "cycles").
- Implementations compared:
  - No Memory: No context from past attempts.
  - History: Feeds the last $n$ attempts (reasoning, code, feedback) directly.
  - Self-Managed Context: The agent maintains a "lab notebook" (a condensed summary of lessons learned) which is updated after every iteration. This was found to be the most efficient approach.
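
The three memory variants above can be sketched as interchangeable classes with the same two methods. This is a minimal illustration, not the authors' implementation: the class names, the `Attempt` record, and the `keep_unique_feedback` summarizer are all assumptions made for this example. In the real system the notebook is rewritten by the LLM itself; here a trivial stand-in records each distinct feedback line once.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Attempt:
    """One proof attempt: the proposer's reasoning, its Lean code, and the feedback."""
    reasoning: str
    code: str
    feedback: str


class NoMemory:
    """Baseline: the proposer sees nothing from earlier attempts."""

    def update(self, attempt):
        pass

    def context(self):
        return ""


class History:
    """Feeds the last n attempts back verbatim; older ones are dropped."""

    def __init__(self, n):
        self.attempts = deque(maxlen=n)

    def update(self, attempt):
        self.attempts.append(attempt)

    def context(self):
        return "\n\n".join(
            f"Reasoning: {a.reasoning}\nCode:\n{a.code}\nFeedback: {a.feedback}"
            for a in self.attempts
        )


class Notebook:
    """Self-managed context: a condensed summary rewritten after every iteration."""

    def __init__(self, summarize):
        # Hypothetical callback: (old_notes, attempt) -> new_notes.
        self.summarize = summarize
        self.notes = ""

    def update(self, attempt):
        self.notes = self.summarize(self.notes, attempt)

    def context(self):
        return self.notes


def keep_unique_feedback(notes, attempt):
    """Stand-in for an LLM summarizer; assumes single-line feedback strings."""
    lines = notes.splitlines()
    if attempt.feedback not in lines:
        lines.append(attempt.feedback)
    return "\n".join(lines)


if __name__ == "__main__":
    memory = Notebook(summarize=keep_unique_feedback)
    for _ in range(3):
        memory.update(Attempt("try ring", "by ring", "ring failed on this goal"))
    print(memory.context())  # the repeated failure is recorded only once
```

The point of the design is visible in the last few lines: raw history grows with every attempt kept, while the notebook stays short when the same failure recurs.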

Iterative Loop

The agent operates in a loop:

Proposer generates a proof attempt (potentially using tools).
Compiler checks for errors/sorry.
Reviewer validates logical integrity.
Feedback (errors, goals, or validation) is processed by the Memory module.
Proposer refines the proof based on feedback and memory context.
The cycle repeats until the proof is complete or the iteration budget is exhausted.
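
The loop above can be sketched in a few lines of Python. Everything here is a hedged illustration under stated assumptions: the function `prove`, the `CompileResult` record, and the four `toy_*` stubs are invented for this example and stand in for the LLM proposer, the Lean compiler, the LLM reviewer, and the memory update. The default of 50 iterations mirrors the budget reported for the benchmark runs.

```python
from dataclasses import dataclass


@dataclass
class CompileResult:
    ok: bool              # True if the Lean file was accepted
    errors: str = ""      # compiler error messages, if any
    open_goals: str = ""  # goals left behind by `sorry`, if any


def prove(theorem, propose, compile_lean, review, summarize, max_iterations=50):
    """Run the propose / compile / review / remember loop.

    Returns the accepted Lean code, or None if the budget runs out.
    """
    notes = ""     # memory context (the "lab notebook")
    feedback = ""  # feedback from the most recent attempt
    for _ in range(max_iterations):
        code = propose(theorem, feedback, notes)
        result = compile_lean(code)
        if not result.ok:
            feedback = result.errors
        elif result.open_goals:
            feedback = "Remaining goals: " + result.open_goals
        else:
            objection = review(theorem, code)  # None means "accepted"
            if objection is None:
                return code
            feedback = objection
        notes = summarize(notes, code, feedback)
    return None


# Toy stand-ins so the sketch runs without an LLM or a Lean installation.
THEOREM = "theorem t (n : Nat) : n + 0 = n"


def toy_propose(theorem, feedback, notes):
    # First attempt leaves a placeholder; later attempts use a core lemma.
    body = "Nat.add_zero n" if feedback else "by sorry"
    return f"{theorem} := {body}"


def toy_compile(code):
    if "sorry" in code:
        return CompileResult(ok=True, open_goals="n + 0 = n")
    return CompileResult(ok=True)


def toy_review(theorem, code):
    # Reject attempts that changed the statement being proved.
    return None if code.startswith(theorem) else "theorem statement was altered"


def toy_summarize(notes, code, feedback):
    return (notes + "\n" + feedback).strip()


if __name__ == "__main__":
    print(prove(THEOREM, toy_propose, toy_compile, toy_review, toy_summarize))
```

With these stubs the first attempt is sent back because of the `sorry`, and the second is accepted. With a budget of a single iteration the function returns `None`.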

3. Key Contributions & Findings

The paper presents a series of ablation studies and benchmarks to quantify the impact of each component.

A. Impact of Components (Ablation Studies)

Iterative Refinement is Critical: The single most significant factor for performance is the ability to refine proofs iteratively based on compiler feedback. A simple iterative approach with feedback alone outperformed many complex SOTA systems that rely on single-shot generation.
Memory Prevents Cycles: Without memory, agents often repeat the same errors. A self-managed context (summarizing lessons learned) proved superior to feeding raw history. It improved performance by ~7% while reducing costs by 20% compared to raw history, as it kept the context window manageable and focused.
Tools are Secondary but Helpful: Search tools (Mathlib and Web) improved performance but were far less impactful than the feedback loop and memory mechanisms. They primarily helped in identifying correct lemmas but did not solve the core logical reasoning gaps.
Model Capability + Scaffolding: "Smarter" models (e.g., Claude Opus 4.5, Gemini Pro) benefited disproportionately from the agentic framework. The framework acts as a force multiplier, allowing powerful models to leverage their reasoning capabilities more effectively than in single-shot modes.

B. Benchmark Results

The authors evaluated AxProverBase (using Claude Opus 4.5 with a 32k token thinking budget and 50 iterations) against SOTA provers on several datasets:

Dataset	AxProverBase Performance	Comparison to SOTA
PutnamBench	54.7% (pass@1)	Outperforms non-agentic provers; competitive with complex systems like Hilbert (70% pass@1840) and Seed-Prover (87.9% pass@1) but with significantly lower compute.
FATE-M (Abstract Algebra)	98.0%	Near-perfect performance on intermediate difficulty algebra problems.
FATE-H (Hard Algebra)	66.0%	Significantly outperforms previous attempts (which were near 0-3%).
FATE-X (Expert Algebra)	24.0%	First demonstration of solving a non-trivial portion of expert-level algebra problems.
LeanCat (Category Theory)	59.0%	Demonstrates applicability to advanced research-level formalization.

C. Cost and Efficiency

Cost: The average cost per sample was $12.60, significantly lower than systems requiring thousands of passes or massive compute clusters.
Speed: Execution time was an order of magnitude lower than the Hilbert prover.
Adaptability: Because the system does not rely on fine-tuning a specific model, it naturally adapts to new versions of Lean and Mathlib without retraining.

4. Significance and Future Directions

Democratization of ATP: AxProverBase demonstrates that high-performance theorem proving does not require massive, complex infrastructure. A simple, modular agent can achieve competitive results, making formal verification more accessible to researchers and practitioners.
Baseline for Research: The authors provide an open-source reference implementation to serve as a standard baseline. This allows the community to isolate the effects of new LLM capabilities from architectural complexity.
Shift in Paradigm: The results suggest that the field is shifting from "training specialized models" to "scaffolding general models." As frontier LLMs improve, the performance of this minimal agent will naturally rise without further training.
Future Work: The authors identify opportunities to improve specific modules, such as stronger verification (using tools like SafeVerify), better local context management for library search, and experimenting with specialized base models.

In conclusion, AxProverBase shows that a minimalist, iterative, and memory-aware agent can be competitive with state-of-the-art systems in automated theorem proving, offering a cost-effective and robust alternative to complex, resource-heavy pipelines.