Hilbert: Recursively Building Formal Proofs with Informal Reasoning

Imagine you are trying to solve an incredibly difficult math puzzle, like the kind found in the world's toughest high school or college competitions. You have two assistants to help you:

The "Big Thinker" (General AI): This assistant is brilliant at understanding the problem, explaining the logic in plain English, and coming up with a clever strategy. However, they are a bit messy. They might make small calculation errors, skip a step, or accidentally use a rule that doesn't apply. They are great at the idea but bad at the execution.
The "Strict Auditor" (Formal Prover): This assistant is a robot that speaks a very rigid, computer-readable language (Lean 4). They are perfect. If they say a proof is correct, it is 100% correct. But they are also very narrow-minded. If you give them a problem that is too complex or doesn't look exactly like a textbook example, they get stuck and give up immediately.

For a long time, researchers had to choose between the Big Thinker (who gets the right idea but fails the test) and the Strict Auditor (who passes the test but can't even start the hard problems).

Enter HILBERT.

HILBERT is a new system that acts like a Master Project Manager who knows how to get the best out of both assistants. It bridges the gap between "thinking" and "proving."

How HILBERT Works: The "Lego Tower" Analogy

Imagine you need to build a massive, 100-story Lego tower (the final proof).

The Old Way: You ask the Strict Auditor to build the whole tower at once. They look at the 100 stories, get overwhelmed, and say, "I can't do this."
The HILBERT Way: HILBERT breaks the problem down into tiny, manageable pieces.

Here is the step-by-step process HILBERT uses:

1. The Strategy Session (The Big Thinker)

HILBERT first asks the Big Thinker to look at the 100-story tower. The Big Thinker says, "Okay, we can't build this all at once. Let's break it down! We need a foundation, then a 10-story section, then another 10-story section..."
The Big Thinker writes a blueprint (a "proof sketch") in plain English, dividing the huge problem into smaller sub-problems.

2. The Retrieval (The Librarian)

Before trying to build, HILBERT sends a Librarian to the library (Mathlib, a giant database of known math facts). The Librarian finds specific rules and theorems that are perfect for building the foundation or the 10-story sections. This ensures the team isn't reinventing the wheel.

3. The Construction (The Strict Auditor)

Now, HILBERT takes one small section of the blueprint (say, "Build the first 10-story section") and hands it to the Strict Auditor.

If the Auditor succeeds: Great! That piece is done.
If the Auditor fails: The Auditor says, "I can't build this specific 10-story section."

4. The Recursive Loop (The "Divide and Conquer" Magic)

This is where HILBERT gets really smart. Instead of giving up, it goes back to the Big Thinker.

"Hey, the Auditor couldn't build this 10-story section. Can you break that down into even smaller pieces?"
The Big Thinker breaks the 10-story section into five 2-story sections.
HILBERT hands these tiny 2-story sections back to the Strict Auditor.
The Auditor, now dealing with tiny, simple tasks, can easily build them.

This process is recursive. If a 2-story section is still too hard, HILBERT breaks it down again into single bricks. It keeps digging deeper until the problem is so small that the Strict Auditor can solve it instantly.

5. The Assembly

Once all the tiny pieces (bricks, 2-story sections, 10-story sections) are built and verified as perfect, HILBERT snaps them all together. Because every single piece was checked by the Strict Auditor, the final 100-story tower is guaranteed to be mathematically perfect.

Why This is a Big Deal

It solves the "Hard Problem" gap: Before HILBERT, the best openly available programs could solve about 13% of the problems in PutnamBench, a benchmark drawn from one of the hardest college math competitions (Putnam). HILBERT solved 70%. It even beat some of the most expensive, secret AI systems used by big tech companies.
It's efficient: By breaking problems down, it doesn't waste energy trying to force the Strict Auditor to do impossible tasks. It uses the Big Thinker to simplify the task first.
It's self-correcting: If the Big Thinker makes a mistake in the blueprint, the Strict Auditor catches it immediately, and HILBERT asks the Big Thinker to fix the plan before moving on.

The Bottom Line

HILBERT is like a construction crew where the Architect (Big Thinker) designs the plan and breaks it down, and the Mason (Strict Auditor) lays every single brick with perfect precision. By working together in a loop, they can build structures that neither could build alone.

The result? A system that can generate mathematically perfect proofs for problems that were previously considered too difficult for computers to solve on their own.

1. Problem Statement

While Large Language Models (LLMs) have demonstrated impressive capabilities in informal mathematical reasoning (e.g., solving Olympiad problems), their solutions often contain hallucinations, logical fallacies, and calculation errors that cannot be automatically verified. Conversely, specialized Formal Theorem Provers (e.g., those trained on Lean 4) can generate mathematically rigorous, machine-verifiable proofs but struggle to solve complex problems that general-purpose LLMs can solve informally.

There exists a significant performance gap:

General Reasoners: Excel at decomposition and strategy but fail at full formal synthesis (e.g., ~49% pass rate on miniF2F with massive attempts).
Specialized Provers: Excel at syntax and tactics but lack the high-level reasoning to decompose complex theorems or recover from errors (e.g., ~13% pass rate on PutnamBench).

Current agentic frameworks often rely on shallow, single-layer decomposition or fail to effectively orchestrate the strengths of both informal reasoning and formal verification.

2. Methodology: The HILBERT Framework

HILBERT is a multi-agent framework designed to bridge this gap by orchestrating four key components:

Informal Reasoner: A general-purpose LLM (e.g., Gemini 2.5 Pro, gpt-oss-120b) capable of high-level mathematical reasoning and sketching.
Specialized Prover: A Lean 4-optimized LLM (e.g., Goedel-Prover-V2, DeepSeek-Prover-V2) for generating formal tactics.
Formal Verifier: A Lean 4 compiler/server (Kimina Lean Server) to check proof correctness.
Semantic Theorem Retriever: A search engine using sentence transformers to retrieve relevant theorems from the Mathlib library.
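
The retrieval component can be pictured with a toy example. The sketch below is not HILBERT's retriever: it swaps the sentence-transformer embeddings for plain bag-of-words vectors so that it runs with the standard library alone. The lemma names are real Mathlib names, but the descriptions, the `embed`, `cosine`, and `retrieve` helpers, and the tiny corpus are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus: Mathlib lemma names with hand-written descriptions.
CORPUS = {
    "add_comm": "addition is commutative",
    "mul_comm": "multiplication is commutative",
    "abs_nonneg": "the absolute value of a real number is nonnegative",
    "sq_nonneg": "the square of a real number is nonnegative",
}


def embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the names of the k entries most similar to the query."""
    q = embed(query)
    ranked = sorted(
        corpus,
        key=lambda name: cosine(q, embed(corpus[name])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(retrieve("square of a real number is nonnegative", CORPUS, k=1))
```

A real system would replace `embed` with a learned embedding model and index the whole library, but the ranking step keeps the same shape.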

Core Algorithm: Recursive Subgoal Decomposition

The system operates via a hierarchical, recursive process (Algorithms 1 and 2):

Direct Attempt: The system first attempts to prove the theorem directly using the Prover.
Subgoal Decomposition (if direct proof fails):
- Retrieval: The Reasoner generates search queries to retrieve relevant theorems from Mathlib.
- Sketch Generation: The Reasoner writes a detailed informal proof and converts it into a Lean 4 proof sketch. This sketch breaks the main theorem into smaller subgoals using have statements, initially filled with sorry placeholders.
- Extraction: The system extracts these subgoals as independent theorem statements.
- Assembly: The Reasoner assembles the final proof structure by linking the subgoals.
Subgoal Verification (Recursive Loop): For each extracted subgoal, the system attempts to solve it using a three-tiered strategy:
- Tier 1 (Direct Prover): Attempt to prove the subgoal directly with the Prover.
- Tier 2 (Shallow Solve): If the Prover fails, the Reasoner attempts a "shallow solve" (writing a short formal proof) augmented with retrieved theorems. This includes an error-correction loop where the Verifier's feedback guides the Reasoner to fix compilation errors.
- Tier 3 (Recursive Decomposition): If the subgoal remains unsolved, the system recursively decomposes the subgoal into smaller sub-subgoals (up to a maximum depth $D$).
Final Assembly: Once all subgoals are proven, the system stitches the proofs together to form a complete, verified proof of the original theorem. If the recursion depth is exhausted while a subgoal is still unproven, that attempt does not yield a complete proof.
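
The control flow of the steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1 or 2: the `Agents` container, the function names, and the toy string-based "goals" are all hypothetical, and retrieval, verification, and error correction are folded into the `shallow_solve` and `decompose` callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Agents:
    """Hypothetical bundle of the components, each modeled as a function."""
    prove: Callable[[str], Optional[str]]          # Prover: goal -> proof or None
    shallow_solve: Callable[[str], Optional[str]]  # Reasoner short proof, or None
    decompose: Callable[[str], List[str]]          # Reasoner sketch -> subgoals
    assemble: Callable[[str, List[str]], str]      # stitch subgoal proofs together


def solve_subgoal(goal, agents, depth, max_depth):
    """Three-tiered strategy for one subgoal."""
    proof = agents.prove(goal)                     # Tier 1: direct prover
    if proof is not None:
        return proof
    proof = agents.shallow_solve(goal)             # Tier 2: shallow solve
    if proof is not None:
        return proof
    if depth >= max_depth:                         # depth budget exhausted
        return None
    return decompose_and_solve(goal, agents, depth + 1, max_depth)  # Tier 3


def decompose_and_solve(goal, agents, depth, max_depth):
    """Split a goal into subgoals, solve each one, then assemble."""
    subgoals = agents.decompose(goal)
    if not subgoals:
        return None
    proofs = []
    for subgoal in subgoals:
        proof = solve_subgoal(subgoal, agents, depth, max_depth)
        if proof is None:
            return None                            # one failure sinks the attempt
        proofs.append(proof)
    return agents.assemble(goal, proofs)


def hilbert(theorem, agents, max_depth=3):
    """Direct attempt first, then recursive subgoal decomposition."""
    proof = agents.prove(theorem)
    if proof is not None:
        return proof
    return decompose_and_solve(theorem, agents, 1, max_depth)


# Toy agents: a "goal" is a string, and only single characters are provable.
toy = Agents(
    prove=lambda g: f"<{g}>" if len(g) <= 1 else None,
    shallow_solve=lambda g: None,
    decompose=lambda g: [g[: len(g) // 2], g[len(g) // 2:]],
    assemble=lambda g, proofs: "(" + " ".join(proofs) + ")",
)

if __name__ == "__main__":
    print(hilbert("abcd", toy, max_depth=3))
    print(hilbert("abcd", toy, max_depth=1))
```

With the toy agents, the goal "abcd" needs two layers of decomposition, so it succeeds at `max_depth=3` and fails at `max_depth=1`, which mirrors why a single-layer decomposition can fall short on harder problems.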

Key Technical Nuances:

Error Correction: The system leverages the Reasoner to interpret Lean compilation errors and suggest corrections, preventing the "silent failure" common in pure LLM approaches.
Type Safety: The prompts explicitly instruct the models to avoid natural number arithmetic pitfalls (e.g., division/subtraction on Nat types) by casting to appropriate types (ℤ, ℚ, ℝ).
Parallelization: The framework uses AsyncJobPool to parallelize attempts across subgoals and proof attempts, optimizing inference-time compute.
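
The natural-number pitfalls mentioned under Type Safety are easy to reproduce. The examples below are illustrative and not taken from the HILBERT prompts; they assume a Lean 4 project with Mathlib available.

```lean
import Mathlib

-- Subtraction on ℕ truncates at zero, so 2 - 3 evaluates to 0.
example : (2 - 3 : ℕ) = 0 := by decide

-- Division on ℕ rounds down, so 7 / 2 evaluates to 3.
example : (7 / 2 : ℕ) = 3 := by decide

-- After casting to ℤ, subtraction behaves as it does on paper.
example : ((2 : ℤ) - 3) = -1 := by norm_num

-- Over ℚ, division is exact, so multiplying back recovers 7.
example : ((7 : ℚ) / 2) * 2 = 7 := by norm_num
```

A statement that is true over the reals can therefore be false, or mean something different, when its variables are left as natural numbers, which is why casting early matters.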

3. Key Contributions

Bridging the Gap: HILBERT successfully combines the strategic reasoning of general LLMs with the rigorous verification of specialized provers, closing the performance gap between the two paradigms.
Recursive Decomposition: Unlike previous methods that perform single-layer decomposition, HILBERT employs a recursive strategy that breaks down difficult subgoals further, allowing it to tackle problems too complex for a single-step approach.
Retrieval-Augmented Generation: The integration of a semantic theorem retriever significantly reduces the search space for the Reasoner and Prover, improving both accuracy and efficiency.
State-of-the-Art Performance: The framework achieves unprecedented results on formal theorem proving benchmarks, outperforming both open-source and proprietary baselines.
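
To make the decomposition idea concrete, here is a small Lean 4 illustration of a proof sketch with `have` statements and `sorry` placeholders, followed by the same subgoals extracted as standalone theorems. The theorem and all names (`sketch_example`, `subgoal_1`, `subgoal_2`) are hypothetical and far simpler than the benchmark problems; Mathlib is assumed.

```lean
import Mathlib

-- Proof sketch: the main goal is reduced to two subgoals left as `sorry`.
theorem sketch_example (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ (a - b) ^ 2 := by sorry
  have h2 : (a - b) ^ 2 = a ^ 2 - 2 * a * b + b ^ 2 := by sorry
  -- Assembly step: the goal follows from h1 and h2 by linear arithmetic.
  linarith

-- Extracted subgoals, each stated and proved independently.
theorem subgoal_1 (a b : ℝ) : 0 ≤ (a - b) ^ 2 := by
  exact sq_nonneg (a - b)

theorem subgoal_2 (a b : ℝ) : (a - b) ^ 2 = a ^ 2 - 2 * a * b + b ^ 2 := by
  ring
```

Replacing each `sorry` with the corresponding subgoal proof yields a fully verified proof. If a subgoal were too hard to close directly, it would itself receive a sketch of this form.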

4. Experimental Results

miniF2F Benchmark (High School/Undergraduate Olympiad)

Performance: HILBERT achieved a 99.2% pass rate on the miniF2F test set (using Gemini 2.5 Pro + Goedel-Prover-V2-32B).
Comparison: This is 3.3 percentage points higher than the best publicly available method (Delta Prover at 95.9%) and just short of the proprietary SeedProver (99.6%); HILBERT's 99.2% is the highest publicly available result.
Efficiency: HILBERT achieved these results with fewer total LLM calls (approx. 11.3K) compared to Delta Prover's 16,384 attempts.

PutnamBench (Undergraduate Competition)

Performance: HILBERT solved 462 out of 660 problems (70.0%).
Comparison:
- Outperformed the proprietary SeedProver (50.4%) by nearly 20 percentage points.
- Achieved a 422% improvement over the best open-source baseline (Goedel-Prover-V2-32B with self-correction at ~13.4%).
- This represents the strongest known result from a publicly available model on this challenging dataset.
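
The PutnamBench figures above are consistent with one another, as a quick calculation shows. The helper names below are illustrative; the numbers are the ones quoted in this section.

```python
def pass_rate(solved, total):
    """Pass rate as a percentage."""
    return 100 * solved / total


def relative_improvement(new, old):
    """Relative improvement of new over old, as a percentage."""
    return 100 * (new - old) / old


hilbert_rate = pass_rate(462, 660)   # HILBERT on PutnamBench
seed_prover_rate = 50.4              # proprietary SeedProver
open_baseline_rate = 13.4            # Goedel-Prover-V2-32B with self-correction

gap_points = hilbert_rate - seed_prover_rate
improvement = relative_improvement(hilbert_rate, open_baseline_rate)

if __name__ == "__main__":
    print(f"pass rate: {hilbert_rate:.1f}%")
    print(f"gap over SeedProver: {gap_points:.1f} points")
    print(f"relative improvement over open baseline: {improvement:.0f}%")
```

Note the difference between the two comparisons: the SeedProver gap is measured in percentage points, while the 422% figure is a relative improvement over the 13.4% baseline.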

Ablation Studies

Recursive Depth: Performance increases monotonically with recursion depth ($D$). The full system reaches near-optimal performance at shallow depths ($D=3$), whereas disabling "shallow solve" requires much deeper recursion to match performance.
Retrieval: Enabling retrieval improved pass rates (e.g., 99.2% with retrieval vs 97.9% without, using Goedel Prover as the Prover) and significantly reduced inference-time compute (fewer calls and tokens) by surfacing relevant theorems early.

5. Significance and Impact

Scalability: HILBERT demonstrates that agentic frameworks can scale beyond the context limits of individual LLMs. By breaking problems into recursive subgoals, the system can generate proofs exceeding 15,000 lines of code (PutnamBench), a feat impossible for standard LLMs due to context window constraints.
Paradigm Shift: The paper suggests that the future of formal theorem proving lies not just in training larger "prover" models, but in orchestrating multi-agent systems that leverage the complementary strengths of general reasoning and specialized verification.
Virtuous Cycle: The authors propose that the high-quality proofs and reasoning traces generated by HILBERT can be used to train even better Prover and Reasoner models, creating a self-improving loop for mathematical AI.

In conclusion, HILBERT represents a major leap forward in automated theorem proving, solving complex mathematical problems that were previously out of reach for open-source models and surpassing proprietary systems on specific benchmarks.