Discovering New Theorems via LLMs with In-Context Proof… — Plain-Language Explanation

Original authors: Kazumi Kasaura, Naoto Onda, Yuta Oriike, Masaya Taniguchi, Akiyoshi Sannai, Sho Sonoda

Published 2026-05-07



Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a very intelligent but slightly forgetful robot to solve complex mathematical puzzles. The robot is a Large Language Model (LLM), and the puzzles are formal mathematical proofs written in a strict computer language called Lean.

This work introduces a new method for teaching this robot, called the Conjecturing-Proving Loop (CPL). Here is how it works, explained through simple analogies:

The Problem: The "Guess-and-Check" Trap

Normally, when people try to teach AI mathematics, they ask it to come up with a puzzle and solve it in the same step.

The Analogy: Imagine asking a student to "write a math problem and solve it right away."
The Problem: The student becomes lazy. They write easy problems (like "2 + 2 = 4") because these are easy to solve. They avoid difficult problems because they know they might fail. In the end, the AI generates thousands of simple, boring proofs and misses the difficult, interesting ones.

The Solution: The "Two-Step Dance" (CPL)

The authors split the process into two distinct roles: a Conjecturer (the idea generator) and a Prover (the solver).

The Conjecturer (The Architect): This part of the AI looks at a library of existing mathematical rules and develops new ideas (conjectures). It does not try to solve them yet; it simply writes them down.
The Prover (The Builder): This part takes the ideas and attempts to create a proof for them. If it fails, it tries again. It keeps trying until it either succeeds or runs out of attempts.
The Library (The Memory): Every time the Prover successfully creates a proof, that proof is added to the library.

The Magical Ingredient: Context Learning
Here comes the clever part: The Prover does not just look at the original mathematical rules. It looks at the library of proofs it has already successfully created during the current session.

The Analogy: Imagine a student taking an exam. In the old way, they had to rely only on what they memorized before the exam started. In this new way, every time the student correctly solves a problem, they are allowed to read their own solution before tackling the next problem. They learn the "tricks" and "strategies" from their own recent successes.

What They Found

The researchers tested this on some tricky concepts from topology (a branch of mathematics dealing with shapes and spaces) that the AI had not yet mastered.

Quantity vs. Quality: The old method (simultaneous guessing and solving) generated more theorems overall, but they were mostly short and simple. The new method (CPL) generated fewer theorems, but they were harder and had longer proofs.
The Big Win: The new method successfully discovered a specific, difficult theorem about "alpha-open sets" in 5 of its 20 runs. The old method never found it in any of its 20 runs.
Learning from Success: When the AI received its own library of previous proofs as a "cheat sheet" (context), it could prove difficult theorems that it could not solve without this context. Even though the AI judged the theorem false or gave incorrect proofs when asked in plain English, it could prove it in Lean code once it had seen similar successful proofs.

The Conclusion

The work claims that by separating "idea generation" from "proof solving" and by letting the AI learn from its own verified successes in real time, we can enable it to discover more difficult, complex mathematical truths that it would otherwise miss. It is like giving the AI a head start by allowing it to study its own homework before taking the final exam.

Note: The work focuses strictly on this method for generating and verifying mathematical theorems. It does not claim that this method works for medical diagnoses, financial forecasts, or other real-world applications outside of formal mathematics.

Technical Summary: Discovering New Theorems via LLMs with In-Context Proof Learning in Lean

Problem Statement
Large Language Models (LLMs) have shown promising results in formal theorem proving but face two significant challenges: they can hallucinate, and generating a conjecture and its proof in a single step tends to converge on trivial theorems. Existing approaches typically rely on Supervised Fine-Tuning (SFT) or Reinforcement Learning with Verifiable Rewards (RLVR), which require extensive training data and are difficult to apply to closed models. Furthermore, when a statement and its proof are generated together, the chance that a theorem appears in the output is weighted by how likely its proof is to succeed immediately, so the search collapses toward short, simple proofs and rarely reaches "hard-to-prove" theorems.

Methodology: The Conjecturing-Proving Loop (CPL)
The authors propose the Conjecturing-Proving Loop (CPL), a pipeline designed to automatically generate mathematical conjectures and verify them in Lean 4. The framework decouples conjecture generation from proof generation and utilizes a library of previously verified theorems as context for both stages.

The pipeline operates through four main components: a Conjecturer module (LLM Agent), a Prover module (LLM Agent), a Lean server, and a library (Lean code data).

Conjecture Phase: The Conjecturer generates new mathematical statements in Lean 4 format based on the current library. It queries the Lean server to ensure syntactic validity and novelty (ensuring the statement is not already provable by existing theorems in Mathlib4 or the current library).
Proof Phase: For each valid conjecture, the Prover attempts to construct a formal proof. Crucially, the Prover is provided with the library (containing previously verified theorems and proofs) as context. This enables the LLM to learn proof strategies through context-based learning (in-context learning) without retraining the model. The Prover iterates up to a maximum number of attempts (set to 16 in the experiments), utilizing Lean server error messages to refine its attempts.
Iteration: Verified pairs of conjectures and proofs are added to the library, which then serves as context for subsequent iterations.

This separation allows the system to allocate search resources based on proof difficulty. In contrast to a simple loop (SL), where a statement and a proof are generated simultaneously, CPL attempts multiple proofs for a single statement before discarding it. This shifts the distribution of generated theorems toward those that are provable but difficult, rather than those that are merely easy to prove.
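The control flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LLM agents and the Lean server are replaced by toy stand-ins (`toy_conjecturer`, `toy_prover`, `toy_is_valid_and_novel`, `toy_lean_check` are all hypothetical names, not the authors' code), and the rule that a larger library makes verification more likely is an invented placeholder for in-context learning. Only the cap of 16 proof attempts comes from the summary above.

```python
import random
from dataclasses import dataclass, field

MAX_PROOF_ATTEMPTS = 16  # attempt cap reported for the experiments


@dataclass
class Library:
    """Verified (statement, proof) pairs, reused as context in later iterations."""
    entries: list = field(default_factory=list)

    def statements(self):
        return [statement for statement, _ in self.entries]


def toy_conjecturer(library, rng):
    """Stand-in for the LLM Conjecturer: proposes a batch of statement ids."""
    return [f"conj_{rng.randrange(50)}" for _ in range(5)]


def toy_is_valid_and_novel(statement, library):
    """Stand-in for the Lean-server syntax and novelty check."""
    return statement not in library.statements()


def toy_prover(statement, library, last_error, rng):
    """Stand-in for the LLM Prover; a real one would read library and last_error."""
    return f"proof_of_{statement}_v{rng.randrange(1000)}"


def toy_lean_check(statement, proof, library, rng):
    """Stand-in for Lean verification; returns (ok, error_message).

    Toy assumption: success gets more likely as the library grows.
    """
    p_success = min(0.9, 0.05 + 0.02 * len(library.entries))
    ok = rng.random() < p_success
    return ok, (None if ok else "toy error: proof rejected")


def cpl(iterations, seed=0):
    """Run the conjecture -> prove -> extend-library loop with the toy stand-ins."""
    rng = random.Random(seed)
    library = Library()
    for _ in range(iterations):
        # Conjecture phase: propose statements, keep only valid and novel ones.
        for statement in toy_conjecturer(library, rng):
            if not toy_is_valid_and_novel(statement, library):
                continue
            # Proof phase: retry with the previous error message as feedback.
            error = None
            for _attempt in range(MAX_PROOF_ATTEMPTS):
                proof = toy_prover(statement, library, error, rng)
                ok, error = toy_lean_check(statement, proof, library, rng)
                if ok:
                    # Verified pairs become context for later iterations.
                    library.entries.append((statement, proof))
                    break
    return library


if __name__ == "__main__":
    print(len(cpl(iterations=10).entries), "verified theorems in the toy library")
```

The point of the sketch is the nesting: several proof attempts are spent on one fixed statement before it is dropped, whereas a simple loop would draw a fresh statement together with every proof.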

Main Contributions

Pipeline Proposal: The introduction of CPL, a framework that decouples conjecture generation from proof generation, enabling the discovery of longer, more complex proofs.
Context-Based Learning for Closed Models: Demonstration that closed LLMs (specifically ChatGPT-o3) can improve their proof capabilities through context-based learning from their own previously verified outputs, thereby eliminating the need for parameter updates or fine-tuning.
Theoretical and Empirical Validation: The work provides a theoretical model showing that CPL increases the probability of generating hard-to-prove theorems compared to simultaneous generation frameworks. Experimentally, it is confirmed that CPL successfully rediscovered a specific research-level theorem that the baseline framework could not find.
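To get a feel for why retrying helps hard theorems most, consider a deliberately simplified model, which is an assumption of this explanation and not the paper's theoretical model: each proof attempt for a given statement succeeds independently with probability `p`. A single-shot loop then proves the statement with probability `p`, while a loop allowing `k` attempts proves it with probability `1 - (1 - p)**k`.

```python
def p_single_shot(p):
    """Chance a statement is proved when it gets exactly one proof attempt."""
    return p


def p_with_retries(p, k=16):
    """Chance a statement is proved within k independent attempts (toy model)."""
    return 1 - (1 - p) ** k


if __name__ == "__main__":
    # Compare an easy, a medium, and a hard statement under the toy model.
    for p in (0.5, 0.05, 0.01):
        gain = p_with_retries(p) / p_single_shot(p)
        print(f"p={p}: single={p_single_shot(p):.3f}, "
              f"retries={p_with_retries(p):.3f}, gain={gain:.1f}x")
```

Under this toy model the relative gain from retrying is largest for statements with small `p`, which is the qualitative shift toward hard-to-prove theorems that the summary describes.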

Experimental Results
The authors evaluated CPL against a baseline with a simple loop (SL) using topological concepts (semi-open sets, $\alpha$-openness, and preopenness) defined within Mathlib but not yet incorporated into the library. The target was the theorem stating that the intersection of two $\alpha$-open sets is $\alpha$-open.
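For readers unfamiliar with these notions, their standard textbook definitions can be written in Lean 4 roughly as follows. This is a sketch only: the names `SemiOpen`, `AlphaOpen`, `PreOpen`, and `AlphaOpen.inter` are hypothetical, the definitions used in the paper's experiments may be phrased differently, and the proof of the target statement is deliberately left as `sorry`.

```lean
import Mathlib.Topology.Basic

variable {X : Type*} [TopologicalSpace X]

/-- `A` is semi-open if it is contained in the closure of its interior. -/
def SemiOpen (A : Set X) : Prop := A ⊆ closure (interior A)

/-- `A` is α-open if it is contained in the interior of the closure of its interior. -/
def AlphaOpen (A : Set X) : Prop := A ⊆ interior (closure (interior A))

/-- `A` is preopen if it is contained in the interior of its closure. -/
def PreOpen (A : Set X) : Prop := A ⊆ interior (closure A)

/-- Target statement: the intersection of two α-open sets is α-open. -/
theorem AlphaOpen.inter {A B : Set X} (hA : AlphaOpen A) (hB : AlphaOpen B) :
    AlphaOpen (A ∩ B) := by
  sorry
```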

Discovery Rate: In 20 experimental runs, CPL discovered the target theorem 5 times. In contrast, the SL framework, which generated significantly more theorems on average (328 vs. 106), failed to generate the target theorem even once. Fisher's exact test confirmed that this difference is statistically significant ($p = 0.024$).
Proof Length: CPL generated theorems with significantly longer proof lengths (in character count) compared to SL, supporting the theoretical claim that the framework shifts focus toward more difficult proofs.
Effectiveness of Context:
- Re-proving: When re-proving generated theorems, providing the library as context increased the success rate from 91% to 99% ($p = 4 \times 10^{-35}$).
- Target Theorem: When attempting to re-prove the target theorem regarding the intersection of $\alpha$-open sets, the Prover succeeded 7 times out of 80 attempts when the generated library was provided as context. Without the library, it failed 100% of the time.
- Baseline in Natural Language: When ChatGPT-4o was asked to prove the theorem in natural language, it frequently evaluated the theorem as false or provided incorrect proofs; ChatGPT-o3 consistently evaluated it as false, indicating that the theorem lay outside the pre-trained knowledge of the models. The success in Lean 4 was attributed to context-based learning of proof strategies from the generated library.
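The discovery-rate significance figure can be checked by hand. The sketch below computes a one-sided Fisher exact p-value for 5 successes in 20 CPL runs versus 0 in 20 SL runs, using only the standard library. That the reported value comes from a one-sided test is an assumption made here because it reproduces the quoted 0.024; the function name `fisher_one_sided` is illustrative.

```python
from math import comb


def fisher_one_sided(a, n1, c, n2):
    """One-sided Fisher exact p-value for a 2x2 table.

    a of n1 trials succeed in group 1, c of n2 in group 2. Returns the
    probability of seeing at least a successes in group 1 when the total
    number of successes and both group sizes are held fixed.
    """
    successes = a + c
    total = n1 + n2
    denom = comb(total, n1)
    p = 0.0
    for x in range(a, min(successes, n1) + 1):
        p += comb(successes, x) * comb(total - successes, n1 - x) / denom
    return p


if __name__ == "__main__":
    p = fisher_one_sided(5, 20, 0, 20)  # CPL: 5/20 runs, SL: 0/20 runs
    print(round(p, 3))  # prints 0.024
```

With all five successes in one group, the sum has a single term, which equals $\binom{20}{5} / \binom{40}{5} = 15504 / 658008 \approx 0.0236$.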

Significance and Claims
The work claims that CPL effectively addresses the limitation of LLMs in discovering non-trivial theorems by leveraging context-based learning from self-generated, verified proofs. The authors emphasize that this approach enables the automatic extension of formal mathematical libraries (such as Mathlib) by generating propositions for given concepts that may not be explicitly known to the LLM. The work suggests that separating the conjecture and proof phases, combined with iterative context enrichment, constitutes a viable strategy for neural theorem proving, particularly for closed models where traditional training methods are not applicable. The authors maintain a modest stance, noting that while the framework successfully rediscovered a known research-level theorem, future work is required to refine the generation process for deeper and more significant mathematical statements.

Discovering New Theorems via LLMs with In-Context Proof Learning in Lean
