Learning to Disprove: Formal Counterexample Generation with Large Language Models

This paper addresses a gap in AI mathematical reasoning: models that can prove theorems but struggle to disprove them. It introduces a fine-tuning framework built on symbolic mutation and multi-reward expert iteration that teaches large language models to generate counterexamples and formally verify them in Lean 4, significantly improving performance on newly established benchmarks.

Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su

Published 2026-03-23

Imagine you are a master detective trying to solve a mystery. In the world of mathematics, there are two main jobs for a detective:

  1. The Prosecutor: You try to prove that a suspect is guilty (proving a statement is true).
  2. The Defense Attorney: You try to prove the suspect is innocent by finding a single piece of evidence that breaks the prosecution's case (finding a counterexample to prove a statement is false).

For a long time, Artificial Intelligence (AI) has been an amazing Prosecutor. It can build complex, logical arguments to prove math theorems. But it has been terrible at being a Defense Attorney. It struggles to say, "Wait a minute, here is one specific case where your rule doesn't work."

This paper introduces a new way to train AI to become a brilliant Defense Attorney. They call it "Learning to Disprove."

Here is how they did it, explained with some everyday analogies:

1. The Problem: The AI Has No Practice Cases

Imagine you want to teach a student how to find loopholes in a contract. You can't just give them one or two examples; they need thousands of practice cases.

  • The Issue: There were almost no "practice cases" for AI to learn how to find counterexamples. Most math data only has "correct" proofs, not "broken" ones.
  • The Analogy: It's like trying to teach a chess player how to win by only showing them games where they won, never showing them games where they lost or where the opponent made a mistake.

2. The Solution: The "Mutation Machine" (Data Synthesis)

Since they didn't have enough practice cases, the researchers built a machine to create them. They call this Symbolic Mutation.

  • How it works: They took thousands of math problems that were already proven to be True. Then, they used a computer program to "mutate" them.
  • The Analogy: Imagine a perfect cake recipe (Theorem).
    • Original: "If you use flour, sugar, and eggs, you get a cake." (True)
    • The Mutation: The computer secretly removes "eggs" from the instructions.
    • The New Problem: "If you use flour and sugar, you get a cake."
    • The Result: This new statement is False. To prove it's false, you need to show a "counterexample": a bowl of flour and sugar that doesn't turn into a cake.
  • The Magic: By doing this automatically, they generated 575,000 new "broken" problems for the AI to practice on. This gave the AI a massive library of "loophole hunting" exercises.
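
The "remove an ingredient" move can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the paper mutates formal Lean statements, while this sketch treats hypotheses as plain strings, and every name here (`mutate_statement`, the example premises) is a hypothetical stand-in.

```python
import random

def mutate_statement(hypotheses, conclusion, rng=None):
    """Drop one hypothesis from a true statement, producing a candidate
    statement that is likely (but not guaranteed) to be false."""
    if not hypotheses:
        return None  # nothing to weaken
    rng = rng or random.Random()
    i = rng.randrange(len(hypotheses))
    return {
        "hypotheses": hypotheses[:i] + hypotheses[i + 1:],  # weakened premises
        "conclusion": conclusion,
        "dropped": hypotheses[i],  # the "missing ingredient"
    }

# Original (true) statement: if n is even and n > 0, then n >= 2.
stmt = mutate_statement(["Even n", "n > 0"], "n >= 2")
print("dropped:", stmt["dropped"])
print("weakened:", stmt["hypotheses"], "=>", stmt["conclusion"])
```

Run over a large corpus of proven theorems, a mutator like this mass-produces "broken recipe" statements; a separate step must still confirm each mutant is actually false by finding a counterexample.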

3. The Training: The "Double-Check" System

Training an AI is like teaching a dog a trick. If the dog fails, you usually just say "No" (no reward). But if the dog is trying a hard trick, it might get discouraged and stop trying. This is called the Sparse Reward Problem.

The researchers invented a Multi-Reward System to keep the AI motivated.

  • The Old Way: The AI tries to find a counterexample. If it fails, it gets zero points. If it succeeds, it gets a point.
  • The New Way (Multi-Reward):
    1. Reward A: Did you find a valid counterexample? (Did the cake fail to rise?)
    2. Reward B: Did you prove why the missing ingredient was the problem? (Did you prove that "eggs" were the missing link?)
  • The Analogy: Even if the AI can't solve the hardest puzzle, if it can at least prove why the puzzle is broken, it still gets a small treat. This keeps the AI learning even when the problems are very difficult.
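
The "double-check" idea can be sketched as a tiny reward function. This is a hypothetical sketch, not the paper's actual reward: the names `counterexample_ok` and `justification_ok` and the 0.25 partial weight are illustrative assumptions; in the real system both signals would come from the Lean verifier.

```python
def multi_reward(counterexample_ok: bool, justification_ok: bool,
                 partial_weight: float = 0.25) -> float:
    """Combine two verifier checks into one training reward.

    counterexample_ok: the verifier accepted a full, valid counterexample.
    justification_ok: the verifier accepted a proof of *why* the statement
        is broken, even without a complete counterexample.
    The partial reward keeps a learning signal alive on hard problems.
    """
    reward = 0.0
    if justification_ok:
        reward += partial_weight           # the "small treat"
    if counterexample_ok:
        reward += 1.0 - partial_weight     # the main prize
    return reward

print(multi_reward(False, False))  # total failure: no reward
print(multi_reward(False, True))   # partial credit keeps the AI motivated
print(multi_reward(True, True))    # full success: maximum reward
```

The design point is that the reward is never all-or-nothing: a partially correct attempt still moves the score off zero, so training does not stall on the hardest problems.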

4. The Process: "Guess and Check"

The AI doesn't just guess randomly. It follows a two-step process, like a human mathematician:

  1. The "Gut Check" (Informal Reasoning): The AI uses its natural language brain to say, "Hmm, if I remove this rule, maybe I can make a sequence of numbers that breaks the pattern." It comes up with a rough idea.
  2. The "Courtroom" (Formal Proof): The AI then has to write that idea in a strict, computer-readable language (Lean 4). The Lean proof checker verifies it; if the proof compiles without errors, the counterexample is officially accepted.
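
This two-step loop can be made concrete with a tiny Lean 4 example (a hedged illustration written for this summary, not taken from the paper). The "gut check" guesses that the claim "every natural number is even" breaks at n = 1; the "courtroom" is Lean's checker, which accepts the disproof only if the witness actually works:

```lean
-- Claim to disprove: every natural number is even (n % 2 = 0).
-- Informal gut check: n = 1 should break it.
-- Formal courtroom: Lean accepts this only if 1 really is a counterexample.
example : ¬ ∀ n : Nat, n % 2 = 0 :=
  fun h => absurd (h 1) (by decide)  -- `decide` computes that 1 % 2 = 0 is false
```

If the guessed witness were wrong (say, n = 2), the proof would not compile, so only genuine counterexamples survive the courtroom.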

The Results: A New Champion

When they tested this new AI against the best existing math AIs:

  • It outperformed the best existing math AIs at finding counterexamples by 47% to 74%.
  • It became much better at spotting errors in other people's math proofs.
  • It learned that sometimes, the best way to understand a rule is to try to break it.

Why Does This Matter?

In the real world, we don't just want AI that can follow rules; we want AI that can critique them.

  • In Science: If a scientist proposes a new theory, an AI that can find the "edge cases" where the theory fails is invaluable.
  • In Safety: If we use AI to write code for self-driving cars, we need an AI that can find the one specific situation where the car's logic might fail, rather than just proving it works 99% of the time.

In short: This paper taught AI how to be a better skeptic. By generating millions of "broken" math problems and rewarding the AI for finding the cracks in the logic, they created a smarter, more self-aware mathematical mind.