Learning to Disprove: Formal Counterexample Generation with Large Language Models

This paper addresses a gap in AI mathematical reasoning: models that can prove theorems but struggle to disprove them. It introduces a fine-tuning framework built on symbolic mutation and multi-reward expert iteration that teaches large language models to generate counterexamples and formally verify them in Lean 4, significantly improving performance on newly established benchmarks.

Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su

Published 2026-03-23

Imagine you are a master detective trying to solve a mystery. In the world of mathematics, there are two main jobs for a detective:

  1. The Prosecutor: You try to prove that a suspect is guilty (proving a statement is true).
  2. The Defense Attorney: You try to prove the suspect is innocent by finding a single piece of evidence that breaks the prosecution's case (finding a counterexample to prove a statement is false).

For a long time, Artificial Intelligence (AI) has been an amazing Prosecutor. It can build complex, logical arguments to prove math theorems. But it has been terrible at being a Defense Attorney. It struggles to say, "Wait a minute, here is one specific case where your rule doesn't work."

This paper introduces a new way to train AI to become a brilliant Defense Attorney. They call it "Learning to Disprove."

Here is how they did it, explained with some everyday analogies:

1. The Problem: The AI Has No Practice Cases

Imagine you want to teach a student how to find loopholes in a contract. You can't just give them one or two examples; they need thousands of practice cases.

  • The Issue: There were almost no "practice cases" for AI to learn how to find counterexamples. Most math data only has "correct" proofs, not "broken" ones.
  • The Analogy: It's like trying to teach a chess player how to win by only showing them games where they won, never showing them games where they lost or where the opponent made a mistake.

2. The Solution: The "Mutation Machine" (Data Synthesis)

Since they didn't have enough practice cases, the researchers built a machine to create them. They call this Symbolic Mutation.

  • How it works: They took thousands of math problems that were already proven to be True. Then, they used a computer program to "mutate" them.
  • The Analogy: Imagine a perfect cake recipe (Theorem).
    • Original: "If you use flour, sugar, and eggs, you get a cake." (True)
    • The Mutation: The computer secretly removes "eggs" from the instructions.
    • The New Problem: "If you use flour and sugar, you get a cake."
    • The Result: This new statement is False. To prove it's false, you need to show a "counterexample": a bowl of flour and sugar that doesn't turn into a cake.
  • The Magic: By doing this automatically, they generated 575,000 new "broken" problems for the AI to practice on. This gave the AI a massive library of "loophole hunting" exercises.
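
The "remove an ingredient" move can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the paper mutates formal Lean statements, while this sketch treats hypotheses as plain strings, and every name here (`mutate_statement`, the example premises) is a hypothetical stand-in.

```python
import random

def mutate_statement(hypotheses, conclusion, rng=None):
    """Drop one hypothesis from a true statement, producing a candidate
    statement that is likely (but not guaranteed) to be false."""
    if not hypotheses:
        return None  # nothing to weaken
    rng = rng or random.Random()
    i = rng.randrange(len(hypotheses))
    return {
        "hypotheses": hypotheses[:i] + hypotheses[i + 1:],  # weakened premises
        "conclusion": conclusion,
        "dropped": hypotheses[i],  # the "missing ingredient"
    }

# Original (true) statement: if n is even and n > 0, then n >= 2.
stmt = mutate_statement(["Even n", "n > 0"], "n >= 2")
print("dropped:", stmt["dropped"])
print("weakened:", stmt["hypotheses"], "=>", stmt["conclusion"])
```

Run over a large corpus of proven theorems, a mutator like this mass-produces "broken recipe" statements; a separate step must still confirm each mutant is actually false by finding a counterexample.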

3. The Training: The "Double-Check" System

Training an AI is like teaching a dog a trick. If the dog fails, you usually just say "No" (no reward). But if the dog is trying a hard trick, it might get discouraged and stop trying. This is called the Sparse Reward Problem.

The researchers invented a Multi-Reward System to keep the AI motivated.

  • The Old Way: The AI tries to find a counterexample. If it fails, it gets zero points. If it succeeds, it gets a point.
  • The New Way (Multi-Reward):
    1. Reward A: Did you find a valid counterexample? (Did the cake fail to rise?)
    2. Reward B: Did you prove why the missing ingredient was the problem? (Did you prove that "eggs" were the missing link?)
  • The Analogy: Even if the AI can't solve the hardest puzzle, if it can at least prove why the puzzle is broken, it still gets a small treat. This keeps the AI learning even when the problems are very difficult.
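
The "double-check" idea can be sketched as a tiny reward function. This is a hypothetical sketch, not the paper's actual reward: the names `counterexample_ok` and `justification_ok` and the 0.25 partial weight are illustrative assumptions; in the real system both signals would come from the Lean verifier.

```python
def multi_reward(counterexample_ok: bool, justification_ok: bool,
                 partial_weight: float = 0.25) -> float:
    """Combine two verifier checks into one training reward.

    counterexample_ok: the verifier accepted a full, valid counterexample.
    justification_ok: the verifier accepted a proof of *why* the statement
        is broken, even without a complete counterexample.
    The partial reward keeps a learning signal alive on hard problems.
    """
    reward = 0.0
    if justification_ok:
        reward += partial_weight           # the "small treat"
    if counterexample_ok:
        reward += 1.0 - partial_weight     # the main prize
    return reward

print(multi_reward(False, False))  # total failure: no reward
print(multi_reward(False, True))   # partial credit keeps the AI motivated
print(multi_reward(True, True))    # full success: maximum reward
```

The design point is that the reward is never all-or-nothing: a partially correct attempt still moves the score off zero, so training does not stall on the hardest problems.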

4. The Process: "Guess and Check"

The AI doesn't just guess randomly. It follows a two-step process, like a human mathematician:

  1. The "Gut Check" (Informal Reasoning): The AI uses its natural language brain to say, "Hmm, if I remove this rule, maybe I can make a sequence of numbers that breaks the pattern." It comes up with a rough idea.
  2. The "Courtroom" (Formal Proof): The AI then has to write that idea in a strict, computer-readable language (Lean 4). The Lean proof checker verifies it; if the proof compiles without errors, the counterexample is officially accepted.
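
This two-step loop can be made concrete with a tiny Lean 4 example (a hedged illustration written for this summary, not taken from the paper). The "gut check" guesses that the claim "every natural number is even" breaks at n = 1; the "courtroom" is Lean's checker, which accepts the disproof only if the witness actually works:

```lean
-- Claim to disprove: every natural number is even (n % 2 = 0).
-- Informal gut check: n = 1 should break it.
-- Formal courtroom: Lean accepts this only if 1 really is a counterexample.
example : ¬ ∀ n : Nat, n % 2 = 0 :=
  fun h => absurd (h 1) (by decide)  -- `decide` computes that 1 % 2 = 0 is false
```

If the guessed witness were wrong (say, n = 2), the proof would not compile, so only genuine counterexamples survive the courtroom.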

The Results: A New Champion

When they tested this new AI against the best existing math AIs:

  • It outperformed the best existing math AIs at finding counterexamples by 47% to 74%.
  • It became much better at spotting errors in other people's math proofs.
  • It learned that sometimes, the best way to understand a rule is to try to break it.

Why Does This Matter?

In the real world, we don't just want AI that can follow rules; we want AI that can critique them.

  • In Science: If a scientist proposes a new theory, an AI that can find the "edge cases" where the theory fails is invaluable.
  • In Safety: If we use AI to write code for self-driving cars, we need an AI that can find the one specific situation where the car's logic might fail, rather than just proving it works 99% of the time.

In short: This paper taught AI how to be a better skeptic. By generating millions of "broken" math problems and rewarding the AI for finding the cracks in the logic, they created a smarter, more self-aware mathematical mind.