SorryDB: Can AI Provers Complete Real-World Lean Theorems?

The paper introduces SorryDB, a dynamic benchmark of real-world Lean formalization tasks designed to align AI provers with the needs of the formalization community and to mitigate test contamination. Its evaluation reveals that current approaches (agentic models, general LLMs, and specialized provers) are complementary rather than one strictly dominating the others.

Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

Published 2026-03-04

The "Sorry" Database: Teaching AI to Finish Real Math Homework

Imagine you are a brilliant but slightly overwhelmed math teacher. You have 78 different notebooks (GitHub repositories) filled with complex math problems. You've written down the questions and the setup for the proofs, but you haven't finished the answers yet. Instead of leaving them blank, you've stuck a little sticky note on the unfinished parts that says "Sorry, I'll do this later."

In the world of the computer language Lean (used for formal math), this sticky note is literally the word sorry.
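For example, an unfinished statement in Lean might look like this (a minimal illustration, not an entry from the database):

```lean
-- `sorry` closes the goal so the file still compiles,
-- but Lean flags the declaration as incomplete.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry
```

The task an AI prover faces is exactly this: replace the `sorry` with a proof that the Lean compiler accepts.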

This paper introduces SorryDB, a new way to test Artificial Intelligence. Instead of asking AI to solve old, well-known math puzzles (like Olympic math problems), the researchers built a database of these "Sorry" sticky notes from real, active math projects. They want to see if AI can actually help mathematicians finish their real-world work.

Here is the breakdown of the paper using simple analogies:

1. The Problem: The "Olympiad" Trap

For years, we tested AI math brains using Math Olympiads (like the Putnam competition).

  • The Analogy: Imagine testing a race car driver only on a perfectly smooth, empty racetrack with no traffic.
  • The Issue: Real math isn't a racetrack. It's like driving through a messy city with construction, traffic jams, and weird detours. Real math projects depend on thousands of other definitions and theorems written by other people.
  • The Result: AI models got really good at the "racetrack" (Olympiads) but struggled when asked to fix a real, messy math project. Also, the AI might have just memorized the answers to the old Olympiad problems, so we didn't know if they were actually thinking.

2. The Solution: The "Sorry" Database

The researchers went to GitHub and found 78 active math projects. They looked for every single place where a human mathematician said, "I'm stuck, I'll finish this later" (the sorry keyword).

  • The Analogy: Instead of giving the AI a fresh, clean test paper, they handed it a pile of half-finished homework assignments from real students.
  • Why it's better:
    • Fresh: These problems haven't been solved yet, so the AI can't cheat by memorizing the answer.
    • Real: The problems are messy, depend on specific libraries, and require understanding the context of the whole project.
    • Moving Target: As soon as an AI solves a "Sorry," the database updates with new, harder ones. It's a benchmark that never gets "saturated" (fully solved, and therefore useless for measuring progress).

3. The Experiment: Who Can Finish the Homework?

The researchers took a snapshot of 1,000 of these "Sorry" tasks and asked different types of AI to solve them. They tested three main types of "students":

  • The "Tactics" (The Calculator): These are pre-programmed commands (like grind or linarith) that can solve simple, mechanical math problems instantly.
    • Result: Good for easy stuff, but can't handle complex reasoning.
  • The "Generalists" (The Smart Librarian): These are huge AI models (like Gemini, Claude, GPT) that know a little bit about everything. They try to write the proof in one go.
    • Result: They are decent, but often get lost in the details or hallucinate (make up) facts.
  • The "Agents" (The Detective): These are AI models that don't just guess; they have a workflow. They can look up information, try a solution, see if it fails, read the error message, and try again.
    • Result: The Winners. The "Agent" approach (specifically one based on Gemini Flash) was the best.
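As a toy illustration of the "Tactics" approach, here is the kind of mechanical goal that a single tactic call can close instantly (a minimal sketch, not a task from the benchmark):

```lean
import Mathlib.Tactic

-- A simple linear-arithmetic goal: `linarith` closes it automatically
-- by combining the hypothesis with basic arithmetic reasoning.
example (x y : ℝ) (h : x ≤ y) : x + 1 ≤ y + 1 := by
  linarith
```

Goals like this need no creativity, which is why tactics handle the "easy stuff" but stall on anything requiring a genuine idea.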

4. Key Discoveries (The "Aha!" Moments)

A. Feedback is the Secret Sauce
The most successful AI didn't just guess; it iterated.

  • The Analogy: Imagine trying to fix a broken toaster.
    • One-shot AI: Guesses a fix, tries it, and if it fails, it gives up.
    • Agent AI: Guesses a fix, tries it, sees the smoke, reads the error ("Oh, the wire is too short"), and tries again with a longer wire.
  • Finding: AI that could "read the error message" and try again (Self-Correcting) solved 30% of the problems, while AI that just guessed once only solved 10%.

B. No One is Perfect (Complementarity)
No single AI model solved everything.

  • The Analogy: Think of a sports team. The "Tactics" are the goalkeepers (great at stopping simple shots). The "Generalists" are the strikers (good at big plays). The "Agents" are the midfielders (good at connecting the play).
  • Finding: If you combine all the AI models, they solved 35% of the problems. If you only used the best single model, you'd miss out on the problems the others solved. They are all needed.
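Complementarity is just set arithmetic: each approach solves a different subset of tasks, so pooling them beats any single one. The numbers below are made up for illustration; the paper reports roughly 35% combined versus 30% for the best single agent.

```python
# Toy illustration of complementarity between prover types.
# Task IDs and counts are invented for the example.

tactics     = {1, 2, 3}            # easy mechanical goals
generalists = {3, 4, 5, 6}         # broad one-shot proofs
agents      = {2, 5, 6, 7, 8, 9}   # iterative, feedback-driven proofs

best_single = max(tactics, generalists, agents, key=len)
combined = tactics | generalists | agents

print(len(best_single))  # 6 tasks for the best single approach
print(len(combined))     # 9 tasks when all approaches are pooled
```

Because the overlap between the sets is only partial, dropping any one approach loses tasks that no other approach recovers.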

C. The "Tool" Trap
Some AI models tried to use a search tool to find the answer, but they got distracted.

  • The Analogy: A student trying to do math who keeps opening Google instead of thinking.
  • Finding: Sometimes, the AI that didn't use a search tool did better because it actually tried to construct the proof from scratch, rather than hoping to find a pre-made answer.

5. The Conclusion: Why This Matters

The paper argues that to build AI that actually helps mathematicians, we need to stop testing them on "fake" clean problems and start testing them on "real" messy work.

SorryDB is like a gym for AI mathematicians. It ensures that as AI gets smarter, the tests get harder, preventing the AI from just memorizing answers. It shows that the future of AI math isn't just about having a bigger brain; it's about having a better workflow—one that can read errors, search for help, and keep trying until the "Sorry" note is finally removed.

In short: We are moving from asking AI "Can you solve this riddle?" to asking "Can you help me finish this real project?" And the answer is getting closer to "Yes," especially if the AI is allowed to learn from its mistakes.
