Imagine you are trying to teach a brilliant, super-fast robot how to do advanced mathematics. You've already taught it how to solve high school algebra problems and even some tricky math olympiad puzzles. The robot is great at crunching numbers and finding clever shortcuts.
But now, you want to teach it Category Theory.
The Problem: The "Abstract Gap"
Think of Category Theory not as a list of numbers to add, but as the operating system of modern mathematics. It's like the difference between learning to drive a specific car (solving a specific math problem) and understanding the laws of physics that govern how all vehicles move (understanding the abstract structures).
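To make "abstract structure" a little more concrete: here is what a basic category theory fact looks like when formalized in Lean using Mathlib, the library this paper builds on. This is an illustrative textbook statement (associativity of composition), not one of the benchmark's actual problems.

```lean
import Mathlib.CategoryTheory.Category.Basic

open CategoryTheory

-- In any category C, composition of morphisms is associative:
-- doing (f then g) then h is the same as doing f then (g then h).
example {C : Type*} [Category C] {W X Y Z : C}
    (f : W ⟶ X) (g : X ⟶ Y) (h : Y ⟶ Z) :
    (f ≫ g) ≫ h = f ≫ (g ≫ h) :=
  Category.assoc f g h
```

Notice that the statement says nothing about numbers: it is a law about how structure composes, which is exactly the kind of reasoning the benchmark tests.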
The researchers behind this paper built a benchmark called LeanCat, and it exposed a huge problem:
- The Old Way: Current AI models are like race car drivers who are amazing at the track they've practiced on. If you give them a new, complex track that requires understanding the principles of aerodynamics rather than just memorizing turns, they crash.
- The Result: When they tested the best AI models on 100 new Category Theory problems, the models failed miserably. They could solve the "Easy" problems (like driving in a parking lot), but on "Hard" problems (driving in a storm), their success rate dropped to 0%.
The AI was stuck trying to guess the answer or use "tricks" that worked for simple math but didn't work for deep, structural reasoning. It couldn't "look up" the right rules in its mental library because it didn't know which rules to look for.
The Solution: The "Librarian Agent" (LeanBridge)
To fix this, the team built a new kind of AI agent called LeanBridge.
Imagine the AI isn't just a lone genius trying to remember everything. Instead, it's a detective with a magical library.
- The Detective: The AI looks at the problem.
- The Library: Instead of guessing, it has a tool to instantly search a massive database of mathematical definitions and proven facts (called Mathlib).
- The Loop:
  - The AI tries to solve the problem.
  - If it gets stuck or makes a mistake, it doesn't just try again blindly. It asks the library: "Hey, do we have a rule about this specific shape?"
  - The library hands it the right definition.
  - The AI tries again, now armed with the correct information.
  - It repeats this cycle until the proof is perfect.
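The loop above can be sketched as a short program. This is a toy illustration of the retrieve-and-retry idea, not the paper's actual implementation: in a real system the three helper functions would call a language model, the Lean proof checker, and a Mathlib search tool, so the stand-ins below are purely hypothetical.

```python
# Toy sketch of the retrieve-and-retry loop. The three helpers are
# stand-ins for the real components (LLM, Lean checker, Mathlib search).

def generate_proof(problem, context):
    # Stand-in "detective": only succeeds once the needed fact was retrieved.
    return "apply assoc" if "assoc" in context else "sorry"

def check_proof(problem, proof):
    # Stand-in proof checker: returns (ok, error message).
    if proof == "sorry":
        return False, "unknown rule: associativity"
    return True, ""

def search_mathlib(error):
    # Stand-in library search: maps an error message to relevant facts.
    return ["assoc"] if "associativity" in error else []

def prove_with_library(problem, max_rounds=5):
    context = []                                  # facts retrieved so far
    for _ in range(max_rounds):
        proof = generate_proof(problem, context)  # attempt a proof
        ok, error = check_proof(problem, proof)   # verify it
        if ok:
            return proof                          # done: the proof checks
        context += search_mathlib(error)          # ask the library, then retry
    return None                                   # gave up after max_rounds

print(prove_with_library("(f ; g) ; h = f ; (g ; h)"))  # → apply assoc
```

The key design point is that each failure produces an error message, and the error, not blind guessing, drives the next library lookup.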
The Results: A Breakthrough
When they tested this new "Detective with a Library" approach:
- The Old AI: Solved 12% of the problems.
- The New Agent: Solved 24% of the problems.
It didn't just double the score; it did something the old AI couldn't do at all: it solved the hardest problems. The "Detective" approach was the only one that could navigate the complex, abstract forest of Category Theory without getting lost.
Why This Matters
This paper is a wake-up call for the future of AI in science.
- The Lesson: You can't just make AI "smarter" by feeding it more data or making it guess faster. To solve hard, abstract problems, AI needs to learn how to use tools, search for information, and refine its work step-by-step, just like a human researcher does.
- The Future: This "LeanCat" benchmark is like a new gym for AI. It's a place to train these digital brains to stop being just "calculators" and start becoming true "mathematicians" who can understand the deep structure of the universe.
In short: The paper shows that to teach AI advanced math, we have to stop treating it like a calculator and start treating it like a researcher with a library card.