Imagine you are the CEO of a software company. Instead of hiring a team of engineers to build a new product from scratch, you hire a single, incredibly talented, but slightly eccentric AI apprentice.
Your goal? To build a Logic Detective (called an SMT Solver). This detective's job is to look at a complex set of rules and clues and answer one simple question: "Is it possible for all these clues to be true at the same time, or is there a contradiction?"
Here is the twist: You, the human, wrote zero lines of code. You didn't type a single if statement or for loop. You only gave the AI a set of instructions, and it built the entire detective agency, the filing system, and the reasoning engine itself.
The Mission: Building a Logic Machine
The researchers (Mikoláš Janota and Mirek Olšák) wanted to see if an AI could build a tool that does reasoning. They gave their AI apprentice a mission: "Build a Logic Detective that can solve puzzles involving Uninterpreted Functions (a fancy way of saying 'mystery boxes' where we know the rules of equality but not the specific contents)."
They told the AI:
- The Blueprint: "Use a specific algorithm called 'Congruence Closure' (think of it as a way to group identical items together)."
- The Tools: "Use this specific library for the heavy lifting (CaDiCaL) and this language for the final report (Lean)."
- The Language: "Speak in C++20."
The Journey: From Chaos to Competence
1. The "Naive" Start
At first, the AI tried to build the detective but made a classic rookie mistake. It built a detective who could only look at single clues but couldn't understand how clues connect (like "AND" or "OR"). It was like hiring a detective who can read a fingerprint but doesn't understand the concept of a crime scene.
- The Fix: The researchers simply pointed out the missing piece. The AI realized, "Oh, right! Logic needs connectors!" and rewrote the code.
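To see why the connectors matter, compare an atomic clue with a connected puzzle (a made-up example, not one from the paper):

```latex
% An atomic clue the early detective could handle on its own:
%   a = b
% A connected puzzle it could not:
\[
(a = b \;\lor\; a = c) \;\land\; f(a) \ne f(b) \;\land\; f(a) \ne f(c)
\]
% This is impossible, but seeing that requires case-splitting on the OR:
% if a = b then f(a) = f(b), contradiction; if a = c, likewise.
```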
2. The "Self-Taught" Bug Fixer
As the detective started working, it made mistakes. Sometimes it would say a puzzle was impossible when it actually had a solution.
- The Human Touch: Instead of fixing the code themselves, the researchers gave the AI a "fuzzing" tool. Imagine a machine that throws thousands of random, weird puzzles at the detective. When the detective gets one wrong, the AI analyzes the failure, figures out the bug, and fixes it on its own.
- The Analogy: It's like a student taking a practice test, getting a question wrong, reading the explanation, and immediately fixing their study notes without the teacher having to re-teach the whole chapter.
3. The "Diamond" Problem (The Tricky Puzzle)
There was a specific type of puzzle called an "Equational Diamond." Imagine a maze where you have two paths that look different but lead to the same place. A standard detective would try to walk every single path, which takes forever (exponential time).
- The Breakthrough: The researchers gave the AI a hint: "Look for shortcuts before you start walking." The AI invented a Preprocessing technique. It looked at the maze, spotted the shortcuts, and collapsed the whole maze into a single straight line. Suddenly, puzzles that used to take hours were solved in milliseconds.
4. The "Courtroom" (Certification)
The most impressive part? The AI didn't just solve the puzzles; it had to prove it solved them correctly in a language called Lean (a language used by mathematicians to verify proofs).
- The Challenge: The AI had to translate its internal logic into a formal legal argument that a "Judge" (the Lean proof checker) would accept.
- The Struggle: The AI kept trying to make the Judge accept complex arguments that were too long. The researchers had to teach it: "Don't write a novel; just write the key points." Once they gave it a template of a perfect proof, the AI learned to write its own legal briefs perfectly.
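For a flavor of what the Judge accepts, here is a tiny certificate in Lean 4 (a toy example of my own, not the solver's actual output): from the clues a = b and f b = c, the proof checker confirms that f a = c follows.

```lean
-- The "legal brief": short, key points only. Rewriting with the two
-- clue hypotheses closes the goal; the Lean kernel checks every step.
example (f : Nat → Nat) (a b c : Nat)
    (h₁ : a = b) (h₂ : f b = c) : f a = c := by
  rw [h₁, h₂]
```

The template lesson from the researchers amounts to exactly this style: a few rewrite steps rather than a sprawling argument.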
The Results: How Good is the AI Detective?
The researchers tested their AI-built detective against the world's best human-built detectives (Z3 and cvc5).
- The Score: The AI detective solved 7,468 out of 7,500 puzzles.
- The Comparison: It was almost as good as the human champions, solving nearly the same number of problems.
- The Catch: It was slightly slower on some puzzles because it didn't have the "instincts" (optimizations) that human engineers spend decades refining. But for a machine that wrote its own code from scratch? That's a home run.
The Big Lessons (The "Moral of the Story")
- AI is Powerful but "Jagged": The AI is brilliant at complex tasks but can fail at silly, simple things (like forgetting that x = x is always true). It's like a genius who can write a symphony but forgets to tie their shoelaces.
- You Need a Safety Net: You can't just let AI run wild. You need "timeout" limits (so it doesn't get stuck forever) and "fuzzing" (random testing) to catch errors.
- The Future is Collaborative: The AI didn't replace the human; the human became the architect and the teacher. The human set the rules, provided the examples, and built the safety nets, while the AI did the heavy lifting of writing the code.
In short: This paper proves that with the right guidance, an AI can build a sophisticated reasoning tool from scratch, effectively acting as a "self-writing software engineer." It's not perfect yet, but it's a massive leap forward in what machines can create for themselves.