Imagine you have a brilliant, incredibly well-read student named LLM. This student has read almost every book in the library and can write beautiful essays, tell jokes, and summarize history. However, if you ask them to solve a tricky math problem or write a rigorous proof, they often stumble. They might get the right answer by guessing the pattern, but they can't explain why it's true without making logical mistakes. They are great at "sounding smart," but not always at "being right."
This paper introduces a new way to help this student become a reliable math genius by giving them two special tools: a Mentor and a Strict Teacher.
Here is how their new "Neuro-Symbolic" system works, broken down into simple steps:
1. The Problem: The "Hallucination" Trap
Usually, when you ask an LLM to prove a geometry theorem, it tries to guess the next word based on what it has seen before. It's like a student trying to solve a puzzle by remembering how similar puzzles looked, rather than actually understanding the rules. If the puzzle is slightly different, they get confused and make up facts that sound plausible but are wrong.
2. The Solution: A Two-Part Team
The authors built a system that pairs the LLM with two structured helpers:
Part A: The "Mentor" (Analogical Retrieval)
Instead of letting the student guess from scratch, the system first looks for similar problems that have already been solved correctly.
- The Analogy: Imagine you are trying to fix a leaky faucet. Instead of guessing how to do it, you look at a manual for a very similar faucet that you know how to fix. You use that as a guide.
- How it works: The system takes your new geometry problem, strips away the specific names and numbers (turning "Triangle ABC" into just "Triangle X"), and finds other problems that have the exact same structure. It then shows the LLM the proofs for those similar problems.
- The Benefit: This gives the LLM a "cheat sheet" of the right logical steps, so it doesn't have to guess. It also helps the system ignore thousands of irrelevant math rules, focusing only on the few that matter for this specific type of problem.
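The "strip the names, match the structure" idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual retrieval system: the `abstract` and `retrieve` helpers and the tiny `library` are invented here to show the principle.

```python
import re

def abstract(problem: str) -> str:
    """Replace specific labels and numbers with placeholders, so
    structurally identical problems map to the same template."""
    s = re.sub(r"\b[A-Z]{1,3}\b", "X", problem)   # "ABC" -> "X"
    s = re.sub(r"\b\d+(\.\d+)?\b", "N", s)        # "45"  -> "N"
    return s

# A tiny "solved problem" library; the real system's corpus is far larger.
library = {
    "Prove that the base angles of isosceles triangle X are equal.":
        "1. Triangle X is isosceles, so two of its sides are equal ...",
}

def retrieve(problem: str):
    """Return the proof of a solved problem with the same abstract shape."""
    key = abstract(problem)
    for solved, proof in library.items():
        if abstract(solved) == key:
            return proof
    return None
```

A new problem about "triangle PQR" then retrieves the proof stored for "triangle X", because both reduce to the same template.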
Part B: The "Strict Teacher" (Symbolic Verifier)
Once the LLM writes a proof, it doesn't just get a "Good job!" or "Try again." It gets a robotic referee that checks every single step.
- The Analogy: Think of a code compiler. If you write a program with a typo, the computer doesn't just say "It looks wrong." It points exactly to line 42 and says, "You used a variable that doesn't exist."
- How it works: The LLM writes a proof. The "Strict Teacher" (a formal logic system) checks it step-by-step.
- Did you use a rule that doesn't apply here? Error.
- Did you assume something without proving it first? Error.
- Did you reach the right conclusion? Success.
- The Loop: If the teacher finds an error, it tells the LLM exactly what went wrong. The LLM then rewrites the proof, fixing that specific mistake. They keep doing this loop until the proof is perfect or they run out of tries.
3. The Results: From "Maybe" to "Definitely"
The researchers tested this on hard SAT-level geometry problems.
- Without help: The smartest AI models (like OpenAI's o1) only got about 10% of the proofs right on the first try. They were guessing.
- With the Mentor and Teacher: The success rate jumped to 80%.
- The Cost: Because the system only showed the LLM the relevant math rules (instead of the whole dictionary of 18,000 rules), it actually saved money and computing power.
Why This Matters
This isn't just about geometry. It's about trust.
Currently, we can't fully trust AI with critical tasks (like medical diagnoses or legal contracts) because they might hallucinate a fact. This paper shows a blueprint for how to make AI reliable:
- Show them a similar example so they know the pattern.
- Check their work with a rigid, unfeeling logic machine.
- Let them try again until they get it right.
By combining the creative, flexible brain of the AI with the rigid, precise logic of a computer, we can create systems that don't just sound smart, but are actually correct. This is the future of building AI that we can truly trust with important jobs.