On Multi-Step Theorem Prediction via Non-Parametric Structural Priors

This paper introduces a training-free, non-parametric approach to multi-step theorem prediction. It overcomes the scalability limits of vanilla in-context learning by using Theorem Precedence Graphs to encode temporal dependencies and impose topological constraints, achieving state-of-the-art accuracy on the FormalGeo7k benchmark without any gradient-based optimization.

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

Published 2026-03-06

The Big Problem: The "Lost in the Woods" Effect

Imagine you are trying to solve a very complex geometry puzzle. You have a giant library of 300 different rules (theorems) you can use, like "If two lines are parallel, then angles are equal."

To solve the puzzle, you need to pick the right rule, apply it, get a new fact, pick the next right rule, and so on. This is like walking through a dense forest where every tree represents a possible move.

The Old Way (Vanilla ICL):
Researchers tried to use powerful AI (Large Language Models) to solve this by just giving them a few examples and saying, "Here's how we solved similar problems before; now you do it."

  • The Analogy: It's like giving a tourist a map with no landmarks and saying, "Just guess the path."
  • The Result: For short paths (easy puzzles), the AI does okay. But as the puzzle gets longer and more complex (deeper into the forest), the AI gets confused. It starts picking rules that don't fit, like invoking the Pythagorean Theorem when it needs to prove two angles are equal. The AI gets lost, and its success rate crashes to near zero. The authors call this "Structural Drift": the AI forgets the logical order of the forest.

The New Solution: The "Smart Guidebook" (Pri-TPG)

The authors propose a new method called Pri-TPG. Instead of letting the AI wander blindly, they give it a dynamic, custom-made guidebook for every single puzzle.

Here is how it works, step-by-step:

1. The "Memory Lane" (Retrieval)

When a new geometry problem arrives, the system doesn't just look at the problem; it looks at its "doppelgangers" in a database of past solved puzzles.

  • Analogy: Imagine you are trying to fix a weird leak in your kitchen sink. Instead of guessing, you ask a smart librarian, "Who has fixed a sink like this before?" The librarian pulls out 200 photos of similar sinks that were successfully fixed.
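The retrieval step can be sketched roughly as a nearest-neighbor lookup. The summary doesn't specify the paper's similarity metric, so this sketch makes an assumption: problems are represented as sets of geometric predicates, and past solutions are ranked by Jaccard overlap with the new problem. The predicate strings and database layout are illustrative, not from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two predicate sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query: set, database: list, k: int = 200) -> list:
    """Return the k past problems most similar to the query, best first."""
    ranked = sorted(database,
                    key=lambda entry: jaccard(query, entry["predicates"]),
                    reverse=True)
    return ranked[:k]

# Toy database of previously solved problems (illustrative).
solved = [
    {"id": 1, "predicates": {"parallel(AB,CD)", "angle(ABC)"}, "proof": ["A", "B"]},
    {"id": 2, "predicates": {"circle(O)", "tangent(PQ)"},      "proof": ["C"]},
]
neighbors = retrieve({"parallel(AB,CD)", "angle(BCD)"}, solved, k=1)
print(neighbors[0]["id"])  # → 1
```

Any similarity measure (embedding cosine distance, graph matching) could be swapped in; the point is only that the system pulls the closest "doppelgangers" before planning.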

2. The "Flowchart" (Theorem Precedence Graph)

The system takes those 200 past solutions and builds a flowchart (a directed graph). This flowchart shows the order in which rules were used.

  • Analogy: The librarian doesn't just give you a pile of photos. She draws a map showing: "First, you must tighten the valve (Rule A). Only AFTER that can you replace the washer (Rule B). You cannot replace the washer before tightening the valve, or it won't work."
  • This map is called the Theorem Precedence Graph (TPG). It tells the AI: "You are at Step 1. Here are the only 3 moves you are allowed to make next."
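Building that flowchart from the retrieved solutions can be sketched as follows. Here an edge (A, B) means theorem B was applied immediately after A in some past proof, so the successors of the current theorem are the "only moves you are allowed to make next." The paper's actual construction may differ (e.g., it may weight edges by frequency); this is a minimal unweighted version.

```python
from collections import defaultdict

def build_tpg(proofs):
    """Build a directed precedence graph from theorem-name sequences."""
    graph = defaultdict(set)
    for proof in proofs:
        # Record every consecutive pair: earlier theorem -> later theorem.
        for earlier, later in zip(proof, proof[1:]):
            graph[earlier].add(later)
    return graph

# Three toy retrieved proofs, each a sequence of theorem names.
proofs = [["A", "B", "C"], ["A", "C"], ["B", "C", "D"]]
tpg = build_tpg(proofs)
print(sorted(tpg["A"]))  # → ['B', 'C']
print(sorted(tpg["C"]))  # → ['D']
```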

3. The "Bouncer" (Symbolic Executor)

The AI (the planner) picks a move from the allowed list. But before the move is accepted, a strict "Bouncer" (a symbolic solver) checks it.

  • Analogy: The AI says, "I want to use Rule B!" The Bouncer checks the current state of the puzzle. "Nope, you haven't tightened the valve yet. Rule B is blocked. Try Rule A."
  • This prevents the AI from making illegal moves that would break the logic chain.
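The "Bouncer" amounts to a precondition check: a move is accepted only if the facts it depends on have already been established. The rule names and precondition sets below reuse the sink-repair analogy and are purely illustrative; a real symbolic geometry solver checks far richer conditions.

```python
# Each rule lists the facts it needs and the facts it adds (illustrative).
RULES = {
    "tighten_valve":  {"needs": set(),           "adds": {"valve_tight"}},
    "replace_washer": {"needs": {"valve_tight"}, "adds": {"washer_new"}},
}

def try_apply(rule_name, facts):
    """Apply the rule if its preconditions hold; return (ok, new_facts)."""
    rule = RULES[rule_name]
    if not rule["needs"] <= facts:
        return False, facts            # blocked: a precondition is missing
    return True, facts | rule["adds"]  # accepted: state is updated

facts = set()
ok, facts = try_apply("replace_washer", facts)
print(ok)  # → False (valve not tightened yet)
ok, facts = try_apply("tighten_valve", facts)
ok, facts = try_apply("replace_washer", facts)
print(ok)  # → True
```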

4. The "Iterative Loop" (Step-by-Step)

The AI doesn't try to write the whole solution at once. It takes one step, gets checked by the Bouncer, updates its map, and takes the next step.

  • Analogy: It's like playing a game of chess where you make one move, the computer checks if it's legal, updates the board, and then you make the next move. You don't try to predict the whole game in one breath.
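Putting the pieces together, the loop in step 4 can be sketched like this. The real planner is an LLM choosing among the TPG-allowed candidates; here a stand-in planner just takes the first candidate that passes the symbolic check. The graph, rule table, and goal fact are all toy assumptions.

```python
def solve(tpg, rules, start, goal_fact, max_steps=10):
    """Iteratively pick a TPG-allowed, symbolically legal move each turn."""
    facts = set(rules[start]["adds"])
    current, plan = start, [start]
    for _ in range(max_steps):
        if goal_fact in facts:
            return plan
        for candidate in sorted(tpg.get(current, ())):  # TPG-allowed moves
            if rules[candidate]["needs"] <= facts:      # Bouncer's check
                facts |= rules[candidate]["adds"]       # update the board
                plan.append(candidate)
                current = candidate
                break
        else:
            return None  # no legal move: dead end
    return plan if goal_fact in facts else None

# Toy precedence graph and rule table (illustrative).
tpg = {"A": {"B", "C"}, "B": {"C"}}
rules = {
    "A": {"needs": set(),   "adds": {"f1"}},
    "B": {"needs": {"f1"},  "adds": {"f2"}},
    "C": {"needs": {"f2"},  "adds": {"goal"}},
}
print(solve(tpg, rules, "A", "goal"))  # → ['A', 'B', 'C']
```

Note how "C" is a TPG-allowed successor of "A" but is rejected on the first turn because its precondition "f2" doesn't hold yet; the loop only reaches it after "B" establishes that fact.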

Why This is a Big Deal

  1. No "Schooling" Required: Most AI models need to be "trained" (studied for months) on specific math problems to get good at them. If the math library changes, you have to re-train them.

    • Pri-TPG is "Training-Free." It learns on the fly by looking at past examples. It's like a genius student who can solve a new type of math problem just by looking at a few similar examples from a textbook, without needing to go to summer school.
  2. Solving the "Long Chain" Problem: The biggest breakthrough is that this method works even for very long, difficult puzzles (6+ steps).

    • The Result: On a standard benchmark, the old "guessing" AI got about 26% of the hard problems right. The new Pri-TPG method got 89% right. It matched the performance of the most expensive, heavily trained super-AIs, but without the cost of training.

Summary in One Sentence

The paper teaches AI how to solve complex, multi-step logic puzzles by giving it a custom-made, step-by-step flowchart based on past solutions, acting as a strict guide that prevents the AI from getting lost in the forest of possibilities.