The Big Problem: "Right Answer, Wrong Reasoning"
Imagine you are taking a math test. You write down a long, messy paragraph of thoughts, and at the very end, you write the correct answer: 42.
Your teacher (the AI evaluator) sees "42" and gives you an A. But if they look closely at your work, they might see that you guessed, got lucky, or used a shortcut that happens to work for this specific problem but makes no sense logically.
Current AI models (LLMs) are great at getting the "42," but we don't really know how they got there. Are they actually thinking step-by-step, or are they just guessing until they hit the right number?
The Solution: The "DAG-Math" Blueprint
The authors of this paper propose a new way to look at how AI thinks. They call it DAG-Math.
Think of a standard AI answer like a stream of consciousness—a river flowing in one direction. It's hard to see where the water came from or if it hit a dead end.
DAG-Math turns that river into a city map (specifically, a Directed Acyclic Graph, or DAG).
- Nodes (The Intersections): Every single thought or calculation the AI makes is a specific stop on the map.
- Edges (The Roads): Every time the AI uses a previous thought to make a new one, it draws a road connecting them.
- No Loops: You can't drive in a circle. You can't use a conclusion to prove itself. You must move forward.
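As a toy illustration (not the paper's actual implementation), a reasoning trace can be stored as a map from each step to the earlier steps it builds on, and the "no loops" rule becomes a simple cycle check (here via Kahn's topological-sort algorithm). The step names are made up for the example:

```python
from collections import deque

def is_dag(deps):
    """Check that a reasoning trace moves strictly forward (no cycles).
    deps maps each step to the earlier steps it builds on."""
    # How many unresolved dependencies each step still has
    indegree = {step: len(parents) for step, parents in deps.items()}
    # Steps with no dependencies are the starting premises
    ready = deque(s for s, d in indegree.items() if d == 0)
    # children[p] = the later steps that use p
    children = {s: [] for s in deps}
    for step, parents in deps.items():
        for p in parents:
            children[p].append(step)
    visited = 0
    while ready:
        step = ready.popleft()
        visited += 1
        for child in children[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Any step never reached sits on a cycle
    return visited == len(deps)

trace = {
    "given":  [],                  # the problem statement
    "step1":  ["given"],           # uses the givens
    "step2":  ["given", "step1"],  # combines both
    "answer": ["step2"],
}
print(is_dag(trace))  # → True
```

A trace where a conclusion props itself up, such as `{"a": ["b"], "b": ["a"]}`, fails the check: you cannot drive in a circle.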
The New Metric: "Logical Closeness"
The paper introduces a new way to grade the AI called Logical Closeness.
Imagine the AI is building a house.
- Standard Grading (PASS@k): Did the house have a roof? Yes? Great, you get a passing grade.
- DAG-Math Grading (Logical Closeness): Did the roof sit on the walls? Did the walls sit on the foundation? Did every brick have a reason for being there?
If the AI builds a roof that floats in the air because it forgot to build the walls, it gets a low Logical Closeness score, even if the roof looks perfect.
The paper defines "Perfect Reasoning" as a path where:
- Every step is connected to the steps before it (no floating roofs).
- The path leads to the correct answer.
- There are no dead-end streets or irrelevant detours.
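These three criteria can be turned into a number. As a hypothetical, heavily simplified stand-in for the paper's Logical Closeness metric (the function name and scoring rule here are illustrative assumptions, not the paper's definition), one could score the fraction of steps that actually contribute to the final answer, so detours and floating roofs drag the score down:

```python
def logical_closeness(deps, answer):
    """Toy score: the fraction of steps that the final answer
    actually depends on. Dead ends and irrelevant detours
    lower the score. A simplified sketch, not the paper's metric."""
    # Walk backward from the answer, collecting every step it rests on
    useful = set()
    stack = [answer]
    while stack:
        step = stack.pop()
        if step not in useful:
            useful.add(step)
            stack.extend(deps[step])
    return len(useful) / len(deps)

trace = {
    "given":  [],
    "step1":  ["given"],
    "detour": ["given"],  # a dead end: nothing builds on it
    "answer": ["step1"],
}
print(logical_closeness(trace, "answer"))  # → 0.75
```

Here the `detour` step is a wall no roof ever sat on, so only 3 of 4 steps count toward the score.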
What They Found (The "Aha!" Moments)
The researchers tested this on famous math competitions (like the AIME) using top AI models (Gemini, GPT, Qwen). Here is what they discovered:
1. The "Search vs. Reasoning" Gap
- The Analogy: Imagine a person trying to find a key in a dark room.
- Reasoning: They use a flashlight, check the corners logically, and find the key.
- Search: They just swing a bat around wildly until they hit the key by accident.
- The Finding: Many AI models are "bat swingers." They generate tons of possibilities (search) until they stumble on the right answer. This makes them look smart on standard tests (high PASS@1), but their "Logical Closeness" scores reveal that the reasoning underneath is actually quite messy. The paper found a large gap between "getting the right answer" and "reasoning correctly."
2. Harder Problems = Messier Maps
- When the math problems get harder, the AI's "city map" gets bigger and more chaotic.
- For easy problems, the map is a straight line.
- For hard problems, the AI builds a massive, sprawling city with many dead ends and redundant detours. The models that succeed are the ones that can prune the dead ends and find the straight path.
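The pruning described above can be sketched on the same step-to-dependencies representation: keep only the steps the answer actually rests on and discard every dead-end branch. This is an illustrative sketch under that assumption, not the paper's method:

```python
def prune_dead_ends(deps, answer):
    """Return the trace with every step that does not support
    the final answer removed. A sketch, not the paper's method."""
    keep, stack = set(), [answer]
    # Walk backward from the answer, marking its supporting steps
    while stack:
        step = stack.pop()
        if step not in keep:
            keep.add(step)
            stack.extend(deps[step])
    return {s: ps for s, ps in deps.items() if s in keep}

messy = {
    "given":      [],
    "step1":      ["given"],
    "wrong_turn": ["given"],
    "dead_end":   ["wrong_turn"],
    "answer":     ["step1"],
}
print(sorted(prune_dead_ends(messy, "answer")))
# → ['answer', 'given', 'step1']
```

The sprawling five-step city collapses back into the three-step straight path.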
3. Thinking Models vs. Fast Models
- Models that have a "thinking mode" (where they pause to think before answering) did better at building clean, logical maps. However, even the smartest models still struggle with "Logical Closeness." They often get the answer right but take a weird, illogical route to get there.
Why This Matters
This paper is like giving the AI a mirror.
- Before, we only knew if the AI got the answer right or wrong.
- Now, we can see how it got there.
This helps developers fix AI models. Instead of just telling the AI "Get the right answer," they can now say, "Build a logical map where every step connects to the next." This moves AI from being a lucky guesser to being a true logical thinker.
Summary in One Sentence
DAG-Math stops AI from getting away with "lucky guesses" by forcing it to draw a clear, connected map of its thoughts, proving that it actually understands the math, not just the answer.