CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

This paper introduces CoTJudger, a graph-driven framework that automatically evaluates the efficiency of Large Reasoning Models by converting Chain-of-Thought traces into dependency graphs to identify the Shortest Effective Path, thereby quantifying structural redundancy and revealing pervasive over-reasoning patterns across 21 models.

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang

Published Tue, 10 Ma

Imagine you are asking a brilliant but overly chatty friend to solve a math puzzle for you.

Friend A says: "Okay, the time is 3:52. Let me think... 3:52. Is that 3 hours and 52 minutes? Yes. Does it have a zero? Maybe. Wait, does a digital clock show a zero? Sometimes. Let me check my notes on clocks. Okay, if it's 3:52, the digits are 3, 5, 2. But the question asks for the next time using 2, 3, and 5. Hmm. What if it's 03:52? Then there's a zero. That's a problem. Let me re-read the question. No, it says 2, 3, 5. Okay, so 3:52 is fine. But wait, what about 3:53? No, that repeats the 3. What about 3:55? No. Let me try 4:00. No. Let me try 5:23. Oh, that works! But wait, did I check 4:25? No. Let me check 4:25. No. Okay, 5:23 is the answer. It's 91 minutes away. But let me double-check. Is 5:23 the next one? What if I missed 4: something? Let me list all numbers. 2, 3, 5. Okay, 5:23. Yes. 91 minutes. Wait, let me verify the math again. 5:23 minus 3:52. 5 minus 3 is 2 hours. 23 minus 52 is... negative. Oh, I need to borrow an hour. 2 hours is 120 minutes. 120 minus 29 is 91. Okay, 91. But wait, is there a trick? No. 91. Final answer: 91."

Friend B says: "Okay, 3:52. The digits are 3, 5, 2. I need the next time with 2, 3, 5. The next hour is 4, but 4 isn't in the list. The next hour is 5. So the time must be 5:something. The remaining digits are 2 and 3. So it's either 5:23 or 5:32. 5:23 comes first. 3:52 to 5:23 is 1 hour and 31 minutes. That's 91 minutes. Done."

Both friends got the right answer (91 minutes). But Friend A wasted a lot of time, energy, and paper (or computer processing power) getting there. They went in circles, doubted themselves, and checked things they didn't need to. Friend B took the direct, efficient path.
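If you want to check the puzzle both friends solved, a brute-force scan works: starting from 3:52, step forward one minute at a time until the displayed digits are exactly {2, 3, 5}. This is a quick verification sketch, not anything from the paper; the function name is just for illustration.

```python
def next_time_with_digits(hour, minute, digits):
    """Scan forward minute by minute until the clock shows exactly `digits`."""
    for step in range(1, 12 * 60 + 1):
        total = hour * 60 + minute + step
        h = (total // 60) % 12 or 12   # 12-hour clock, no leading zero
        m = total % 60
        if sorted(str(h) + f"{m:02d}") == sorted(digits):
            return step, h, m
    return None

minutes, h, m = next_time_with_digits(3, 52, "235")
print(minutes, h, m)  # 91 minutes later, at 5:23
```

The scan confirms Friend B's direct reasoning: 4:xx can never work (a 4 is always on the display), so the next candidate hour is 5, and 5:23 arrives 91 minutes after 3:52.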

The Problem: "Over-Thinking"

In the world of Artificial Intelligence, specifically "Large Reasoning Models" (LRMs), we have a similar problem. These AI models are getting very good at solving hard problems, but they often suffer from "Over-Thinking." They generate massive amounts of text (Chain-of-Thought) to solve simple problems. They loop back, check their own work too many times, and get stuck in "analysis paralysis." This costs a lot of money and time (computing power) without making the answer any better.

The Solution: CoTJudger

The paper introduces a new tool called CoTJudger. Think of CoTJudger as a super-smart editor or a traffic control system for these AI thoughts.

Here is how it works, using simple analogies:

1. Turning a Story into a Map

When an AI thinks, it writes a long, messy paragraph. CoTJudger takes this paragraph and turns it into a flowchart (a graph).

  • Nodes: Each sentence or thought becomes a "stop" on the map.
  • Arrows: The connections between thoughts become "roads."
  • Loops: If the AI says, "Wait, let me check that again," and then checks it, CoTJudger draws a circle on the map showing it went in a loop.
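The idea above can be sketched with a few lines of Python. This is not the paper's implementation; the step labels and edges are a made-up miniature of Friend A's trace, and a "loop" is simply an edge pointing back to a thought that appeared earlier.

```python
from collections import defaultdict

def build_cot_graph(edges):
    """Turn (from_thought, to_thought) pairs into an adjacency list."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph

def find_loops(edges, order):
    """A back-edge points to a thought that appeared earlier in the trace."""
    position = {step: i for i, step in enumerate(order)}
    return [(s, d) for s, d in edges if position[d] <= position[s]]

# Hypothetical labels for Friend A's messy trace.
order = ["problem", "parse_time", "worry_about_zero",
         "try_5_23", "verify_math", "answer"]
edges = [
    ("problem", "parse_time"),
    ("parse_time", "worry_about_zero"),
    ("worry_about_zero", "parse_time"),   # loop: re-reads the question
    ("parse_time", "try_5_23"),
    ("try_5_23", "verify_math"),
    ("verify_math", "try_5_23"),          # loop: double-checks the answer
    ("verify_math", "answer"),
]

graph = build_cot_graph(edges)
loops = find_loops(edges, order)
print(len(loops))  # two circles drawn on the map
```

Each stop on the map is a node, each road is a directed edge, and the two back-edges are exactly the circles the framework would draw for Friend A's "wait, let me check that again" moments.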

2. Finding the "Shortest Effective Path" (The Golden Route)

Once the map is drawn, CoTJudger asks a simple question: "If you had to get from the Problem to the Answer as fast as possible, which roads would you take?"

It finds the Shortest Effective Path (SEP). This is the "Golden Route"—the absolute minimum steps needed to solve the problem correctly.

  • Friend A's Path: A giant, tangled mess of loops and detours.
  • Friend B's Path: A straight, clean line.
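Once the trace is a graph, finding the Golden Route is a classic shortest-path problem. A plain breadth-first search is one way to do it (the paper may use something more elaborate; the graph below reuses the hypothetical step labels from Friend A's trace):

```python
from collections import deque

def shortest_effective_path(graph, start, goal):
    """Breadth-first search: the fewest hops from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Friend A's tangled graph: the loops exist, but BFS never needs them.
graph = {
    "problem": ["parse_time"],
    "parse_time": ["worry_about_zero", "try_5_23"],
    "worry_about_zero": ["parse_time"],
    "try_5_23": ["verify_math"],
    "verify_math": ["try_5_23", "answer"],
}

sep = shortest_effective_path(graph, "problem", "answer")
print(sep)
```

Note how the detour through `worry_about_zero` and both back-edges vanish: BFS marks each node as visited the first time it is reached, so loops can never appear on the returned path.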

3. Measuring the Waste

CoTJudger compares the AI's actual messy path against the "Golden Route."

  • Redundancy Ratio: It calculates how much of the AI's thinking was just "waste."
    • Example: If the AI wrote 100 sentences, but the Golden Route only needed 10, the Redundancy Ratio is 90%. That means 90% of the work was unnecessary!
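In code, the calculation in that example is a one-liner (my reading of the ratio described above, not necessarily the paper's exact formula):

```python
def redundancy_ratio(total_steps, sep_steps):
    """Fraction of the trace that the Golden Route didn't need."""
    return (total_steps - sep_steps) / total_steps

print(redundancy_ratio(100, 10))  # 0.9 → 90% of the work was unnecessary
```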

What Did They Discover?

The researchers tested 21 different AI models and found some funny and interesting patterns:

  • The "Obsessive Checker": Some models (like DeepSeek-R1) are like people who lock their front door, check it, lock it again, check the window, check the lock again, and then check if the key is in their pocket. They get stuck in loops of "verification" that don't help.
  • The "Wordy Explainer": Other models (like Qwen3-Max) don't loop, but they just talk too much. They explain the same thing in five different ways. It's like a friend who tells a joke, then explains the joke, then explains why the joke is funny, and then tells the joke again.
  • The "Distillation Bloat": When smaller AI models are trained to copy smarter ones (a process called "distillation"), they often copy the bad habits too. They learn to be chatty and inefficient, not just smart.
  • The "Efficient Heroes": Some models, like gpt-oss-120b, were found to be the most efficient. They got the right answer with the least amount of "thinking" waste.

Why Does This Matter?

Imagine you are paying for a taxi ride.

  • Old Way: You just pay for the total distance the car drove. If the driver took a scenic route, drove in circles, and got lost, you pay more.
  • CoTJudger Way: You pay for the direct distance from A to B. If the driver took a detour, CoTJudger tells you, "Hey, you paid for 10 miles, but the trip was only 2 miles. You wasted 8 miles of gas."

This tool helps developers:

  1. Save Money: Stop paying for AI to generate useless text.
  2. Fix Bad Habits: Show AI models exactly where they are looping or being redundant so they can learn to be more efficient.
  3. Build Better AI: Create models that are not just smart, but also fast and frugal.

In short, CoTJudger is the tool that teaches AI models to stop over-thinking, stop talking in circles, and just get to the point.