CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

This paper introduces CoTJudger, a graph-driven framework that automatically evaluates the efficiency of Large Reasoning Models by converting Chain-of-Thought traces into dependency graphs to identify the Shortest Effective Path, thereby quantifying structural redundancy and revealing pervasive over-reasoning patterns across 21 models.

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang

Published Tue, 10 Ma

Imagine you are asking a brilliant but overly chatty friend to solve a math puzzle for you.

Friend A says: "Okay, the time is 3:52. Let me think... 3:52. Is that 3 hours and 52 minutes? Yes. Does it have a zero? Maybe. Wait, does a digital clock show a zero? Sometimes. Let me check my notes on clocks. Okay, if it's 3:52, the digits are 3, 5, 2. But the question asks for the next time using 2, 3, and 5. Hmm. What if it's 03:52? Then there's a zero. That's a problem. Let me re-read the question. No, it says 2, 3, 5. Okay, so 3:52 is fine. But wait, what about 3:53? No, that repeats the 3. What about 3:55? No. Let me try 4:00. No. Let me try 5:23. Oh, that works! But wait, did I check 4:25? No. Let me check 4:25. No. Okay, 5:23 is the answer. It's 91 minutes away. But let me double-check. Is 5:23 the next one? What if I missed 4: something? Let me list all numbers. 2, 3, 5. Okay, 5:23. Yes. 91 minutes. Wait, let me verify the math again. 5:23 minus 3:52. 5 minus 3 is 2 hours. 23 minus 52 is... negative. Oh, I need to borrow an hour. 2 hours is 120 minutes. 120 minus 29 is 91. Okay, 91. But wait, is there a trick? No. 91. Final answer: 91."

Friend B says: "Okay, 3:52. The digits are 3, 5, 2. I need the next time with 2, 3, 5. The next hour is 4, but 4 isn't in the list. The next hour is 5. So the time must be 5:something. The remaining digits are 2 and 3. So it's either 5:23 or 5:32. 5:23 comes first. 3:52 to 5:23 is 1 hour and 31 minutes. That's 91 minutes. Done."

Both friends got the right answer (91 minutes). But Friend A wasted a lot of time, energy, and paper (or computer processing power) getting there. They went in circles, doubted themselves, and checked things they didn't need to. Friend B took the direct, efficient path.
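If you want to check the puzzle both friends solved, a brute-force scan works: starting from 3:52, step forward one minute at a time until the displayed digits are exactly {2, 3, 5}. This is a quick verification sketch, not anything from the paper; the function name is just for illustration.

```python
def next_time_with_digits(hour, minute, digits):
    """Scan forward minute by minute until the clock shows exactly `digits`."""
    for step in range(1, 12 * 60 + 1):
        total = hour * 60 + minute + step
        h = (total // 60) % 12 or 12   # 12-hour clock, no leading zero
        m = total % 60
        if sorted(str(h) + f"{m:02d}") == sorted(digits):
            return step, h, m
    return None

minutes, h, m = next_time_with_digits(3, 52, "235")
print(minutes, h, m)  # 91 minutes later, at 5:23
```

The scan confirms Friend B's direct reasoning: 4:xx can never work (a 4 is always on the display), so the next candidate hour is 5, and 5:23 arrives 91 minutes after 3:52.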

The Problem: "Over-Thinking"

In the world of Artificial Intelligence, specifically "Large Reasoning Models" (LRMs), we have a similar problem. These AI models are getting very good at solving hard problems, but they often suffer from "Over-Thinking." They generate massive amounts of text (Chain-of-Thought) to solve simple problems. They loop back, check their own work too many times, and get stuck in "analysis paralysis." This costs a lot of money and time (computing power) without making the answer any better.

The Solution: CoTJudger

The paper introduces a new tool called CoTJudger. Think of CoTJudger as a super-smart editor or a traffic control system for these AI thoughts.

Here is how it works, using simple analogies:

1. Turning a Story into a Map

When an AI thinks, it writes a long, messy paragraph. CoTJudger takes this paragraph and turns it into a flowchart (a graph).

  • Nodes: Each sentence or thought becomes a "stop" on the map.
  • Arrows: The connections between thoughts become "roads."
  • Loops: If the AI says, "Wait, let me check that again," and then checks it, CoTJudger draws a circle on the map showing it went in a loop.
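The idea above can be sketched with a few lines of Python. This is not the paper's implementation; the step labels and edges are a made-up miniature of Friend A's trace, and a "loop" is simply an edge pointing back to a thought that appeared earlier.

```python
from collections import defaultdict

def build_cot_graph(edges):
    """Turn (from_thought, to_thought) pairs into an adjacency list."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph

def find_loops(edges, order):
    """A back-edge points to a thought that appeared earlier in the trace."""
    position = {step: i for i, step in enumerate(order)}
    return [(s, d) for s, d in edges if position[d] <= position[s]]

# Hypothetical labels for Friend A's messy trace.
order = ["problem", "parse_time", "worry_about_zero",
         "try_5_23", "verify_math", "answer"]
edges = [
    ("problem", "parse_time"),
    ("parse_time", "worry_about_zero"),
    ("worry_about_zero", "parse_time"),   # loop: re-reads the question
    ("parse_time", "try_5_23"),
    ("try_5_23", "verify_math"),
    ("verify_math", "try_5_23"),          # loop: double-checks the answer
    ("verify_math", "answer"),
]

graph = build_cot_graph(edges)
loops = find_loops(edges, order)
print(len(loops))  # two circles drawn on the map
```

Each stop on the map is a node, each road is a directed edge, and the two back-edges are exactly the circles the framework would draw for Friend A's "wait, let me check that again" moments.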

2. Finding the "Shortest Effective Path" (The Golden Route)

Once the map is drawn, CoTJudger asks a simple question: "If you had to get from the Problem to the Answer as fast as possible, which roads would you take?"

It finds the Shortest Effective Path (SEP). This is the "Golden Route"—the absolute minimum steps needed to solve the problem correctly.

  • Friend A's Path: A giant, tangled mess of loops and detours.
  • Friend B's Path: A straight, clean line.
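Once the trace is a graph, finding the Golden Route is a classic shortest-path problem. A plain breadth-first search is one way to do it (the paper may use something more elaborate; the graph below reuses the hypothetical step labels from Friend A's trace):

```python
from collections import deque

def shortest_effective_path(graph, start, goal):
    """Breadth-first search: the fewest hops from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Friend A's tangled graph: the loops exist, but BFS never needs them.
graph = {
    "problem": ["parse_time"],
    "parse_time": ["worry_about_zero", "try_5_23"],
    "worry_about_zero": ["parse_time"],
    "try_5_23": ["verify_math"],
    "verify_math": ["try_5_23", "answer"],
}

sep = shortest_effective_path(graph, "problem", "answer")
print(sep)
```

Note how the detour through `worry_about_zero` and both back-edges vanish: BFS marks each node as visited the first time it is reached, so loops can never appear on the returned path.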

3. Measuring the Waste

CoTJudger compares the AI's actual messy path against the "Golden Route."

  • Redundancy Ratio: It calculates how much of the AI's thinking was just "waste."
    • Example: If the AI wrote 100 sentences, but the Golden Route only needed 10, the Redundancy Ratio is 90%. That means 90% of the work was unnecessary!
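In code, the calculation in that example is a one-liner (my reading of the ratio described above, not necessarily the paper's exact formula):

```python
def redundancy_ratio(total_steps, sep_steps):
    """Fraction of the trace that the Golden Route didn't need."""
    return (total_steps - sep_steps) / total_steps

print(redundancy_ratio(100, 10))  # 0.9 → 90% of the work was unnecessary
```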

What Did They Discover?

The researchers tested 21 different AI models and found some funny and interesting patterns:

  • The "Obsessive Checker": Some models (like DeepSeek-R1) are like people who lock their front door, check it, lock it again, check the window, check the lock again, and then check if the key is in their pocket. They get stuck in loops of "verification" that don't help.
  • The "Wordy Explainer": Other models (like Qwen3-Max) don't loop, but they just talk too much. They explain the same thing in five different ways. It's like a friend who tells a joke, then explains the joke, then explains why the joke is funny, and then tells the joke again.
  • The "Distillation Bloat": When smaller AI models are trained to copy smarter ones (a process called "distillation"), they often copy the bad habits too. They learn to be chatty and inefficient, not just smart.
  • The "Efficient Heroes": Some models, like gpt-oss-120b, were found to be the most efficient. They got the right answer with the least amount of "thinking" waste.

Why Does This Matter?

Imagine you are paying for a taxi ride.

  • Old Way: You just pay for the total distance the car drove. If the driver took a scenic route, drove in circles, and got lost, you pay more.
  • CoTJudger Way: You pay for the direct distance from A to B. If the driver took a detour, CoTJudger tells you, "Hey, you paid for 10 miles, but the trip was only 2 miles. You wasted 8 miles of gas."

This tool helps developers:

  1. Save Money: Stop paying for AI to generate useless text.
  2. Fix Bad Habits: Show AI models exactly where they are looping or being redundant so they can learn to be more efficient.
  3. Build Better AI: Create models that are not just smart, but also fast and frugal.

In short, CoTJudger is the tool that teaches AI models to stop over-thinking, stop talking in circles, and just get to the point.