Imagine you are asking a brilliant but overly chatty friend to solve a math puzzle for you.
Friend A says: "Okay, the time is 3:52. Let me think... 3:52. Is that 3 hours and 52 minutes? Yes. Does it have a zero? Maybe. Wait, does a digital clock show a zero? Sometimes. Let me check my notes on clocks. Okay, if it's 3:52, the digits are 3, 5, 2. But the question asks for the next time using 2, 3, and 5. Hmm. What if it's 03:52? Then there's a zero. That's a problem. Let me re-read the question. No, it says 2, 3, 5. Okay, so 3:52 is fine. But wait, what about 3:53? No, that repeats the 3. What about 3:55? No. Let me try 4:00. No. Let me try 5:23. Oh, that works! But wait, did I check 4:25? No. Let me check 4:25. No. Okay, 5:23 is the answer. It's 91 minutes away. But let me double-check. Is 5:23 the next one? What if I missed 4: something? Let me list all numbers. 2, 3, 5. Okay, 5:23. Yes. 91 minutes. Wait, let me verify the math again. 5:23 minus 3:52. 5 minus 3 is 2 hours. 23 minus 52 is... negative. Oh, I need to borrow an hour. 2 hours is 120 minutes. 120 minus 29 is 91. Okay, 91. But wait, is there a trick? No. 91. Final answer: 91."
Friend B says: "Okay, 3:52. The digits are 3, 5, 2. I need the next time with 2, 3, 5. The next hour is 4, but 4 isn't in the list. The next hour is 5. So the time must be 5:something. The remaining digits are 2 and 3. So it's either 5:23 or 5:32. 5:23 comes first. 3:52 to 5:23 is 1 hour and 31 minutes. That's 91 minutes. Done."
Both friends got the right answer: 91 minutes. But Friend A wasted a lot of time, energy, and paper (or, for a computer, processing power) getting there. They went in circles, doubted themselves, and checked things they didn't need to. Friend B took the direct, efficient path.
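To convince yourself the answer really is 91 minutes, you can brute-force the puzzle in a few lines of Python. This is just an illustrative sketch (the function name and setup are mine, not from the paper): step forward one minute at a time from 3:52 and stop at the first time whose displayed digits are exactly 2, 3, and 5.

```python
def next_digit_time(start_h, start_m, digits="235"):
    """Minutes until the next 12-hour clock time whose digits match `digits`."""
    start = start_h * 60 + start_m
    for delta in range(1, 12 * 60):
        t = (start + delta) % (12 * 60)
        h, m = t // 60 or 12, t % 60  # hour 0 displays as 12
        shown = f"{h}{m:02d}"         # e.g. 5:23 -> "523"
        if sorted(shown) == sorted(digits):
            return delta, f"{h}:{m:02d}"
    return None

print(next_digit_time(3, 52))  # → (91, '5:23')
```

Note that this brute force is itself a "Friend A" strategy: it checks every minute rather than reasoning directly about which hour digit comes next, which is exactly the trade-off the rest of this article is about.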
The Problem: "Over-Thinking"
In the world of Artificial Intelligence, specifically "Large Reasoning Models" (LRMs), we have a similar problem. These AI models are getting very good at solving hard problems, but they often suffer from "Over-Thinking." They generate massive amounts of text (Chain-of-Thought) to solve simple problems. They loop back, check their own work too many times, and get stuck in "analysis paralysis." This costs a lot of money and time (computing power) without making the answer any better.
The Solution: CoTJudger
The paper introduces a new tool called CoTJudger. Think of CoTJudger as a super-smart editor or a traffic control system for these AI thoughts.
Here is how it works, using simple analogies:
1. Turning a Story into a Map
When an AI thinks, it writes a long, messy paragraph. CoTJudger takes this paragraph and turns it into a flowchart (a graph).
- Nodes: Each sentence or thought becomes a "stop" on the map.
- Arrows: The connections between thoughts become "roads."
- Loops: If the AI says, "Wait, let me check that again," and then checks it, CoTJudger draws a circle on the map showing it went in a loop.
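The story-to-map idea can be sketched as ordinary graph-building code. This is a toy illustration under my own assumptions (the paper's actual construction and API will differ): each reasoning step is a node, consecutive steps are connected by edges, and a "let me check that again" step adds a back-edge to the earlier node, which is what draws the circle on the map.

```python
def build_thought_graph(steps, revisits):
    """steps: list of step texts; revisits: {later_index: earlier_index}."""
    # Sequential "roads": each thought leads to the next one.
    edges = [(i, i + 1) for i in range(len(steps) - 1)]
    # Loops: a re-check points back to the step it re-checks.
    edges += [(i, j) for i, j in revisits.items()]
    return edges

steps = ["read problem", "try 4:25", "reject", "try 5:23",
         "re-check 5:23", "answer 91"]
# Step 4 ("re-check 5:23") loops back to step 3 ("try 5:23").
edges = build_thought_graph(steps, revisits={4: 3})
```

Any standard cycle-detection routine run on `edges` would now flag the `(4, 3)` back-edge as a verification loop.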
2. Finding the "Shortest Effective Path" (The Golden Route)
Once the map is drawn, CoTJudger asks a simple question: "If you had to get from the Problem to the Answer as fast as possible, which roads would you take?"
It finds the Shortest Effective Path (SEP). This is the "Golden Route"—the absolute minimum steps needed to solve the problem correctly.
- Friend A's Path: A giant, tangled mess of loops and detours.
- Friend B's Path: A straight, clean line.
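The "Golden Route" can be illustrated with a plain breadth-first search over the thought graph: the fewest hops from the problem node to the answer node. This is generic BFS, not necessarily the paper's exact algorithm.

```python
from collections import deque

def shortest_effective_path(edges, start, goal):
    """Fewest-hops path from start to goal in a directed graph (BFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:  # reconstruct the path by walking parents back
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# A tangled route 0→1→2→1→3 plus a direct shortcut 0→3:
edges = [(0, 1), (1, 2), (2, 1), (1, 3), (0, 3)]
print(shortest_effective_path(edges, 0, 3))  # → [0, 3]
```

Friend A's graph would return a short path buried inside many unused edges; Friend B's graph is already the shortest path.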
3. Measuring the Waste
CoTJudger compares the AI's actual messy path against the "Golden Route."
- Redundancy Ratio: It calculates how much of the AI's thinking was just "waste."
- Example: If the AI wrote 100 sentences, but the Golden Route only needed 10, the Redundancy Ratio is 90%. That means 90% of the work was unnecessary!
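The Redundancy Ratio from the example is a one-line calculation (sketched here in the same spirit; the paper's exact definition may weight steps differently):

```python
def redundancy_ratio(total_steps, golden_steps):
    """Fraction of the actual reasoning that the Golden Route didn't need."""
    return (total_steps - golden_steps) / total_steps

print(redundancy_ratio(100, 10))  # → 0.9, i.e. 90% waste
```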
What Did They Discover?
The researchers tested 21 different AI models and found some funny and interesting patterns:
- The "Obsessive Checker": Some models (like DeepSeek-R1) are like people who lock their front door, check it, lock it again, check the window, check the lock again, and then check if the key is in their pocket. They get stuck in loops of "verification" that don't help.
- The "Wordy Explainer": Other models (like Qwen3-Max) don't loop, but they just talk too much. They explain the same thing in five different ways. It's like a friend who tells a joke, then explains the joke, then explains why the joke is funny, and then tells the joke again.
- The "Distillation Bloat": When smaller AI models are trained to copy smarter ones (a process called "distillation"), they often copy the bad habits too. They learn to be chatty and inefficient, not just smart.
- The "Efficient Heroes": Some models, like gpt-oss-120b, were found to be the most efficient. They got the right answer with the least amount of "thinking" waste.
Why Does This Matter?
Imagine you are paying for a taxi ride.
- Old Way: You just pay for the total distance the car drove. If the driver took a scenic route, drove in circles, and got lost, you pay more.
- CoTJudger Way: You pay for the direct distance from A to B. If the driver took a detour, CoTJudger tells you, "Hey, you paid for 10 miles, but the trip was only 2 miles. You wasted 8 miles of gas."
This tool helps developers:
- Save Money: Stop paying for AI to generate useless text.
- Fix Bad Habits: Show AI models exactly where they are looping or being redundant so they can learn to be more efficient.
- Build Better AI: Create models that are not just smart, but also fast and frugal.
In short, CoTJudger is the tool that teaches AI models to stop over-thinking, stop talking in circles, and just get to the point.