Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement

Imagine you are a manager at a busy call center. Your goal is to solve every customer's problem perfectly.

In the past, your company had a strict rule: "Every call gets exactly 10 minutes of your time."

This rule caused two big problems:

The "Over-Thinker" Problem: If a customer just asked, "What's 2+2?", you spent 10 minutes agonizing over it. You might get confused, change your answer to "3," then "5," and finally give the wrong answer because you over-complicated a simple thing.
The "Under-Thinker" Problem: If a customer asked a complex question like, "How do I fix a broken engine while calculating fuel efficiency?", 10 minutes wasn't enough. You ran out of time, gave up halfway, and gave a wrong answer because you didn't have enough time to finish the job.

This is the exact problem Large Language Models (LLMs) face with many of today's reasoning strategies. They spend the same amount of "brain power" on every question, whether it's easy or hard.

The paper introduces a new system called CoFiCot (Coarse-to-fine Adaptive Reasoning). Think of it as hiring a smart, adaptive supervisor who changes the strategy based on the difficulty of the call.

Here is how CoFiCot works, broken down into simple steps:

1. The "Triage" (The Quick Glance)

Before the AI starts solving the problem, it takes a quick look at the question and generates a few rough drafts of answers.

The Supervisor's Job: It checks these drafts like a doctor triaging patients.
- Are all the drafts agreeing? (High Confidence)
- Do the drafts look high quality? (Reliability)
- Does the question look like it needs a long explanation? (Complexity)

Based on this, the supervisor sorts the questions into three bins: Easy, Medium, and Hard.

2. The "Differentiated Strategy" (Tailored Solutions)

Now, the AI treats these bins differently:

🟢 The Easy Bin (The "Coffee Break" Zone):
If the question is simple (like "What's 2+2?"), the supervisor says, "Stop! You already have the right answer. Don't waste time thinking more."
The AI just picks the best answer from the first few drafts and moves on. This prevents the AI from "over-thinking" and accidentally messing up a simple answer.
🔴 The Medium and Hard Bins (The "Deep Dive" Zone):
If the question is harder (like a tricky math puzzle), the supervisor says, "This is tough. We need to work harder."
The AI enters a correction loop. It doesn't just guess again; it tries to fix its mistakes step-by-step.

3. The "Stateful Correction" (The "Don't Erase the Whole Blackboard" Trick)

This is the most clever part of the paper.

Imagine a student solving a math problem on a blackboard.

Old AI Method (Stateless): If the student makes a mistake in Step 3, the old AI would erase the entire blackboard and start writing the whole problem from Step 1 again. This is slow and often leads to new mistakes because the student forgets the logic they had in Step 1.
CoFiCot Method (Stateful): If the student makes a mistake in Step 3, CoFiCot says, "Wait! Step 1 and Step 2 were perfect. Keep them exactly as they are." It only erases Step 3, fixes it, and then writes Step 4 based on the new Step 3.

This is called Stateful Sequential Correction. It ensures that the AI remembers what it got right and only fixes what is broken, keeping the whole chain of logic connected and strong.

4. The "Quality Control" (The Reward Models)

To make sure the AI is actually fixing things and not just guessing, CoFiCot uses two special "judges":

The Step-Judge (PRM): Checks every single step of the reasoning. "Is this math right? Is this logic sound?"
The Final-Judge (ORM): Looks at the whole answer at the end. "Is this the best possible solution?"

If the Step-Judge finds a mistake, the AI fixes it. If the Final-Judge says the answer is great, the AI stops and submits the answer.

Why is this a big deal?

Saves Money & Time: It doesn't waste computer power on easy questions.
Smarter Results: It gives complex questions the deep thinking they need without getting confused.
No More "Over-thinking": It stops the AI from turning a simple "Yes" into a confusing "Maybe, but actually no..."

In a nutshell: CoFiCot is like a smart manager who knows when to let an employee relax on a simple task and when to step in and help them fix a specific error on a hard task, without making them start the whole project over again. It makes AI reasoning faster, cheaper, and much more accurate.

Here is a detailed technical summary of the paper "Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement."

1. Problem Statement: The Uniform Computation Paradox

Current Large Language Model (LLM) reasoning strategies face a fundamental inefficiency known as the Uniform Computation Paradox.

The Issue: Existing methods (e.g., Self-Consistency, Best-of-k, or fixed iterative refinement) allocate identical computational resources to every query, regardless of difficulty.
Consequences:
- Over-thinking (Simple Tasks): For easy problems, forcing extended reasoning or multiple iterations leads to "over-correction," where the model hallucinates and corrupts a correct initial answer into an incorrect one.
- Under-refinement (Complex Tasks): For difficult problems, a fixed, limited computational budget is insufficient to complete the necessary logical steps, leading to premature termination and unresolved errors.
Limitations of Current Solutions: While some adaptive methods exist, they often lack mechanisms to actively repair errors (focusing only on routing) or operate in a stateless manner, where correcting one step invalidates the logical flow of subsequent steps, causing context fragmentation.

2. Methodology: The CoFiCot Framework

The authors propose CoFiCot, a Coarse-to-fine Adaptive Framework that dynamically tailors inference strategies based on problem difficulty. The framework operates as a two-stage pipeline (Stages 1 and 2), preceded by a data-preparation step (Stage 0):

Stage 0: Data Preparation

The base LLM generates an initial ensemble of $k$ diverse reasoning traces (CoT solutions) via stochastic sampling.
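As a rough sketch of this step, the snippet below draws $k$ drafts from a generator under a fixed random seed. The `generate` callable, the `temperature` parameter, and the canned drafts are all hypothetical stand-ins for a real LLM call; nothing here is taken from the paper's implementation.

```python
import random


def sample_ensemble(generate, question, k, temperature=0.8, seed=0):
    """Draw k reasoning traces from a stochastic generator (hypothetical interface)."""
    rng = random.Random(seed)  # seeded so the toy run is repeatable
    return [generate(question, temperature, rng) for _ in range(k)]


def toy_generate(question, temperature, rng):
    """Stand-in for an LLM: picks one of two canned drafts."""
    drafts = [
        "Step 1: 2 + 2 = 4. Answer: 4",
        "Step 1: 2 + 2 = 5. Answer: 5",
    ]
    # Temperature 0 is treated as greedy decoding: always the first draft.
    return drafts[0] if temperature == 0 else rng.choice(drafts)


ensemble = sample_ensemble(toy_generate, "What is 2 + 2?", k=5)
greedy = sample_ensemble(toy_generate, "What is 2 + 2?", k=5, temperature=0)
```

With a real model, `generate` would wrap a sampling call with a non-zero temperature, so that the drafts can disagree and the triage stage has something to measure.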

Stage 1: Coarse-Grained Difficulty Classification

A multi-metric classifier triages queries into Easy, Medium, or Hard categories using three synthesized metrics:

Confidence Assessment (Semantic Entropy): Measures the consensus reliability of the initial ensemble. Low entropy (high agreement) suggests high confidence.
Reliability Assessment (Z-score): Validates the quality of the consensus. It checks if the majority answer cluster has a high Process Reward Model (PRM) score relative to the global distribution, filtering out "confident hallucinations."
Complexity Assessment (Predicted Steps): Uses a lightweight prompt to predict the number of reasoning steps required ($N_{steps}$) without generating the full solution.
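To make the first two metrics concrete, here is a minimal sketch that computes an entropy over answer clusters and a z-score for the majority cluster. It assumes traces are clustered by exact match on the final answer (a simplification of clustering by meaning) and that each trace carries a single PRM score. The function names and the toy numbers are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def answer_entropy(final_answers):
    """Shannon entropy (in nats) of the distribution over answer clusters.

    Assumption: one cluster per distinct final-answer string.
    """
    counts = Counter(final_answers)
    total = len(final_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_z_score(final_answers, prm_scores):
    """Z-score of the majority cluster's mean PRM score against all traces."""
    majority, _ = Counter(final_answers).most_common(1)[0]
    cluster = [s for a, s in zip(final_answers, prm_scores) if a == majority]
    sigma = pstdev(prm_scores)
    if sigma == 0:
        return 0.0  # every trace scored the same, so there is no signal
    return (mean(cluster) - mean(prm_scores)) / sigma


answers = ["42", "42", "42", "17"]   # toy final answers from four drafts
scores = [0.9, 0.8, 0.85, 0.3]       # toy PRM scores, one per draft
entropy = answer_entropy(answers)
z = majority_z_score(answers, scores)
```

In this toy run the drafts mostly agree (low entropy) and the agreeing drafts also score above the ensemble average (positive z), which is the pattern the triage stage reads as a trustworthy consensus.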

Decision Logic: A weighted synthesis of these metrics determines the difficulty label ($D_{final}$).
- Easy: Bypasses refinement; proceeds directly to aggregation.
- Medium/Hard: Enters the fine-grained refinement loop.
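One way such a weighted synthesis could look is sketched below. The weights, the thresholds, the logistic squash of the z-score, and the cap on predicted steps are all placeholder assumptions chosen for illustration; the summary does not state the actual values or functional form.

```python
import math


def classify_difficulty(entropy, z_score, predicted_steps, k,
                        weights=(0.4, 0.3, 0.3), max_steps=10,
                        easy_below=0.3, hard_above=0.6):
    """Combine three triage signals into an Easy/Medium/Hard label.

    Every default value here is a hypothetical placeholder.
    """
    # Entropy is divided by its maximum, log(k), reached when all k drafts disagree.
    uncertainty = entropy / math.log(k) if k > 1 else 0.0
    # Logistic squash: a high z-score (reliable consensus) gives a value near 0.
    unreliability = 1.0 / (1.0 + math.exp(z_score))
    # Longer predicted solutions count as harder, capped at max_steps.
    complexity = min(predicted_steps, max_steps) / max_steps

    w_conf, w_rel, w_cplx = weights
    score = w_conf * uncertainty + w_rel * unreliability + w_cplx * complexity
    if score < easy_below:
        return "Easy"
    if score > hard_above:
        return "Hard"
    return "Medium"


easy = classify_difficulty(entropy=0.0, z_score=2.0, predicted_steps=2, k=8)
medium = classify_difficulty(entropy=0.5 * math.log(8), z_score=0.0, predicted_steps=5, k=8)
hard = classify_difficulty(entropy=math.log(8), z_score=-2.0, predicted_steps=9, k=8)
```

Mapping each signal to a common 0-to-1 scale before weighting keeps any single metric from dominating simply because of its units.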

Stage 2: Fine-Grained Differentiated Refinement

For Medium and Hard problems, CoFiCot employs an Iterative Correction Loop featuring a Stateful Sequential Correction mechanism:

Error Localization: A Process Reward Model (PRM) scores each step of the reasoning chain. Steps falling below a threshold are flagged as errors.
Stateful Correction ($\Phi$): Unlike stateless methods that regenerate the whole chain, CoFiCot treats correction as a history-generative process:
- It freezes the verified history (steps prior to the error).
- It regenerates the erroneous step and all dependent subsequent steps conditioned on the original question and the verified history.
- This ensures causal consistency, preventing the "domino effect" of invalidating correct logic downstream.
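The localization and correction steps can be sketched together as follows. This is a simplified reading, not the paper's implementation: it flags only the earliest step below the threshold and treats every later step as dependent on it. The `regenerate` callable stands in for an LLM call, and all names are hypothetical.

```python
from typing import List, Optional


def first_error_index(step_scores: List[float], threshold: float) -> Optional[int]:
    """Index of the earliest step whose PRM score is below the threshold, or None."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def stateful_correct(question, steps, step_scores, regenerate, threshold=0.5):
    """Keep the verified prefix and regenerate from the first flagged step onward."""
    idx = first_error_index(step_scores, threshold)
    if idx is None:
        return list(steps)  # nothing flagged, so the chain is returned unchanged
    verified = list(steps[:idx])  # frozen history, never rewritten
    # The generator is conditioned on the question and the frozen history only.
    new_suffix = regenerate(question, verified)
    return verified + list(new_suffix)


def toy_regenerate(question, verified):
    """Stand-in for the LLM: returns a canned corrected suffix."""
    return ["3 + 4 = 7", "7 * 2 = 14"]


steps = ["Let x = 3", "Let y = 4", "3 + 4 = 8", "8 * 2 = 16"]
scores = [0.95, 0.9, 0.2, 0.6]  # toy PRM scores; the third step is flagged
fixed = stateful_correct("Compute 2 * (x + y).", steps, scores, toy_regenerate)
```

Note that the fourth step is regenerated even though its own score passes the threshold, because it was built on the faulty third step.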
Selection & Termination:
- An Outcome Reward Model (ORM) evaluates the full refined solutions.
- The top-$k$ solutions are selected for the next iteration.
- Dynamic Early Exit: If the re-evaluated difficulty drops to "Easy," the loop terminates immediately to save compute.
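A minimal sketch of the loop's control flow, under stated assumptions: `correct`, `orm_score`, and `classify` are hypothetical callables standing in for the stateful corrector, the ORM, and the difficulty classifier, and `max_iters` is an assumed safety cap. The toy run represents each candidate simply by its number of remaining errors.

```python
def refine(question, candidates, correct, orm_score, classify, top_k=2, max_iters=3):
    """Iterate correction, ORM-based selection, and difficulty re-evaluation."""
    for _ in range(max_iters):
        corrected = [correct(question, c) for c in candidates]
        # Keep the top_k candidates by outcome reward for the next round.
        candidates = sorted(corrected, key=orm_score, reverse=True)[:top_k]
        if classify(question, candidates) == "Easy":
            break  # dynamic early exit: no further compute is spent
    return max(candidates, key=orm_score)


calls = {"n": 0}  # counts correction calls, to make the early exit visible


def toy_correct(question, errors):
    calls["n"] += 1
    return max(errors - 1, 0)  # each pass removes one error


def toy_orm(errors):
    return -errors  # fewer errors means a higher reward


def toy_classify(question, candidates):
    return "Easy" if all(e == 0 for e in candidates) else "Hard"


best = refine("toy question", [3, 1, 2], toy_correct, toy_orm, toy_classify, max_iters=5)
```

Here the loop stops after two of the five allowed iterations, since both surviving candidates are error-free by then.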

3. Key Contributions

Adaptive Framework (CoFiCot): A novel architecture that resolves the uniform computation paradox by dynamically routing queries to either efficient aggregation (for easy tasks) or iterative refinement (for hard tasks).
Stateful Sequential Correction: A mechanism that formalizes error correction as a state-dependent trajectory. By anchoring verified history and regenerating only the dependent branch, it bridges the gap between granular error localization and global logical coherence.
Multi-Metric Triage: A robust classification system combining semantic entropy, consensus reliability, and predicted reasoning depth to accurately assess problem difficulty without human intervention.
Modularity: The framework is model-agnostic and can integrate various PRMs and ORMs, scaling performance with the quality of the reward models.

4. Experimental Results

The authors evaluated CoFiCot on seven benchmarks (Mathematical: GSM8K, MATH, SVAMP, SAT, MMLU; General: ARC, Date) using Llama-3-8B and GPT-3.5-Turbo.

Performance Gains:
- Llama-3-8B: Achieved 75.0% average accuracy, outperforming the strongest baseline (Best-of-k, $k=120$) by 4.0 percentage points. On the challenging MATH dataset, it improved accuracy by 6.5 percentage points (47.9% vs. 41.4%).
- GPT-3.5-Turbo: Achieved 80.5% average accuracy, surpassing the best baseline by 3.2 percentage points.
- Generalization: Demonstrated strong performance on non-math tasks (ARC: 88.2%, Date: 80.8%), indicating applicability beyond mathematical domains.
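As a quick check on how to read these gains, the MATH figures quoted above for Llama-3-8B (47.9% vs. 41.4%) can be expressed both as an absolute gap and as a relative improvement over the baseline. The relative figure is derived here from the two quoted accuracies and is not a number reported in the summary.

$$
\Delta_{\text{abs}} = 47.9 - 41.4 = 6.5 \text{ points}, \qquad
\Delta_{\text{rel}} = \frac{6.5}{41.4} \approx 15.7\%
$$

The absolute gap matches the quoted 6.5, which suggests the reported gains are differences in accuracy rather than relative percentages.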
Efficiency vs. Accuracy:
- CoFiCot achieved higher accuracy than baselines while consuming fewer or comparable tokens.
- Unlike baselines that suffer from performance saturation (accuracy plateaus as $k$ increases from 40 to 120), CoFiCot scales effectively, starting from a higher baseline with $k=40$ and outperforming $k=120$ baselines.
Ablation Studies:
- Removing the Coarse Stage led to over-correction on easy tasks (accuracy drop).
- Removing the Fine Stage caused a collapse on hard tasks (MATH accuracy dropped by 6.7%).
- Both PRM and ORM are essential; using only one leads to significant performance degradation.

5. Significance

Paradigm Shift: CoFiCot shifts the emphasis from "brute-force" scaling (allocating more compute to everything) to adaptive allocation of compute, mimicking human metacognitive triage.
Solving the Over-correction Problem: It provides a concrete solution to the issue where LLMs "think too much" and ruin correct answers, a critical flaw in current iterative refinement methods.
Logical Coherence: The Stateful approach solves the context fragmentation issue common in iterative methods, ensuring that corrections propagate logically without breaking the chain of thought.
Cost-Effectiveness: By dynamically exiting the refinement loop for easy problems, CoFiCot offers a superior trade-off between computational cost (tokens/latency) and reasoning accuracy, making advanced reasoning more feasible for deployment.