TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Imagine you are asking a brilliant, over-enthusiastic student to solve a math problem.

The Problem: The "Overthinking" Student
This student (the AI) is incredibly smart. When you ask, "What is 2 plus 3?", they don't just say "5." Instead, they start a long monologue: "Let me think... 2 is a number. 3 is a number. If I put them together... hmm... wait, let me double-check. Is 2 plus 3 the same as 3 plus 2? Yes. Okay, I'm pretty sure it's 5. But just to be safe, let me write out the addition table again. And maybe check the history of numbers..."

They keep talking for thousands of words, even though they figured out the answer ("5") in the first sentence. This is called "overthinking." It wastes time, costs money (because computers use electricity to generate every word), and slows everything down.

The Goal: The "Stop" Button
Researchers wanted to teach this student when to stop talking. They knew there was a "perfect moment" to cut them off: right after they said "5" but before they started rambling about addition tables. If you cut them off there, they still get the right answer, but you save up to roughly half of the time and energy.

The hard part? The student doesn't know when they are done. They just keep going until they run out of things to say.

The Solution: TERMINATOR
The paper introduces a new tool called TERMINATOR. Think of TERMINATOR not as a robot killer, but as a super-attentive TA (Teaching Assistant) sitting right next to the student.

Here is how it works, using a few analogies:

1. The "Hindsight" Training

First, the researchers needed to teach the TA what "done" looks like. They went back and looked at thousands of past conversations.

The Trick: They asked the student, "When was the very first time you actually said the answer?"
The Lesson: They marked that exact moment as the "Golden Stop Point." They taught the TA: "If the student says the answer, and then keeps talking, that's just fluff. Stop them immediately after the answer."

2. Reading the "Brain Waves"

You might think the TA just listens for the word "5." But the student might say "5" early on by accident, then change their mind. So, the TA looks deeper.

The Confidence Meter: The researchers noticed that when the student finally figures out the answer, their "confidence" spikes. It's like a sudden burst of energy.
The "Thinking" Tokens: The student uses specific filler words like "hmm," "wait," or "let me check" before they are sure. Once they have the answer, they stop using those words and start using words like "therefore" or "so."
The TA's Job: TERMINATOR is a tiny, super-fast detector that watches these "brain waves" (confidence levels) and "talking habits" (word choices) in real time.

3. The "Sliding Window" Decision

TERMINATOR doesn't make a decision based on just one word. It looks at the last 10 words the student said.

If the TA sees a pattern where the student is confident and has stopped using "hmm" words, the TA raises a red flag.
Once the flag is raised enough times (a "majority vote"), the TA slams the Stop Button.
The student is forced to stop generating new words and immediately output the final answer.

Why is this a big deal?

Imagine you are paying for a taxi ride.

The Old Way: The driver takes you to your destination, but then keeps driving around the neighborhood for another hour just to "make sure they didn't miss a turn." You pay for the extra hour.
The TERMINATOR Way: A smart co-pilot sits in the back. The moment the driver says, "We're here," the co-pilot says, "Great, stop the car!" You arrive at the same time, but you only pay for the trip you actually needed.

The Results:
The paper tested this on hard math, coding, and science problems.

Speed: It cut the thinking time by 14% to 55%.
Accuracy: The answers were just as good as if the student had talked for the full hour.
Versatility: It worked on different types of "students" (different AI models) and different types of problems.

In a Nutshell:
TERMINATOR is a smart "stop-watch" for AI. It learns to recognize the exact moment an AI has solved a problem and cuts off the unnecessary rambling, saving time and money without losing any intelligence. It turns an over-enthusiastic student into an efficient one.

1. Problem Statement

Large Reasoning Models (LRMs) achieve high performance on complex tasks by generating extensive Chain-of-Thought (CoT) reasoning traces before producing a final answer. However, this capability leads to a phenomenon known as "overthinking," where models continue to generate tokens (double-checking, exploring alternative paths, or re-verifying) even after the correct final answer has logically been derived.

The Challenge: While prior work suggests an "optimal reasoning length" exists where truncating the CoT saves compute without hurting accuracy, determining this length dynamically during inference is non-trivial. Existing methods either require retraining the LRM, rely on heuristic thresholds that are dataset-specific, or fail to identify the exact moment the answer is first logically derived.
The Goal: To design an inference-time early-exit strategy that identifies the precise moment the LRM's final answer ($\hat{a}$) first appears in the CoT, allowing the model to stop reasoning immediately and inject the end-of-sequence token.

2. Methodology: TERMINATOR

The authors propose TERMINATOR, a novel inference-time early-exit algorithm that utilizes a binary probe classifier to predict whether the final answer has been generated.

A. Core Concept: Hindsight-Optimal Reasoning Length (HORL)

The paper introduces the concept of HORL: the minimum number of tokens an LRM must generate to arrive at its final answer $\hat{a}$.

The target is the model's own generated answer ($\hat{a}$), not the ground-truth answer ($a$).
The "optimal" exit point is defined as the first logical arrival of $\hat{a}$ in the CoT sequence. Any tokens generated after this point are considered redundant.
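One way to write this definition down is sketched below. It reuses the symbols $r$ (the CoT) and $i^*$ (the exit index) that the data curation pipeline also uses, and assumes 1-indexed token positions. The predicate name $\mathrm{derived}$ is an assumption introduced here for illustration, not notation from the paper.

$$
i^* = \min \{\, i : \mathrm{derived}(r_{1:i}, \hat{a}) \,\}, \qquad \mathrm{HORL}(r) = i^*
$$

Here $\mathrm{derived}(r_{1:i}, \hat{a})$ is true when the prefix $r_{1:i}$ already contains the first logical arrival of $\hat{a}$, so every token at a position greater than $i^*$ counts as redundant.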

B. Data Curation Pipeline

To train the exit predictor, the authors constructed a dataset of optimal-length CoTs using a robust, automated pipeline:

Answer Extraction: An LRM extracts the final answer $\hat{a}$ from the full solution $s$.
Answer Identification: The LRM identifies the specific text span $d$ in the CoT $r$ that leads to the first occurrence of $\hat{a}$.
Verification: The LRM verifies if the identified span $d$ actually contains $\hat{a}$. If not, it retries with feedback.
Token Index Extraction: The exact token index $i^*$ of the first occurrence is extracted.

Significance: This pipeline avoids expensive human annotation and handles complex answer formats (numbers, LaTeX, code) that simple regex matching fails to capture.
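Once the pipeline has produced the index $i^*$, turning it into per-token supervision is mechanical. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, positions are 0-indexed, and it assumes the token at $i^*$ itself counts as "answer generated".

```python
def make_probe_labels(num_tokens: int, first_answer_idx: int) -> list[int]:
    """Label each CoT token position: 0 before i*, 1 from i* onward.

    Assumption: position i* itself is labeled 1.
    """
    if not 0 <= first_answer_idx < num_tokens:
        raise ValueError("first_answer_idx must be a valid token position")
    return [0 if i < first_answer_idx else 1 for i in range(num_tokens)]


def truncate_to_optimal_length(tokens: list[str], first_answer_idx: int) -> list[str]:
    """Keep tokens up to and including position i*; later tokens are treated as redundant."""
    return tokens[: first_answer_idx + 1]


if __name__ == "__main__":
    cot = ["2", "+", "3", "=", "5", "wait", "check"]
    print(make_probe_labels(len(cot), 4))      # [0, 0, 0, 0, 1, 1, 1]
    print(truncate_to_optimal_length(cot, 4))  # ['2', '+', '3', '=', '5']
```

In this toy trace the answer first appears at position 4, so the last two tokens are the "overthinking" tail that an optimal-length CoT would drop.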

C. Signal Analysis & Motivation

The authors analyzed LRM behavior to find observable signals indicating the arrival of $\hat{a}$:

Token-Confidence Spikes: By aligning CoTs to the position of the first answer (event-locked averaging), they observed a sharp spike in Token-Confidence (inverse of entropy) exactly when $\hat{a}$ is generated, followed by a drop as the model begins to "doubt" itself or overthink.
Thinking Token Shifts: The frequency of specific "thinking tokens" (e.g., "hmm", "okay", "another") shifts significantly before and after the answer. For instance, "hmm" and "okay" appear more frequently before the answer, while "another" appears more frequently after.
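Event-locked averaging can be illustrated with a short sketch. The code below assumes confidence is measured as negative Shannon entropy of the next-token distribution, which is one reading of "inverse of entropy" and may differ from the paper's exact definition. The function names and the choice to skip out-of-range offsets are also assumptions made for this example.

```python
import math


def token_confidence(probs: list[float]) -> float:
    """Negative Shannon entropy of one next-token distribution (higher = more confident)."""
    return sum(p * math.log(p) for p in probs if p > 0.0)


def event_locked_average(
    traces: list[list[float]], events: list[int], half_width: int
) -> list[float]:
    """Average per-token values across traces after aligning each at its event index.

    Offsets run from -half_width to +half_width. An offset that falls outside
    a trace is skipped for that trace rather than padded.
    """
    size = 2 * half_width + 1
    sums = [0.0] * size
    counts = [0] * size
    for values, event in zip(traces, events):
        for k in range(size):
            pos = event + (k - half_width)
            if 0 <= pos < len(values):
                sums[k] += values[pos]
                counts[k] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # Two toy confidence traces whose first-answer positions are 2 and 1.
    traces = [[0.1, 0.2, 0.9, 0.3], [0.2, 0.8, 0.4]]
    curve = event_locked_average(traces, events=[2, 1], half_width=1)
    print(curve)  # three values for offsets -1, 0, +1; the middle one is largest
```

Because every trace is shifted so that its answer position sits at offset 0, a spike that occurs at different absolute positions in different CoTs adds up instead of averaging out.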

D. The TERMINATOR Model

Architecture: A lightweight binary probe classifier (a single transformer layer initialized from the LRM's final layer + a prediction head).
Input: The hidden states of the LRM at each CoT token position.
Task: Binary classification ($0$ = answer not yet generated, $1$ = answer generated).
Training: Uses class-weighted binary cross-entropy loss to handle the severe class imbalance (most tokens are "before" the answer).
Inference:
- The probe predicts a probability $p_i$ for every token.
- A sliding window (size 10) monitors the predictions.
- If the majority of predictions in the window are $1$ (confidence $>0.7$ ), the system injects the `
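The training loss and the exit rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the inverse-frequency class weights are a common choice that the summary does not specify, and the exit rule follows one reading of the text, in which a prediction counts as $1$ when $p_i > 0.7$ and the rule fires once more than half of a full 10-token window is $1$. Both function names are hypothetical.

```python
import math
from collections import deque


def class_weighted_bce(probs: list[float], labels: list[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy with inverse-frequency class weights.

    Assumption: weight = n / (2 * class_count), so the rare class is not
    drowned out by the frequent one.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    w_pos = n / (2 * n_pos) if n_pos else 0.0
    w_neg = n / (2 * n_neg) if n_neg else 0.0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1.0 - p)
    return total / n


def first_exit_position(probs: list[float], window: int = 10, threshold: float = 0.7):
    """Return the first token index where the majority-vote rule fires, else None."""
    recent = deque(maxlen=window)  # keeps only the last `window` binary predictions
    for i, p in enumerate(probs):
        recent.append(1 if p > threshold else 0)
        if len(recent) == window and sum(recent) > window / 2:
            return i
    return None


if __name__ == "__main__":
    probe_probs = [0.1] * 10 + [0.9] * 10
    print(first_exit_position(probe_probs))  # 15: six of the last ten predictions are 1
```

Requiring a majority over a window, rather than a single confident prediction, keeps one isolated high-probability token from ending the reasoning early.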
