Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Imagine you have a very smart but chatty friend (an AI) who is great at solving math problems. Usually, when you ask them a question, they don't just give you the answer. They write out a long, step-by-step diary of their thought process: "Okay, first I need to add these numbers, then I realize I made a mistake, so I subtract that, then I multiply..."

This is called Chain-of-Thought (CoT). It works well, but it's slow and expensive. It's like hiring a lawyer who charges by the word; the more they talk to explain their logic, the more it costs you, and the longer you have to wait for the verdict.

The Problem: Too Much Talking, Not Enough Thinking

Researchers noticed that sometimes, the AI doesn't need to write down every single thought to get the right answer. It just needs to "think" about it internally.

Previous attempts to fix this tried to make the AI "think silently" inside its own brain (in a hidden "latent space") without writing anything down. But there was a catch: The AI didn't know when to stop thinking.

It was like asking your friend, "Solve this, but think for exactly 10 minutes."

If the problem was easy (like $2+2$), they wasted 9 minutes and 50 seconds just staring at the wall.
If the problem was hard, 10 minutes wasn't enough, and they gave up too soon.

The Solution: AdaAnchor (The "Smart Pause Button")

The paper introduces a new method called AdaAnchor. Here is how it works, using a simple analogy:

1. The "Mental Scratchpad" (Latent Anchors)

Instead of writing on a piece of paper (tokens), the AI uses a special mental scratchpad. Imagine a set of invisible sticky notes attached to the question.

Step 1: The AI looks at the question and writes a rough idea on the sticky notes.
Step 2: It looks at the notes, thinks, and rewrites the notes with a better idea.
Step 3: It repeats this, refining the notes over and over.

Crucially, none of this is written down for you to see. The AI is just updating these invisible notes in its head.

2. The "Stability Check" (Adaptive Halting)

This is the magic part. The AI has a built-in rule: "Stop thinking when the notes stop changing."

The Scenario: Imagine you are trying to solve a riddle. You keep guessing answers in your head.
- Guess 1: "Maybe it's a cat?" (Notes change)
- Guess 2: "No, maybe a dog?" (Notes change)
- Guess 3: "Wait, it's definitely a dog." (Notes change slightly)
- Guess 4: "Yeah, it's a dog." (Notes are essentially the same as before)

The moment the AI realizes, "Hey, my thoughts aren't changing anymore; I've found the answer," it hits the Stop Button.

Easy Problem: The AI thinks for 2 seconds, realizes it's done, and says the answer.
Hard Problem: The AI thinks for 20 seconds, refining its notes until they finally settle, then says the answer.

Why This is a Big Deal

The researchers tested this on math problems and found three amazing things:

It's Faster and Cheaper: Because the AI skips the written reasoning and outputs only the answer, it generates about 92–93% fewer "words" than it would with Chain-of-Thought. It's like going from a 10-page essay to a one-word answer, without giving up the thinking behind it.
It's Smarter: When the AI decides how long to think based on how hard the problem is, it gets more accurate than when it is forced to think for a fixed amount of time. It spends more time on hard problems and less time on easy ones.
It's Efficient: It saves money and energy because the computer doesn't have to process thousands of extra words that nobody reads.

The Bottom Line

AdaAnchor is like giving an AI a "smart pause button." Instead of forcing it to write a long diary entry for every problem, it lets the AI do its thinking silently on invisible sticky notes. It keeps refining those notes until they stop changing, then it just hands you the final answer.

This means we can have AI that thinks deeply and solves hard problems, but doesn't waste time or money chattering about how it did it.

1. Problem Statement

Large Language Models (LLMs) typically rely on Chain-of-Thought (CoT) prompting, which generates explicit intermediate reasoning tokens to solve complex problems (e.g., mathematical word problems). While effective, this approach has significant drawbacks:

Computational Cost: Generating long reasoning traces increases inference latency, token usage, and serving costs, particularly in high-concurrency deployments.
Inefficiency: Models often perform unnecessary computation on easy problems, and the fixed nature of token generation prevents dynamic resource allocation.
Limitations of Existing Latent Methods: While "latent reasoning" approaches exist to shift computation into hidden states (avoiding token generation), many rely on a fixed number of refinement steps ($K$) at inference. This introduces a hyperparameter that must be manually tuned for every model and dataset to balance accuracy and efficiency, failing to adapt to instance difficulty.

Goal: Develop a reasoning framework that performs multi-step computation implicitly (in the latent space) without generating intermediate tokens, while dynamically adjusting the computation budget based on the difficulty of each specific problem instance.

2. Methodology: AdaAnchor

The authors propose AdaAnchor, a framework that enables "silent" iterative reasoning by refining a set of learnable latent anchor vectors attached to the input.

A. Core Architecture

Anchor Vectors: Instead of generating text, the model maintains a compact set of $m$ learnable anchor vectors, $A^{(t)} \in \mathbb{R}^{m \times d}$, where $t$ is the refinement iteration.
Input Augmentation: At each step, these anchors are mapped into the embedding space by a projection $P$ and prepended to the input token embeddings: $E^{(t)} = [P(A^{(t)}); \text{Emb}(x)]$.
Iterative Refinement: The model performs a forward pass on this augmented input. The hidden states corresponding to the anchor positions are extracted and used to update the anchor vectors for the next iteration.
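- Update Rule: $A^{(t+1)} = (1-\beta)A^{(t)} + \beta A^{(t+1)}_{\text{new}}$, where $A^{(t+1)}_{\text{new}}$ denotes the hidden states extracted at the anchor positions in the current forward pass and $\beta$ is a smoothing factor to ensure stable convergence.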
Answer-Only Decoding: Once refinement terminates, the model generates only the final answer conditioned on the refined anchors and the original question. No intermediate reasoning tokens are emitted.
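
The refinement loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_forward` and `identity_projection` are hypothetical stand-ins for the LLM forward pass and the projection $P$, and the function names are invented for this example. Only the augmentation $[P(A^{(t)}); \text{Emb}(x)]$ and the smoothed update with $\beta$ follow the text.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]  # one row per position, d columns


def augment(anchors: Matrix, token_embeddings: Matrix,
            project: Callable[[Matrix], Matrix]) -> Matrix:
    """E^(t) = [P(A^(t)); Emb(x)]: projected anchors prepended to the tokens."""
    return project(anchors) + token_embeddings


def ema_update(anchors: Matrix, proposed: Matrix, beta: float) -> Matrix:
    """A^(t+1) = (1 - beta) * A^(t) + beta * A_new^(t+1), element-wise."""
    return [
        [(1.0 - beta) * a + beta * p for a, p in zip(a_row, p_row)]
        for a_row, p_row in zip(anchors, proposed)
    ]


def refine(anchors: Matrix, token_embeddings: Matrix,
           forward: Callable[[Matrix], Matrix],
           project: Callable[[Matrix], Matrix],
           beta: float, num_steps: int) -> Matrix:
    """Run a fixed number of silent refinement steps (no tokens are generated)."""
    m = len(anchors)
    for _ in range(num_steps):
        hidden = forward(augment(anchors, token_embeddings, project))
        proposed = hidden[:m]  # hidden states at the m anchor positions
        anchors = ema_update(anchors, proposed, beta)
    return anchors


# --- Hypothetical stand-ins, NOT a real transformer ---
def identity_projection(anchors: Matrix) -> Matrix:
    return [row[:] for row in anchors]


def toy_forward(inputs: Matrix) -> Matrix:
    """Every position's 'hidden state' is the mean of all input rows."""
    n = len(inputs)
    mean = [sum(column) / n for column in zip(*inputs)]
    return [mean[:] for _ in inputs]


refined = refine(
    anchors=[[0.0, 0.0]],             # m = 1 anchor, d = 2
    token_embeddings=[[2.0, 4.0]],    # a one-token "question"
    forward=toy_forward,
    project=identity_projection,
    beta=0.5,
    num_steps=2,
)
print(refined)  # [[0.875, 1.75]]
```

In this toy setup the single anchor drifts toward the question embedding `[2.0, 4.0]` a little more with each step, which mirrors the idea that the anchors gradually absorb information about the input while nothing is decoded.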

B. Adaptive Halting Mechanism

A key innovation is the stability-based adaptive halting strategy, which replaces fixed-step refinement with dynamic instance-wise computation:

Stability Metric: The system monitors the change in the anchor state between iterations. It calculates the cosine dissimilarity between the mean anchor representation of the current step ($\bar{a}^{(t)}$) and the previous step:
$\Delta^{(t)} = 1 - \cos(\bar{a}^{(t)}, \bar{a}^{(t-1)})$
Halting Rule: Refinement stops when the update magnitude $\Delta^{(t)}$ remains below a threshold $\tau$ for $s$ consecutive steps.
Benefit: Easy instances converge quickly (fewer steps), while hard instances continue refining until convergence or a shared maximum budget ($K_{max}$) is reached. This eliminates the need to tune a fixed step count per dataset.
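
A minimal sketch of the halting rule, assuming the mean anchor $\bar{a}^{(t)}$ is the plain average of the $m$ anchor rows and that the first comparison is made against the initial anchors (the summary does not state either detail). The two `*_step` functions are hypothetical toy dynamics standing in for one refinement pass of the model, and the values of `tau`, `s`, and `k_max` are illustrative rather than taken from the paper.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[Vector]  # m anchor rows, d columns


def mean_anchor(anchors: Matrix) -> Vector:
    """Average the m anchor vectors into one d-dimensional vector."""
    m = len(anchors)
    return [sum(column) / m for column in zip(*anchors)]


def cosine_dissimilarity(u: Vector, v: Vector) -> float:
    """Delta = 1 - cos(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector counts as "not yet stable"
    return 1.0 - dot / (norm_u * norm_v)


def refine_until_stable(anchors: Matrix, step: Callable[[Matrix], Matrix],
                        tau: float, s: int, k_max: int) -> Tuple[Matrix, int]:
    """Refine until Delta < tau for s consecutive steps, or k_max is reached."""
    previous_mean = mean_anchor(anchors)
    stable_steps = 0
    for t in range(1, k_max + 1):
        anchors = step(anchors)
        current_mean = mean_anchor(anchors)
        delta = cosine_dissimilarity(current_mean, previous_mean)
        stable_steps = stable_steps + 1 if delta < tau else 0
        if stable_steps >= s:
            return anchors, t  # converged: halt early
        previous_mean = current_mean
    return anchors, k_max  # budget exhausted without converging


def settling_step(anchors: Matrix) -> Matrix:
    """Toy 'easy' dynamics: every coordinate moves halfway toward 1.0."""
    return [[0.5 * (x + 1.0) for x in row] for row in anchors]


def rotating_step(anchors: Matrix) -> Matrix:
    """Toy 'hard' dynamics: each 2-D anchor rotates 90 degrees, never settling."""
    return [[-row[1], row[0]] for row in anchors]


_, easy_steps = refine_until_stable([[1.0, 0.0]], settling_step, tau=0.01, s=2, k_max=8)
_, hard_steps = refine_until_stable([[1.0, 0.0]], rotating_step, tau=0.01, s=2, k_max=8)
print(easy_steps, hard_steps)  # 4 8
```

The settling dynamics halt after four steps, while the rotating dynamics never stabilize and use the full budget of eight, which is the instance-wise behavior the halting rule is meant to produce.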

3. Key Contributions

Implicit Reasoning Framework: Introduces a method to perform multi-step reasoning entirely within the latent space using learnable anchor vectors, avoiding the overhead of generating intermediate text tokens.
Adaptive Halting: Proposes a convergence-driven stopping criterion that dynamically allocates compute resources. It avoids over-computation on easy problems and under-computation on hard ones without requiring a separate halting controller or per-dataset hyperparameter tuning.
Efficiency-Accuracy Trade-off: Demonstrates that shifting computation to silent latent refinement can achieve competitive or superior accuracy compared to standard CoT while drastically reducing output token usage.

4. Experimental Results

The method was evaluated on three mathematical word-problem benchmarks: GSM8K, SVAMP, and MultiArith, using small backbone models (Qwen2.5-1.5B and Llama-3.2-1B).

Performance Metrics

Accuracy:
- AdaAnchor with adaptive halting outperformed fixed-step latent refinement ($K=8$) by up to 5% in accuracy.
- It significantly outperformed "No CoT" baselines (gains of ~23–64% depending on the model).
Token Efficiency:
- Compared to explicit Chain-of-Thought (CoT), AdaAnchor reduced generated output tokens by 92–93%.
- It maintained an "answer-only" format, drastically lowering inference costs.
Step Efficiency:
- Under the same maximum step budget ($K_{max}=8$), the adaptive halting mechanism reduced the average number of latent refinement steps by 48–60%.
- The distribution of steps showed that the model naturally stopped early for ~60% of examples, reserving the full budget for harder instances.
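
To make the step savings concrete, here is the arithmetic, assuming the 48–60% reduction is measured against always running the full $K_{max}=8$ steps (the summary does not spell out the baseline). Writing $r$ for the fractional reduction and $\bar{K}$ for the resulting average step count, both symbols introduced here for illustration:
$\bar{K} = K_{max}(1 - r), \quad r \in [0.48, 0.60] \;\Rightarrow\; \bar{K} \in [8 \times 0.40,\; 8 \times 0.52] = [3.2,\; 4.16]$
Under that assumption, the model would average roughly three to four latent refinement steps per problem instead of eight.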

Ablation Studies

Fixed vs. Adaptive: Fixed-step refinement showed diminishing returns as $K$ increased, whereas adaptive halting maintained high accuracy with fewer average steps.
Stability Criterion: The cosine-based stability metric proved effective in identifying convergence points without additional training overhead.

5. Significance and Future Work

Significance:
AdaAnchor offers a practical solution to the "reasoning vs. cost" dilemma in LLMs. By moving computation from the discrete token space to a continuous latent space and introducing an adaptive stopping mechanism, it enables instance-wise compute allocation. This makes implicit reasoning more deployable in real-world scenarios where latency and token costs are critical constraints.

Limitations & Future Directions:

Heuristic Reliance: The current halting mechanism relies on a hand-designed stability heuristic, which may be sensitive to hyperparameters or distribution shifts. A suggested direction for future work is to learn the halting policy instead, via reinforcement learning or supervised controllers.
Interpretability: Unlike explicit CoT, the semantics of the latent anchor vectors are not directly interpretable. Future research aims to develop probing tools to visualize anchor trajectories and align them with human-interpretable sub-computations.

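In summary, AdaAnchor represents a shift from "thinking out loud" (generating tokens) to "thinking silently" (refining latent states). In the reported experiments it cuts output tokens by 92–93% relative to explicit CoT and improves on fixed-step latent refinement by up to 5% in accuracy while using fewer refinement steps on average.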