Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought

This paper theoretically analyzes and experimentally validates how the superposition mechanism in Chain of Continuous Thought emerges naturally when two-layer transformers are trained on directed graph reachability, showing that a bounded index-matching logit balances exploration and exploitation to enable implicit parallel reasoning.

Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian

Published 2026-03-03

Imagine you are trying to solve a massive maze. You are standing at the entrance, and there are thousands of possible paths branching out in front of you. Your goal is to find the exit.

This paper is about how a specific type of Artificial Intelligence (AI) learns to solve these mazes not by guessing one path at a time, but by exploring many paths simultaneously inside its own "brain."

Here is the breakdown of the paper's discovery using simple analogies:

1. The Problem: The "One-Path" Trap

Traditional AI models (like the ones that chat with you) usually think in a "discrete" way. Imagine they are walking through the maze with a blindfold, forced to pick one hallway to walk down.

  • If they pick the wrong hallway, they have to turn around, go back to the start, and try a different one.
  • This is slow and inefficient. If the maze is huge, they might get stuck or give up.
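The one-path trap can be made concrete with a toy directed graph (a hypothetical example, not one from the paper): a walker that commits to a single outgoing edge at a time can dead-end even though the goal is reachable down another branch.

```python
# Toy directed graph (hypothetical): node -> list of successors.
graph = {
    "start": ["a", "b"],
    "a": ["dead_end"],   # a plausible-looking branch that goes nowhere
    "dead_end": [],
    "b": ["goal"],
    "goal": [],
}

def greedy_walk(graph, start, goal, pick=lambda succs: succs[0]):
    """Follow one edge at a time with no backtracking (the 'blindfolded' walker)."""
    node = start
    path = [node]
    while graph[node]:
        node = pick(graph[node])  # commit to a single hallway
        path.append(node)
        if node == goal:
            return path
    return None  # stuck in a dead end

print(greedy_walk(graph, "start", "goal"))  # None: committed to "a" and got stuck
```

With the default choice the walker picks "a" and fails, even though "start → b → goal" exists; recovering requires restarting and trying a different branch, which is exactly the inefficiency described above.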

2. The Solution: "Continuous Thought" (The Superpower)

The paper studies a newer method called Chain of Continuous Thought (Coconut).

  • The Analogy: Instead of walking down one hallway, imagine the AI has a magical ability to split its consciousness. It doesn't have to choose just one path. It can send out "ghosts" of itself down all the promising hallways at the exact same time.
  • In technical terms, instead of thinking in words (discrete tokens), it thinks in a smooth, continuous flow of numbers (a "latent space"). This allows it to hold multiple possibilities in its mind at once. This is called Superposition.
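In spirit (a simplified sketch, not the paper's exact construction), a continuous thought can be a weighted mixture of the embeddings of all currently reachable nodes, so one vector encodes the whole search frontier at once:

```python
import numpy as np

# Orthonormal toy embeddings (one-hot), a hypothetical stand-in for learned ones.
nodes = ["a", "b", "c", "d"]
emb = {n: np.eye(len(nodes))[i] for i, n in enumerate(nodes)}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Discrete thought: commit to a single node.
discrete_thought = emb["a"]

# Continuous thought: a uniform mixture over the current frontier {a, b, c}.
frontier = ["a", "b", "c"]
continuous_thought = sum(emb[n] for n in frontier) / len(frontier)

# One vector stays close to every frontier member at once -- the superposition.
for n in nodes:
    print(n, round(cos(continuous_thought, emb[n]), 3))
# a, b, c each score ~0.577 (= 1/sqrt(3)); the excluded node d scores 0.0
```

The mixture is equally similar to every node on the frontier and orthogonal to nodes off it, which is what lets the next layer act on "paths A, B, and C" simultaneously.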

3. The Big Question: How does it learn to do this?

Previous research showed that if you hand-craft the AI's brain with the right settings, it can do this super-powerful parallel thinking. But the big mystery was: Can an AI learn this on its own just by practicing?

The authors asked: "If we let the AI train itself using standard methods (like a student studying for a test), will it naturally figure out how to split its attention and explore multiple paths, or will it get stuck picking just one?"

4. The Discovery: The "Goldilocks" Balance

The paper reveals that the AI does learn this naturally, but it happens in two distinct stages, governed by a specific "tension" inside the model.

Think of the AI's decision-making process as a hiking guide trying to lead a group through the maze. The guide has a "confidence meter" (called the Index-Matching Logit).

  • Stage 1: The Exploration Phase (Thought Generation)

    • At first, the guide is unsure. If the confidence meter is too low, the guide is too timid and wanders randomly.
    • If the confidence meter gets too high, the guide becomes a "know-it-all." They point at one path and say, "This is definitely it!" and ignore all other possibilities. This is dangerous because they might be wrong.
    • The Magic: The paper proves that during training, the AI learns to keep this confidence meter in a "Goldilocks Zone." It stays bounded (not too low, not too high).
    • Why this matters: Because the confidence is "just right," the guide doesn't commit to just one path. Instead, they say, "Okay, paths A, B, and C all look plausible. Let's send a scout down all three." This is the Superposition. The AI keeps multiple options alive in its mind.
  • Stage 2: The Decision Phase (Prediction)

    • Once the scouts have explored the maze and found the exit, the AI needs to pick the winner.
    • The paper shows that the AI learns to combine the "scout reports" (the superposition) with the "goal markers" (the two possible answers). It learns to boost the score of the correct answer until it stands out clearly, allowing it to give the final answer with high confidence.
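Both stages can be caricatured in a few lines (a loose sketch under simplifying assumptions, not the paper's actual two-layer transformer). Stage 1 spreads probability mass over every outgoing edge, which is what bounded-logit attention achieves; stage 2 scores the two candidate answers against the resulting superposition:

```python
# Toy DAG: "c" is reachable from "s" in two hops; "d" is not reachable at all.
edges = {"s": ["a", "b"], "a": ["c"], "b": [], "c": [], "d": []}

def expand(frontier, steps):
    """Stage 1: spread mass over ALL successors (superposition)
    instead of committing to one branch."""
    for _ in range(steps):
        nxt = dict.fromkeys(edges, 0.0)
        for node, mass in frontier.items():
            succs = edges[node]
            if succs:
                for s in succs:          # uniform spread = bounded-logit attention
                    nxt[s] += mass / len(succs)
            else:
                nxt[node] += mass        # dead ends keep their mass
        frontier = nxt
    return frontier

def decide(frontier, candidates):
    """Stage 2: boost whichever candidate overlaps the superposition."""
    scores = {cand: frontier[cand] for cand in candidates}
    return max(scores, key=scores.get), scores

frontier = expand({"s": 1.0}, steps=2)
answer, scores = decide(frontier, ["c", "d"])
print(answer, scores)  # "c" receives mass (0.5); the unreachable "d" gets none
```

The key point of the sketch: because Stage 1 never collapses the frontier, the correct candidate is guaranteed to carry mass by the time Stage 2 compares the two answers.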

5. The Proof: Watching the Growth

The researchers didn't just do math; they watched the AI train in real time.

  • They tracked the "confidence meter" (the logit).
  • What they saw: In the old "discrete" methods, the confidence meter would shoot up to infinity (the AI gets overconfident and rigid).
  • What happened here: In the "Continuous Thought" method, the meter grew, hit a ceiling, and stabilized. It stayed in that perfect "Goldilocks" zone, allowing the AI to keep exploring multiple paths until it was sure.
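The qualitative difference shows up directly in the softmax (a numeric illustration with made-up logit values, not the paper's measurements): with a moderate logit, the model keeps meaningful weight on every plausible branch; if the logit grows without bound, the distribution collapses onto a single branch and exploration dies.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def branch_weights(c, eps=0.05):
    # Three plausible branches, one barely favored by noise; the
    # index-matching logit c scales how strongly that tiny edge matters.
    logits = [c * (1 + eps), c, c]
    return softmax(logits)

w_bounded = branch_weights(2.0)    # "Goldilocks" zone: weight stays spread out
w_huge = branch_weights(200.0)     # unbounded logit: collapses onto one branch

print([round(x, 3) for x in w_bounded])  # ~[0.356, 0.322, 0.322]
print([round(x, 3) for x in w_huge])     # ~[1.0, 0.0, 0.0]
```

With a bounded logit, a tiny (possibly wrong) preference barely perturbs the spread; with an exploding logit, that same tiny preference is amplified into total commitment, which is the overconfident, rigid behavior seen in the discrete case.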

Summary

This paper explains why a new type of AI reasoning works so well.

  • Old Way: The AI is a single hiker who gets stuck if they pick the wrong path.
  • New Way: The AI learns to be a swarm of hikers.
  • The Secret: It learns to keep its confidence "just right." Not so low that it's confused, and not so high that it stops listening to other possibilities. This balanced state allows it to hold many ideas in its head at once (Superposition), making it much better at solving complex puzzles like mazes, math problems, or logic riddles.

The authors conclude that this "balanced exploration" is the key mechanism that allows these models to scale up and solve harder problems without needing to be manually programmed. They just need to be trained correctly, and the superpower emerges naturally.
