Multiple Descents in Deep Learning as a Sequence of… — Plain-Language Explanation

Original authors: Wenbo Wei, Fan Xu, Nicholas Chong Jia Le, Choy Heng Lai, Ling Feng

Published 2026-06-16

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: A Rollercoaster of Learning

Imagine you are teaching a robot to recognize whether a movie review is "good" or "bad." Usually, we expect the robot to improve with training until it hits a ceiling, after which more training makes it worse on reviews it has not seen before (a phenomenon known as overfitting).

However, this paper discovered something strange and exciting: The robot didn't just get better and then worse. It went through a wild rollercoaster ride.

After the robot seemed to have "learned enough," its performance didn't just slowly decline. Instead, it would get worse for a while, then suddenly jump to being much better, then get worse again, and jump up again. The researchers call this "Multiple Descents." It's like the robot is climbing a mountain, sliding down a bit, finding a hidden shortcut, and suddenly leaping to a higher peak, only to repeat the process several times.

The Secret Ingredient: Order vs. Chaos

Why does this happen? The authors looked inside the robot's "brain" (specifically a type of network called an LSTM) and found that these jumps happen exactly when the robot's internal state switches between two modes: Order and Chaos.

Think of the robot's internal thinking process like a crowd of people in a room:

Order: Everyone is marching in perfect lockstep. If you nudge one person, everyone else stays exactly the same. The system is stable, rigid, and predictable.
Chaos: Everyone is dancing wildly. If you nudge one person, the whole room goes into a frenzy. Small changes lead to huge, unpredictable differences.

The researchers found that the robot performs best when it is standing right on the edge between marching in lockstep and dancing wildly. This is called the "Edge of Chaos."

The Journey: One Big Leap, Then Many Small Jumps

The paper reveals a specific pattern in how the robot travels through these states:

The First Big Leap (The Best Moment):
At the very beginning of training, the robot is too rigid (too ordered). As training continues, it moves into the "Edge of Chaos" for the first time. This is the moment the robot performs its absolute best. It's like the robot finally found the perfect balance where it can explore new ideas without falling apart. This first transition zone is also the widest one, giving the robot plenty of room to find a good way to solve the problem.
The Rollercoaster (Multiple Descents):
After that first perfect moment, the robot keeps training. It gets too chaotic, performance drops, and then it snaps back to a new "Edge of Chaos." It does this over and over again. Each time it snaps back, performance jumps up again (a "descent" in error), but these jumps are usually not as good as that very first one.

The Analogy: Tuning a Radio

Imagine you are trying to tune an old-fashioned radio to find a clear station.

Ordered Phase: The radio is stuck on a frequency with no signal (static silence).
Chaotic Phase: The radio is spinning wildly, picking up every station at once (loud noise).
The Edge of Chaos: You find the sweet spot where the music is crystal clear.

The paper suggests that the first time you hit that sweet spot, the music is the clearest it will ever be. But if you keep turning the dial, you might hit other clear spots later on. However, those later spots are narrower and harder to find, and the music isn't quite as perfect as the first time.

What They Did to Find This

The researchers worked with a collection of 50,000 movie reviews, used to train and test the robot. They didn't just look at the final score; they watched the robot's "heartbeat" (its internal mathematical stability) at every single step of the training.

They used a physics trick: they gave the robot a tiny "nudge" (a small amount of noise) and watched what happened.

If the nudge died out quickly, the robot was in Order.
If the nudge grew into a giant wave, the robot was in Chaos.

They found that every time the robot's performance suddenly improved (the "descent"), the robot had just switched from a chaotic state back to a stable state, landing right on that "Edge of Chaos."

The Takeaway

The main discovery is that, for the LSTM model studied, the best time to stop training is often the very first time it hits that "Edge of Chaos."

While the model can keep finding new "sweet spots" later on (causing the performance to jump up and down), the very first time it finds that balance is usually the peak performance. The paper suggests that understanding these "Order-Chaos" transitions helps us see why deep learning models sometimes surprise us with sudden improvements after they seem to have failed.

Technical Summary: Multiple Descents in Deep Learning as a Sequence of Order-Chaos Transitions in LSTM Networks

Problem Statement
Deep learning training dynamics are complex, often characterized by phenomena such as overfitting, underfitting, and performance fluctuations. While the "double descent" phenomenon (in which test error first falls, rises again near the point where the model can fit the training data exactly, and then falls a second time as model complexity grows) has been studied, recent observations suggest more complex behaviors. This paper investigates a novel "multiple descents" phenomenon observed in Long Short-Term Memory (LSTM) networks during training on real-world tasks. Specifically, the authors observe that after a model becomes overtrained, the test loss does not merely plateau or increase monotonically; instead, it undergoes long cycles of increasing loss followed by sharp, abrupt declines. The central problem addressed is understanding the dynamical mechanism behind these multiple descents and identifying the conditions under which optimal model performance occurs.

Methodology
The study employs a combination of deep learning training and asymptotic stability analysis from dynamical systems theory.

Experimental Setup: The authors trained an LSTM network on the Large Movie Review Dataset (IMDb) for sentiment analysis. The model was over-trained for 1,000 epochs to induce overfitting and observe long-term dynamical behaviors. The architecture included an embedding layer (32 dimensions), an LSTM layer (60 units), and a fully connected output layer.
Asymptotic Stability Analysis: To characterize the internal state of the network, the authors treated the LSTM as a non-linear dynamical system. They employed a perturbation-based approach to measure the asymptotic stability of the output recurrent unit ( $h_t$ ).
- Procedure: For each training epoch, a test sample was processed through the network up to the input length (500 tokens). The input was then extended with zero vectors for an additional 1,100 steps (total 1,600 steps) to allow the system to evolve without external driving forces.
- Perturbation: A small Gaussian noise perturbation ( $\epsilon$ ) was added to the hidden state at step 500. The system was iterated to step 1,599 for both the original and perturbed states.
- Metrics: The asymptotic distance ( $D = |h'_{1599} - h_{1599}|$ ) was calculated. A distance converging to zero indicates an ordered phase (stability), while divergence indicates a chaotic phase.
- Additional Indicators: The study also utilized the reduced sum of the output vectors ( $h_{1599} \cdot \mathbf{1}$ ) to visualize bifurcation and calculated the Finite Time Lyapunov Exponent (FTLE) to confirm phase transitions.
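The perturbation procedure above can be sketched on a much simpler system. The block below is a minimal illustration, not the authors' code: it swaps the trained LSTM for a toy recurrent map $h_{t+1} = \tanh(g W h_t)$ with random weights and no input. The function name `asymptotic_distance`, the `gain` parameter $g$, the random initial state (standing in for the hidden state after the 500 input tokens), and the perturbation size `eps` are all assumptions made for illustration. Only the hidden size (60) and the 1,100 free-running steps are taken from the description above.

```python
import numpy as np

def asymptotic_distance(gain, n_units=60, free_steps=1100, eps=1e-6, seed=0):
    """Distance between a reference and a perturbed trajectory after free evolution.

    Toy stand-in (assumption): h_{t+1} = tanh(gain * W @ h_t), not an LSTM.
    """
    rng = np.random.default_rng(seed)
    # Random recurrent weights; 'gain' plays the role of the overall weight magnitude.
    W = rng.normal(0.0, 1.0 / np.sqrt(n_units), size=(n_units, n_units))
    # Stand-in for the hidden state reached after reading the real input.
    h = rng.uniform(-1.0, 1.0, size=n_units)
    # Small Gaussian nudge applied once, before the free evolution starts.
    h_pert = h + eps * rng.normal(size=n_units)
    # Evolve both copies with no external input (the zero-padding phase).
    for _ in range(free_steps):
        h = np.tanh(gain * W @ h)
        h_pert = np.tanh(gain * W @ h_pert)
    # D = |h' - h| at the final step.
    return float(np.linalg.norm(h_pert - h))

if __name__ == "__main__":
    for g in (0.3, 3.0):
        print(f"gain={g}: D={asymptotic_distance(g):.3e}")
```

With a small gain the map is a contraction, so the distance shrinks toward zero (the ordered case). With a large gain the nudge is usually amplified instead, although for a finite random network that outcome depends on the particular weight draw.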

Key Results

Multiple Descents and Phase Transitions: The experiments revealed that during the overfitting regime (epochs > 450), the test loss exhibits multiple cycles of increase followed by sharp drops. Each cycle corresponds to a transition between order and chaos.
- As the model enters a chaotic phase, the test loss increases and the asymptotic distance grows.
- A sharp drop in test loss coincides with a sudden transition from chaos back to order.
Optimal Performance at the First Transition: The global minimum test loss (best performance) consistently occurred at the first transition from order to chaos (observed around epoch 114 in the primary experiment). At this point, the "edge of chaos" (the transition region) was the widest, allowing for the most extensive exploration of weight configurations. Subsequent transitions yielded only local optima.
Analogy to 1D Maps: The phase diagram of the LSTM training dynamics phenomenologically resembles the bifurcation diagram of a one-dimensional $\tanh$ map. The first order-to-chaos transition is the widest (slowest), followed by narrower, faster transitions. This suggests that the increasing magnitude of weight matrices during stochastic gradient descent (SGD) acts similarly to increasing the control parameter in a 1D map, driving the system through a sequence of bifurcations.
Conditions for Occurrence: The multiple descents phenomenon was observed to be dependent on hyperparameters. It did not emerge when the learning rate was too small or the model size was too small (staying in the ordered phase) or when using SGD with very slow convergence (failing to traverse multiple phases). However, the relationship between loss drops and order-chaos transitions remained consistent even when full multiple descents were not visible.
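To make the 1D-map analogy concrete, here is a small sketch of how a single control parameter can carry a map from order into chaos and back into order. The exact form of the paper's $\tanh$ map is not given in this summary, so the block uses the textbook logistic map $x_{t+1} = r x_t (1 - x_t)$ as a plainly labeled stand-in. The function name `lyapunov_logistic`, the starting point, and the step counts are illustrative assumptions. The quantity computed is a finite-time Lyapunov exponent: the average of $\log |f'(x_t)|$ along the orbit, negative for order and positive for chaos.

```python
import math

def lyapunov_logistic(r, x0=0.2, burn_in=2000, steps=5000):
    """Finite-time Lyapunov exponent of x -> r*x*(1-x), averaged over `steps`.

    The logistic map is a stand-in (assumption), not the paper's tanh map.
    """
    x = x0
    # Discard the transient so the average is taken on the attractor.
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        slope = abs(r * (1.0 - 2.0 * x))        # |f'(x)| at the current point
        total += math.log(max(slope, 1e-300))   # guard against log(0)
        x = r * x * (1.0 - x)
    return total / steps

if __name__ == "__main__":
    # Increasing r: a stable fixed point, a periodic window, then chaos.
    for r in (2.8, 3.832, 3.9):
        lam = lyapunov_logistic(r)
        print(f"r={r}: exponent={lam:+.3f} ->", "chaotic" if lam > 0 else "ordered")
```

The value r = 3.832 lies in the logistic map's period-3 window, which sits beyond the first onset of chaos. It is a loose analogue of a system falling back into order after becoming chaotic as the control parameter keeps growing.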

Key Contributions

Identification of Multiple Descents: The paper documents a specific pattern of multiple performance descents in LSTM networks that extends beyond the traditional double descent framework.
Dynamical Systems Interpretation: It establishes a direct empirical link between these performance cycles and the asymptotic stability (order vs. chaos) of the network's internal dynamics.
Optimal Epoch Localization: The study identifies that the global optimum in training often coincides with the first entry into the chaotic regime (the widest "edge of chaos"), rather than the final state of the network.
Theoretical Parallel: It draws a plausible theoretical connection between high-dimensional neural network training and low-dimensional non-linear maps (specifically the $\tanh$ map), suggesting that the sequence of order-chaos transitions is driven by the growth of weight norms.

Significance and Claims
The authors claim that their findings offer a new perspective on when and why neural networks achieve peak performance, shifting the focus from static model properties (size, dataset) to the intrinsic dynamical trajectory of training.

Practical Implication: Understanding these transitions could lead to new training strategies, such as stopping training at the first order-to-chaos transition to maximize generalization, rather than relying solely on early stopping heuristics or regularization.
Theoretical Extension: The work extends the "edge of stability" and "edge of chaos" concepts, suggesting that while the edge of chaos is generally optimal, the first transition is uniquely significant for global optimality in recurrent networks.
Modesty and Limitations: The authors explicitly note that while the phenomenon resembles 1D maps, the high-dimensional nature of LSTMs may involve more complex interactions not fully captured by prime-numbered periodicities. They acknowledge that the phenomenon was not observed in all settings (e.g., specific learning rates or optimizers) and that the connection to the "edge of stability" (Hessian-based) requires further theoretical investigation. The paper does not claim the phenomenon is universal across all architectures but highlights it as a significant observation in recurrent networks that warrants further study.
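The stopping idea under Practical Implication could be prototyped roughly as follows. This is a hypothetical sketch rather than a procedure from the paper: it assumes an asymptotic distance $D$ has already been measured after every epoch, and it flags the first epoch where $D$ exceeds a small threshold. The helper name `first_chaos_epoch` and the threshold value are assumptions, and the numbers in the usage example are made up purely to exercise the function.

```python
from typing import Optional, Sequence

def first_chaos_epoch(distances: Sequence[float], threshold: float = 1e-3) -> Optional[int]:
    """Return the index of the first epoch whose asymptotic distance exceeds
    `threshold`, or None if the network stays ordered throughout.

    Hypothetical helper: the threshold is an illustrative choice, not from the paper.
    """
    for epoch, distance in enumerate(distances):
        if distance > threshold:
            return epoch
    return None

if __name__ == "__main__":
    # Made-up per-epoch distances: ordered, ordered, chaotic, ordered, chaotic.
    toy_distances = [0.0, 1e-9, 0.5, 0.0, 0.7]
    print("candidate stopping epoch:", first_chaos_epoch(toy_distances))
```

In practice such a flag would presumably be checked alongside validation loss rather than replacing it, since the summary notes that the phenomenon did not appear under every learning rate or optimizer.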
