Competing nonlinearities, criticality, and… — Plain-Language Explanation

Imagine a deep neural network as a massive, multi-story building where information (like a message or a signal) travels from the ground floor to the roof. For the building to work, the message needs to arrive at the top with the same strength it started with. If it gets too weak, it disappears; if it gets too loud, it distorts into noise.

For years, scientists have struggled with a "Goldilocks" problem: finding the perfect activation function (the rule neurons use to process information) that keeps the signal just right.

Here is the simple breakdown of what this paper discovered:

1. The Problem: The Signal Either Dies or Explodes

Think of the signal traveling through the network like a whisper passed down a long line of people.

The "Too Quiet" Team (Tanh): Some activation functions are like people who whisper so softly that by the time the message reaches the 10th floor, it's inaudible. The signal collapses.
The "Too Loud" Team (Swish): Other functions are like people who shout the message, causing it to get louder and louder with every floor until it's a deafening roar. The signal explodes.
The "Perfect" Team (ReLU): There is one famous function called ReLU that keeps the volume perfectly steady. However, it has a catch: it's "jagged" or "sharp" at the center. Imagine a staircase with a sharp, jagged edge. While it keeps the volume right, that sharp edge makes it a poor fit for certain advanced tools (like curvature-based optimization methods) that require a perfectly smooth surface.

2. The New Idea: A Random Mix of Neighbors

The authors asked: Can we get the perfect volume of ReLU without the jagged edge?

Instead of forcing every single neuron in the building to use the same rule, they proposed a statistical mixture. Imagine a building where, at the start, every single person (neuron) flips a coin:

If it's Heads, they use the "Too Quiet" rule (Tanh).
If it's Tails, they use the "Too Loud" rule (Swish).

Crucially, once they pick a rule, they stick with it forever. They don't switch back and forth.

3. The Magic Switch (The Critical Point)

The paper shows that by adjusting the mixing fraction ( $p$ )—essentially changing the odds of the coin flip—you can find a "sweet spot."

If you have too many "Quiet" people, the signal dies.
If you have too many "Loud" people, the signal explodes.
But at a specific, precise ratio (around 17% Quiet and 83% Loud in their experiment), something magical happens.

At this specific "critical point," the quiet people cancel out the loud people's tendency to explode, and the loud people cancel out the quiet people's tendency to die. The result? The signal travels through the entire building with perfect, steady volume, just like the jagged ReLU. And because everyone is using smooth rules (Tanh and Swish), the whole system remains smooth.

4. Why This Matters: The "Regularizer" Effect

The paper also found a surprising bonus. Because the neurons are "frozen" into their random choices (some quiet, some loud), it creates a kind of structural disorder.

Imagine trying to memorize a list of nonsense words. If everyone in the group is identical, they can easily coordinate to memorize the nonsense perfectly. But if half the group is naturally quiet and half is naturally loud, they can't coordinate as easily to memorize the nonsense. They are forced to focus on the real patterns instead.

The authors tested this by giving the network "corrupted" data (wrong labels). They found that networks using this random mix were much better at ignoring the garbage data and learning the real patterns, acting like a built-in shield against overfitting.

5. The Bottom Line

The paper claims that by randomly mixing two different types of smooth activation functions, you can:

Create a network that is critically balanced (signals don't die or explode).
Keep the network smooth (unlike the jagged ReLU), allowing for better mathematical tools.
Make the network more robust against learning from bad data.

They call this a "phase transition," similar to how water turns to ice at a specific temperature. In this case, the "temperature" is the mixing ratio, and the "ice" is a perfectly balanced, smooth, and robust neural network.

Technical Summary: Competing Nonlinearities, Criticality, and Order-to-Chaos Transition in Deep Networks

Problem Statement
Deep neural networks rely on nonlinear activation functions to achieve expressive power, yet the propagation of signals and gradients through deep architectures is governed by the choice of these activations. In the infinite-width limit, the variance of preactivations follows a deterministic recursion. This recursion partitions activation functions into distinct "universality classes" based on the stability of their fixed points ( $K_\star$ ):

Scale-invariant (e.g., ReLU): $K_\star = 0$ is a fixed point with an exactly linear kernel recursion, ensuring criticality (depth-independent variance) at any input scale once the weight variance is tuned. However, ReLU is non-smooth (non-differentiable at $z=0$ ), rendering it unsuitable for curvature-based optimizers, physics-informed networks, and neural-network quantum states that require well-defined Hessians.
Half-stable (e.g., Swish, GELU): $K_\star = 0$ is unstable, and variance flows to a finite, stable fixed point $K_\star > 0$ . While these are smooth, they introduce a characteristic length scale and are sensitive to initialization.
Stable (e.g., Tanh, Sin): $K_\star = 0$ is a stable fixed point, causing variance to decay algebraically ( $K^{(l)} \sim 1/l$ ) with depth, leading to signal attenuation.
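
The three behaviours above can be reproduced with a few lines of numerics. The following is a minimal sketch, not the authors' code: it assumes zero bias variance, the recursion $K^{(l+1)} = C_W\,\mathbb{E}[\sigma(z)^2]$ with $z \sim \mathcal{N}(0, K^{(l)})$, and a weight variance $C_W$ tuned per activation ( $1/\sigma'(0)^2$ for Tanh and Swish, 2 for ReLU). The helper names `gauss_expect` and `variance_profile` are illustrative.

```python
import math

def gauss_expect(f, K, n=401, span=8.0):
    """E[f(z)] for z ~ N(0, K), via the trapezoidal rule on +/- span standard deviations."""
    s = math.sqrt(K)
    h = 2.0 * span / (n - 1)
    total = 0.0
    for i in range(n):
        u = -span + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f(s * u) * math.exp(-0.5 * u * u)
    return total * h / math.sqrt(2.0 * math.pi)

def swish(z):
    """z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def relu(z):
    return z if z > 0.0 else 0.0

# (activation, C_W). Assumed tuning: C_W = 1 / sigma'(0)^2 for tanh and Swish, 2 for ReLU.
ACTIVATIONS = {"tanh": (math.tanh, 1.0), "swish": (swish, 4.0), "relu": (relu, 2.0)}

def variance_profile(name, K0=1.0, depth=20):
    """Iterate K^(l+1) = C_W * E[sigma(z)^2], z ~ N(0, K^(l)), with zero bias variance."""
    sigma, c_w = ACTIVATIONS[name]
    Ks = [K0]
    for _ in range(depth):
        Ks.append(c_w * gauss_expect(lambda z: sigma(z) ** 2, Ks[-1]))
    return Ks

profiles = {name: variance_profile(name) for name in ACTIVATIONS}
for name, Ks in profiles.items():
    print(f"{name:5s}  K(0)={Ks[0]:.3g}  K(10)={Ks[10]:.3g}  K(20)={Ks[20]:.3g}")
```

Under these assumptions the Tanh profile shrinks with depth, the ReLU profile stays flat, and the Swish profile grows. With this particular choice of $C_W$ the Swish variance keeps growing rather than settling, so the finite fixed point described above would need a different tuning.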

The central open problem addressed is whether these discrete universality classes can be bridged continuously. Specifically, can one tune a single parameter to transition between a variance-collapsing phase and a variance-inflating phase to achieve a critical point that is both scale-invariant and smooth?

Methodology
The authors propose a framework based on statistical mixtures of activation functions. Unlike deterministic mixtures where every neuron applies a weighted sum $\sigma(z) = p\sigma_1(z) + (1-p)\sigma_2(z)$ , this approach assigns each neuron independently and randomly to one of two activation functions, $\sigma_1$ or $\sigma_2$ , with probabilities $p$ and $1-p$ . This assignment is "quenched" (fixed at initialization).
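
As an illustration of what "quenched" means in practice, the sketch below draws the per-neuron assignment once and then reuses it for every input. It is a hypothetical minimal example: the function names are invented here, and taking $\sigma_1$ to be Swish (so that $p$ is the Swish fraction) is an assumption made for concreteness.

```python
import math
import random

def swish(z):
    """z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def sample_assignment(width, p, rng):
    """Quenched mask: True means the neuron uses sigma_1, False means sigma_2."""
    return [rng.random() < p for _ in range(width)]

def apply_mixture(preacts, mask, sigma_1=swish, sigma_2=math.tanh):
    """Apply each neuron's own fixed activation to its preactivation."""
    return [sigma_1(z) if m else sigma_2(z) for z, m in zip(preacts, mask)]

rng = random.Random(0)
width, p = 10_000, 0.83

# Drawn once at initialization and never resampled (the "quenched" part).
mask = sample_assignment(width, p, rng)

# Two different inputs pass through the same frozen assignment.
x1 = [rng.gauss(0.0, 1.0) for _ in range(width)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(width)]
h1 = apply_mixture(x1, mask)
h2 = apply_mixture(x2, mask)

frac = sum(mask) / width
print(f"fraction of sigma_1 neurons: {frac:.3f}")
```

A deterministic mixture would instead compute `p * sigma_1(z) + (1 - p) * sigma_2(z)` in every neuron, so no mask would be needed and all neurons would stay identical.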

In the infinite-width limit, self-averaging ensures that the effective kernel function $g(K)$ becomes a strict linear interpolation of the pure-component kernels:
$g^{(mix)}(K) = p g^{(\sigma_1)}(K) + (1-p) g^{(\sigma_2)}(K)$
This linearity allows the mixing fraction $p$ to serve as an analytically transparent control parameter. The authors derive the stability coefficient $a_1$ (governing the approach to the fixed point) for the mixture and identify the critical mixing fraction $p_c$ where $a_1^{(mix)}(p_c) = 0$ . This condition corresponds to a phase transition where the network becomes statistically scale-invariant.
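
The criticality condition can be worked out explicitly from this linearity. The short derivation below is a sketch under stated assumptions: each pure kernel is expanded at small variance with coefficients written $g_1^{(\sigma)}$ and $g_2^{(\sigma)}$, and $a_1$ is taken to be proportional to the quadratic coefficient. Both the expansion and that identification are assumptions of this sketch rather than statements quoted from the paper.

$g^{(\sigma)}(K) = g_1^{(\sigma)} K + g_2^{(\sigma)} K^2 + O(K^3)$
$g^{(mix)}(K) = \left[ p\, g_1^{(\sigma_1)} + (1-p)\, g_1^{(\sigma_2)} \right] K + \left[ p\, g_2^{(\sigma_1)} + (1-p)\, g_2^{(\sigma_2)} \right] K^2 + O(K^3)$
$p_c\, g_2^{(\sigma_1)} + (1-p_c)\, g_2^{(\sigma_2)} = 0 \quad \Longrightarrow \quad p_c = \frac{g_2^{(\sigma_2)}}{g_2^{(\sigma_2)} - g_2^{(\sigma_1)}}$

A root with $0 < p_c < 1$ exists only when the two quadratic coefficients have opposite signs, which is why an activation with $a_1 < 0$ has to be paired with one with $a_1 > 0$ .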

The study focuses on a specific pairing: Tanh (stable class, $a_1 < 0$ ) and Swish (half-stable class, $a_1 > 0$ ). The authors analytically predict $p_c$ in the small-variance limit and perturbatively for finite input variance. They corroborate these predictions using three numerical diagnostics:

Variance Propagation: Tracking the evolution of preactivation variance $K^{(l)}$ with depth.
Susceptibilities: Measuring parallel ( $\chi_\parallel$ ) and perpendicular ( $\chi_\perp$ ) susceptibilities to detect the preservation of signal scale and sensitivity to input perturbations.
Lyapunov Exponents: Calculating the maximal Lyapunov exponent $\lambda$ to diagnose the order-to-chaos transition ( $\lambda < 0$ for ordered, $\lambda > 0$ for chaotic, $\lambda = 0$ for critical).
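
The two susceptibilities can be estimated directly from the kernel map. The sketch below uses the standard mean-field definitions $\chi_\parallel = dg/dK$ and $\chi_\perp = C_W\,\mathbb{E}[\sigma'(z)^2]$ , which are assumed here and may differ in normalization from the paper's. The shared weight variance $C_W$ , the reading of $p$ as the Swish fraction, and all function names are likewise assumptions of this example.

```python
import math

def gauss_expect(f, K, n=401, span=8.0):
    """E[f(z)] for z ~ N(0, K), via the trapezoidal rule on +/- span standard deviations."""
    s = math.sqrt(K)
    h = 2.0 * span / (n - 1)
    total = 0.0
    for i in range(n):
        u = -span + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f(s * u) * math.exp(-0.5 * u * u)
    return total * h / math.sqrt(2.0 * math.pi)

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z):
    return z * sigmoid(z)

def d_swish(z):
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def kernel_map(components, c_w, K):
    """g(K) = C_W * sum_i w_i * E[sigma_i(z)^2]; components = [(w_i, sigma_i, sigma_i')]."""
    return c_w * sum(w * gauss_expect(lambda z, s=s: s(z) ** 2, K) for w, s, _ in components)

def chi_parallel(components, c_w, K, eps=1e-4):
    """dg/dK by a central finite difference (requires K > eps)."""
    up = kernel_map(components, c_w, K + eps)
    down = kernel_map(components, c_w, K - eps)
    return (up - down) / (2.0 * eps)

def chi_perp(components, c_w, K):
    """C_W * sum_i w_i * E[sigma_i'(z)^2]."""
    return c_w * sum(w * gauss_expect(lambda z, d=d: d(z) ** 2, K) for w, _, d in components)

p = 0.83                                   # Swish fraction (assumed convention)
mix = [(p, swish, d_swish), (1.0 - p, math.tanh, d_tanh)]
c_w = 1.0 / (p * 0.25 + (1.0 - p))         # shared C_W giving unit gain as K -> 0
K = 1.0
print(f"chi_par={chi_parallel(mix, c_w, K):.4f}  chi_perp={chi_perp(mix, c_w, K):.4f}")
```

Values of both susceptibilities close to 1 indicate that signal scale and small input perturbations are preserved from layer to layer. The logarithm of $\chi_\perp$ is a common mean-field proxy for a per-layer Lyapunov exponent, though the paper may compute $\lambda$ differently.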

Key Results

Analytical Prediction: For the Tanh/Swish mixture, with $p$ the fraction of Swish neurons (the convention implied by the phases listed below), the critical mixing fraction is derived as $p_c = \frac{g_2^{(Tanh)}}{g_2^{(Tanh)} - g_2^{(Swish)}}$ . In the small-variance limit, this yields $p_c \approx 0.91$ . Perturbative analysis shows that finite input variance shifts this value downward.
Phase Transition: Numerical simulations confirm a sharp phase transition at $p_c \approx 0.83$ (for unit input variance).
- For $p < p_c$ , the network is in a variance-collapsing phase (Tanh-dominated), where $K^{(l)}$ decays algebraically.
- For $p > p_c$ , the network is in a variance-inflating phase (Swish-dominated), where $K^{(l)}$ grows.
- At $p \approx p_c$ , the network exhibits emergent statistical scale invariance: variance remains depth-independent, mimicking ReLU's behavior but composed entirely of smooth, differentiable neurons.
Finite-Size Scaling: The transition sharpens with network depth $L$ , exhibiting finite-size scaling with a critical exponent $\nu = 1$ , consistent with a mean-field continuous phase transition.
Learning Performance: Training multilayer perceptrons (MLPs) on MNIST and Fashion-MNIST reveals non-monotonic test performance as a function of $p$ . The optimal test accuracy occurs near the theoretically predicted $p_c$ , demonstrating that the initialization-level transition directly impacts learned representations. Pure Tanh and pure Swish networks underperform compared to the critical mixture.
Implicit Regularization: In overparameterized networks with corrupted labels, the quenched disorder acts as an implicit regularizer. The mixture suppresses the memorization of noise (favored by Tanh's saturation) while preserving the capacity to learn genuine structure (favored by Swish's gradient flow). The quenched assignment breaks the permutation symmetry that homogeneous networks exploit to memorize spurious associations.
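
The small-variance prediction and the two phases can be checked numerically. This is a sketch under assumptions that are not confirmed by the source: $g_2^{(\sigma)}$ is taken to be the $K^2$ coefficient of $\mathbb{E}[\sigma(z)^2]$ , both neuron types share one weight variance $C_W$ chosen to give unit gain as $K \to 0$ , biases are zero, and $p$ is the Swish fraction. With these choices the Taylor coefficients are $-2$ for Tanh and $3/16$ for Swish, so the formula gives $32/35 \approx 0.914$ , in line with the quoted small-variance value.

```python
import math

def gauss_expect(f, K, n=401, span=8.0):
    """E[f(z)] for z ~ N(0, K), via the trapezoidal rule on +/- span standard deviations."""
    s = math.sqrt(K)
    h = 2.0 * span / (n - 1)
    total = 0.0
    for i in range(n):
        u = -span + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f(s * u) * math.exp(-0.5 * u * u)
    return total * h / math.sqrt(2.0 * math.pi)

def swish(z):
    """z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def second_moment(sigma, K):
    return gauss_expect(lambda z: sigma(z) ** 2, K)

def quad_coeff(sigma, slope0, K=1e-3):
    """Estimate g_2 in E[sigma(z)^2] = slope0^2 * K + g_2 * K^2 + O(K^3)."""
    return (second_moment(sigma, K) - slope0 ** 2 * K) / K ** 2

g2_tanh = quad_coeff(math.tanh, 1.0)   # analytic value: -2
g2_swish = quad_coeff(swish, 0.5)      # analytic value: 3/16
p_c_small = g2_tanh / (g2_tanh - g2_swish)
print(f"small-variance p_c ~ {p_c_small:.4f}")

def mixture_profile(p, K0=1.0, depth=30):
    """Variance recursion for a Swish fraction p and a Tanh fraction 1 - p."""
    c_w = 1.0 / (p * 0.25 + (1.0 - p))  # assumed shared C_W: unit gain as K -> 0
    Ks = [K0]
    for _ in range(depth):
        K = Ks[-1]
        Ks.append(c_w * (p * second_moment(swish, K)
                         + (1.0 - p) * second_moment(math.tanh, K)))
    return Ks

collapsing = mixture_profile(0.50)   # well below p_c: Tanh-dominated
inflating = mixture_profile(0.95)    # well above p_c: Swish-dominated
print(f"p=0.50: K(30)={collapsing[-1]:.3g}   p=0.95: K(30)={inflating[-1]:.3g}")
```

The two mixing fractions in the demo are chosen far from the transition so that the collapse and the growth are visible after only 30 layers.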

Significance and Claims
The paper establishes statistical activation mixtures as a controlled, analytically tractable tool for navigating the phase diagram of deep network universality classes. Its primary significance lies in resolving a longstanding tension: achieving scale-invariant propagation (criticality) without sacrificing smoothness.

Theoretical Contribution: It demonstrates that universality classes, previously viewed as discrete labels, are connected by a continuous family of statistical mixtures. The transition is analogous to measurement-induced phase transitions (MIPTs) in quantum circuits, driven by competing local operations with opposing tendencies.
Practical Utility: The framework offers a label-free, forward-pass-only protocol for selecting activation architectures. By estimating $p_c$ via the flattest variance profile or analytical formulas, practitioners can avoid expensive hyperparameter searches.
Domain Applicability: The ability to construct a critical, $C^\infty$ -smooth network is immediately actionable for domains requiring higher-order derivatives, such as natural-gradient optimizers, physics-informed neural networks (solving PDEs), and neural-network quantum states, where ReLU is ill-suited.
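
A flattest-profile estimate of $p_c$ needs only forward variance propagation. The sketch below is one hypothetical way to set it up, not the paper's protocol. It uses the infinite-width recursion in place of an actual forward pass and scores each candidate $p$ by the largest relative deviation of $K^{(l)}$ from its input value. The shared weight variance, zero biases, the reading of $p$ as the Swish fraction, and the function names are all assumptions.

```python
import math

def gauss_expect(f, K, n=401, span=8.0):
    """E[f(z)] for z ~ N(0, K), via the trapezoidal rule on +/- span standard deviations."""
    s = math.sqrt(K)
    h = 2.0 * span / (n - 1)
    total = 0.0
    for i in range(n):
        u = -span + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f(s * u) * math.exp(-0.5 * u * u)
    return total * h / math.sqrt(2.0 * math.pi)

def swish(z):
    """z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def mixture_profile(p, K0=1.0, depth=20):
    """Variance recursion for a Swish fraction p and a Tanh fraction 1 - p."""
    c_w = 1.0 / (p * 0.25 + (1.0 - p))  # assumed shared C_W: unit gain as K -> 0
    Ks = [K0]
    for _ in range(depth):
        K = Ks[-1]
        m_swish = gauss_expect(lambda z: swish(z) ** 2, K)
        m_tanh = gauss_expect(lambda z: math.tanh(z) ** 2, K)
        Ks.append(c_w * (p * m_swish + (1.0 - p) * m_tanh))
    return Ks

def flatness(p, K0=1.0, depth=20):
    """Largest relative deviation of the variance from its input value (0 = perfectly flat)."""
    return max(abs(K - K0) for K in mixture_profile(p, K0, depth)) / K0

# Scan candidate Swish fractions and keep the one with the flattest profile.
grid = [0.50 + 0.01 * i for i in range(51)]
p_hat = min(grid, key=flatness)
print(f"flattest profile at p ~ {p_hat:.2f}")
```

Because the weight-variance convention and the flatness score are choices made for this sketch, its estimate need not coincide with the reported $p_c \approx 0.83$ . A finite-width version would replace `mixture_profile` with measured preactivation variances from a randomly initialized network.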

The authors conclude that this approach provides a new mechanism for order-to-chaos transitions in deep learning, where the "quenched disorder" of activation assignments serves both as a structural regularizer and a means to engineer criticality.
