Competing nonlinearities, criticality, and order-to-chaos transition in deep networks

This paper shows that statistically mixing activation functions (e.g., Tanh and Swish) produces a controllable, smooth phase transition to criticality at a specific mixing fraction. This resolves the long-standing trade-off between scale-invariant signal propagation and differentiability, and it improves generalization and training performance.

Original authors: Omri Lesser, Debanjan Chowdhury

Published 2026-05-08

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a deep neural network as a massive, multi-story building where information (like a message or a signal) travels from the ground floor to the roof. For the building to work, the message needs to arrive at the top with the same strength it started with. If it gets too weak, it disappears; if it gets too loud, it distorts into noise.

For years, scientists have struggled with a "Goldilocks" problem: finding the perfect activation function (the rule neurons use to process information) that keeps the signal just right.

Here is the simple breakdown of what this paper discovered:

1. The Problem: The Signal Either Dies or Explodes

Think of the signal traveling through the network like a whisper passed down a long line of people.

  • The "Too Quiet" Team (Tanh): Some activation functions are like people who whisper so softly that by the time the message reaches the 10th floor, it's inaudible. The signal collapses.
  • The "Too Loud" Team (Swish): Other functions are like people who shout the message, causing it to get louder and louder with every floor until it's a deafening roar. The signal explodes.
  • The "Perfect" Team (ReLU): One famous function, ReLU, keeps the volume perfectly steady. The catch is that it has a sharp corner at zero, like a staircase with a jagged edge, so it is not smooth (differentiable) there. That corner rules out certain advanced tools, such as curvature-based optimization methods, that require a perfectly smooth surface.
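The three behaviors above can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's setup: it assumes a bias-free, fully connected network with Gaussian random weights, and the helper name `signal_strength` and the weight scales `sigma_w` are hypothetical choices made for illustration. It tracks the mean squared signal ("volume") layer by layer.

```python
import numpy as np

def swish(x):
    # x * sigmoid(x), written with tanh for numerical stability
    return x * 0.5 * (1.0 + np.tanh(0.5 * x))

def relu(x):
    return np.maximum(x, 0.0)

def signal_strength(act, sigma_w, depth=30, width=1000, seed=0):
    """Mean squared pre-activation after each layer of a random, bias-free net."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)  # random input with unit "volume"
    strengths = []
    for _ in range(depth):
        # Gaussian weights with variance sigma_w**2 / width (assumed init)
        W = rng.standard_normal((width, width)) * (sigma_w / np.sqrt(width))
        h = W @ act(h)
        strengths.append(float(np.mean(h ** 2)))
    return strengths

# Assumption: sigma_w is chosen so that a very small signal passes each layer
# unchanged (slopes at zero: tanh 1, swish 1/2); ReLU uses the common sqrt(2).
q_tanh = signal_strength(np.tanh, sigma_w=1.0)
q_swish = signal_strength(swish, sigma_w=2.0)
q_relu = signal_strength(relu, sigma_w=np.sqrt(2.0))

for name, q in [("tanh", q_tanh), ("swish", q_swish), ("relu", q_relu)]:
    print(f"{name:5s} first layer: {q[0]:.3g}   last layer: {q[-1]:.3g}")
```

Under these assumed scales, the Tanh signal shrinks with depth, the Swish signal grows, and the ReLU signal stays at roughly the same order of magnitude. Other initializations can behave differently, so this should be read as an illustration of the analogy rather than a reproduction of the paper's analysis.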

2. The New Idea: A Random Mix of Neighbors

The authors asked: Can we get the perfect volume of ReLU without the jagged edge?

Instead of forcing every single neuron in the building to use the same rule, they proposed a statistical mixture. Imagine a building where, at the start, every single person (neuron) flips a coin:

  • If it's Heads, they use the "Too Quiet" rule (Tanh).
  • If it's Tails, they use the "Too Loud" rule (Swish).

Crucially, once they pick a rule, they stick with it forever. They don't switch back and forth.
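The "flip once, then stick with it" rule can be sketched as a fixed per-neuron mask. This is a minimal illustration under stated assumptions: the function name `make_mixed_activation` is hypothetical, and treating `p` as the probability of picking Tanh is a convention chosen here, not one confirmed by the paper.

```python
import numpy as np

def swish(x):
    # x * sigmoid(x), written with tanh for numerical stability
    return x * 0.5 * (1.0 + np.tanh(0.5 * x))

def make_mixed_activation(width, p, rng):
    """Assign each neuron tanh (probability p) or swish, once and for all."""
    use_tanh = rng.random(width) < p  # the "coin flips", drawn a single time

    def act(x):
        # x has the neuron index on its last axis; the mask never changes
        return np.where(use_tanh, np.tanh(x), swish(x))

    return act, use_tanh

rng = np.random.default_rng(0)
act, use_tanh = make_mixed_activation(width=8, p=0.5, rng=rng)
x = np.linspace(-2.0, 2.0, 8)
print(use_tanh)  # which neurons ended up on the Tanh team
print(act(x))    # the same mask is applied on every call
```

Because the mask is created once and captured by `act`, every later call sees the same assignment. That is what distinguishes this fixed mixture from a scheme that re-draws the choice on each forward pass.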

3. The Magic Switch (The Critical Point)

The paper shows that by adjusting the mixing fraction (p), essentially the odds of the coin flip, you can find a "sweet spot."

  • If you have mostly "Quiet" people, the signal dies.
  • If you have mostly "Loud" people, the signal explodes.
  • But at a specific, precise ratio (around 83% Quiet and 17% Loud in their experiment), something magical happens.

At this specific "critical point," the two tendencies cancel: the quiet people offset the loud people's tendency to explode, and the loud people offset the quiet people's tendency to die out. The result is a signal that travels through the entire building at a steady volume, just as with the jagged ReLU. Because everyone is using smooth rules (Tanh and Swish), the whole system remains smooth and gentle.
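One way to get a feel for this tuning is to scan the mixing fraction in a toy random network and watch the signal at the final layer. The sketch below is an illustration under assumptions, not the paper's procedure: `final_strength` is a hypothetical helper, `p` is taken to be the Tanh fraction, biases are omitted, and the weight scale is an assumed choice that lets very small signals pass unchanged.

```python
import numpy as np

def swish(x):
    # x * sigmoid(x), written with tanh for numerical stability
    return x * 0.5 * (1.0 + np.tanh(0.5 * x))

def final_strength(p, depth=20, width=400, seed=0):
    """Mean squared signal at the last layer for Tanh fraction p (toy model)."""
    rng = np.random.default_rng(seed)
    # Assumed init: compensate the mixture's small-signal gain
    # (tanh slope 1, swish slope 1/2, so squared gains 1 and 1/4).
    sigma_w = 1.0 / np.sqrt(p + (1.0 - p) / 4.0)
    h = rng.standard_normal(width)
    for _ in range(depth):
        # Drawn once per layer here; a trained network would keep these fixed.
        use_tanh = rng.random(width) < p
        a = np.where(use_tanh, np.tanh(h), swish(h))
        W = rng.standard_normal((width, width)) * (sigma_w / np.sqrt(width))
        h = W @ a
    return float(np.mean(h ** 2))

ps = [0.0, 0.25, 0.5, 0.75, 1.0]
strengths = [final_strength(p) for p in ps]
for p, s in zip(ps, strengths):
    print(f"tanh fraction p = {p:.2f}   final signal strength = {s:.3g}")
```

In this toy, the all-Swish end of the scan ends with a much larger signal than the all-Tanh end, so some intermediate fraction balances the two. Where that balance falls depends on the assumed initialization and depth, and it should not be expected to match the ratio reported in the paper.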

4. Why This Matters: The "Regularizer" Effect

The paper also found a surprising bonus. Because the neurons are "frozen" into their random choices (some quiet, some loud), it creates a kind of structural disorder.

Imagine trying to memorize a list of nonsense words. If everyone in the group is identical, they can easily coordinate to memorize the nonsense perfectly. But if half the group is naturally quiet and half is naturally loud, they can't coordinate as easily to memorize the nonsense. They are forced to focus on the real patterns instead.

The authors tested this by giving the network "corrupted" data (wrong labels). They found that networks using this random mix were much better at ignoring the garbage data and learning the real patterns, acting like a built-in shield against overfitting.
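This summary does not describe the exact corruption protocol, so the following is only a sketch of one common way to build such a test: replace the labels of a randomly chosen fraction of examples with a different class. The function name `corrupt_labels` and its parameters are hypothetical.

```python
import numpy as np

def corrupt_labels(labels, fraction, num_classes, rng):
    """Return a copy of labels with a given fraction replaced by wrong classes."""
    noisy = np.asarray(labels).copy()
    n = noisy.shape[0]
    k = int(round(fraction * n))
    flipped = rng.choice(n, size=k, replace=False)  # examples to corrupt
    # A nonzero offset modulo num_classes guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=k)
    noisy[flipped] = (noisy[flipped] + offsets) % num_classes
    return noisy, flipped

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)  # stand-in labels for a 10-class task
noisy, flipped = corrupt_labels(y, fraction=0.3, num_classes=10, rng=rng)
print("labels changed:", int((noisy != y).sum()))
```

A network that resists overfitting would be expected to fit the corrupted examples less readily while still learning from the clean ones. Comparing accuracy on clean held-out data across mixing fractions is one way such an effect could be probed.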

5. The Bottom Line

The paper claims that by randomly mixing two different types of smooth activation functions, you can:

  1. Create a network that is critically balanced (signals don't die or explode).
  2. Keep the network smooth (unlike the jagged ReLU), allowing for better mathematical tools.
  3. Make the network more robust against learning from bad data.

They call this a "phase transition," similar to how water turns to ice at a specific temperature. In this case, the "temperature" is the mixing ratio, and the "ice" is a perfectly balanced, smooth, and robust neural network.
