Activation Function Design Sustains Plasticity in Continual Learning

This paper demonstrates that thoughtful activation function design can sustain model plasticity in continual learning. Its two proposed nonlinearities, Smooth-Leaky and Randomized Smooth-Leaky, serve as a lightweight, architecture-agnostic way to prevent loss of adaptability without requiring additional capacity or task-specific tuning.

Lute Lillo, Nick Cheney

Published 2026-03-02

Imagine you are training a dog to do tricks. First, you teach it to sit. Then, you teach it to shake hands. Finally, you teach it to roll over.

In the world of Artificial Intelligence (AI), this is called Continual Learning. The goal is for the AI to keep learning new things without forgetting the old tricks.

However, AI has a problem. Sometimes, after learning a few new tricks, the AI gets "stuck." It remembers the old tricks perfectly, but it loses the ability to learn new ones. It becomes rigid, like a statue that can't move its joints. Scientists call this "Loss of Plasticity."

This paper argues that the secret to keeping an AI flexible isn't just about giving it more brain power or better training methods. It's about changing the activation function.

What is an "Activation Function"?

Think of an AI as a massive team of tiny workers (neurons) passing notes to each other.

  • The Input: A note arrives at a worker's desk.
  • The Activation Function: This is the worker's decision rule. It decides: "Do I pass this note along? Do I shout it out? Or do I throw it in the trash?"

If the decision rule is too strict, the worker throws away too many notes (the AI stops learning). If the rule is too chaotic, the worker shouts everything (the AI gets confused).

The Problem: The "Dead Zone"

The most common decision rule used in AI today is called ReLU. Imagine a worker who says:

"If the note is positive (good news), I'll pass it on. If the note is negative (bad news), I'll throw it in the trash and never look at it again."

In a stable world, this works fine. But in a changing world (Continual Learning), things get tricky. Sometimes, the "bad news" (negative numbers) actually contains the key to learning a new trick. If the worker throws it away, the AI loses that information forever. The worker becomes a "dead unit"—a zombie neuron that never fires again. The AI's brain fills up with these zombies, and it can't learn anything new.
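The "dead zone" described above can be sketched in a few lines of Python (a minimal illustration, not code from the paper). The key point is that a negative input gives not only zero output but also a zero gradient, so the neuron receives no learning signal at all:

```python
def relu(x):
    # Pass positive "notes" through unchanged; discard negative ones.
    return x if x > 0 else 0.0

def relu_grad(x):
    # The learning signal (gradient): 1 for positive inputs, 0 for negative.
    # A neuron whose inputs stay negative gets a zero gradient every step --
    # this is the "dead unit" problem.
    return 1.0 if x > 0 else 0.0

print(relu(2.0), relu_grad(2.0))    # positive input: passed on, can still learn
print(relu(-3.0), relu_grad(-3.0))  # negative input: silenced, no learning signal
```

Because the gradient is exactly zero on the negative side, nothing in training can ever nudge a stuck neuron back to life, which is why dead units accumulate as tasks change.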

The Solution: The "Goldilocks" Zone

The authors of this paper discovered that the best decision rule isn't "all or nothing." It needs to be just right.

They found a "Goldilocks Zone" for how workers should handle bad news:

  1. Don't throw it away completely: Even if the note is negative, the worker should still pass a tiny version of it along. This keeps the worker "alive" and ready to learn.
  2. Don't scream it too loud: If the worker amplifies the negative note too much, it causes chaos and instability.
  3. Be smooth: The transition from "passing good news" to "passing bad news" should be a smooth curve, not a sharp, jagged cliff.

The New Tools: Smooth-Leaky & Randomized Smooth-Leaky

Based on this, the authors invented two new "decision rules" (activation functions):

  1. Smooth-Leaky: Imagine a worker who usually passes good news, but when bad news comes, they don't throw it away. Instead, they gently leak a little bit of it through a small crack in the door. This keeps the door from jamming shut.
  2. Randomized Smooth-Leaky: This is like having a team of workers where, every time a note arrives, they randomly decide how much of the bad news to leak. Sometimes a little, sometimes a bit more. This randomness keeps the team on their toes and prevents them from getting stuck in a rut.
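One plausible way to realize these two rules in code is sketched below. This is our own illustrative construction, not the paper's exact formulas: it blends a small linear "leak" with a smooth softplus curve, and the randomized variant resamples the leak strength on every call:

```python
import math
import random

def softplus(x):
    # Numerically stable softplus: log(1 + e^x), a smooth version of max(x, 0).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def smooth_leaky(x, alpha=0.1):
    # Illustrative smooth-leaky curve (an assumption, not the paper's definition):
    # ~x for large positive inputs, ~alpha * x for large negative ones,
    # with a smooth transition in between -- bad news leaks through quietly.
    return alpha * x + (1.0 - alpha) * softplus(x)

def randomized_smooth_leaky(x, low=0.05, high=0.3, rng=random):
    # Resample the leak strength each call, keeping units "on their toes".
    return smooth_leaky(x, alpha=rng.uniform(low, high))
```

Note how this satisfies all three Goldilocks rules: negative inputs are never fully discarded (the `alpha * x` leak), the leak stays small (alpha well below 1), and the transition around zero is a smooth curve rather than ReLU's sharp corner.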

Why Does This Matter?

The authors tested these new rules in two very different worlds:

  • The Classroom (Supervised Learning): Teaching the AI to recognize different types of images one after another.
  • The Video Game (Reinforcement Learning): Teaching an AI to walk, run, and jump in a physics simulation that changes over time.

The Result?
In both cases, the AI using the new "Smooth-Leaky" rules kept learning new tricks for much longer. It didn't get "stuck" or forget how to adapt. It remained flexible, like a gymnast, rather than rigid like a statue.

The Big Takeaway

For a long time, scientists thought the way to fix AI learning problems was to build bigger brains or use smarter training algorithms. This paper says: "Stop overcomplicating it."

Sometimes, the solution is as simple as changing the personality of the neurons. By making them slightly more open to "bad news" (negative inputs) and keeping their decision-making process smooth, we can keep AI flexible and ready to learn forever.

In short: To keep an AI young and adaptable, don't let its neurons go to sleep. Give them a gentle nudge to keep working, even when things get tough.
