Activation Functions, Statistics and Learning of… — Plain-Language Explanation

Imagine you are trying to teach a computer to recognize complex patterns in data, like spotting a specific face in a crowd or understanding the mood of a song. To do this, the computer uses a "brain" made of layers of simple units. One popular type of this brain is called a Restricted Boltzmann Machine (RBM).

Think of an RBM as a two-story building:

The Ground Floor (Visible Units): This is where the data lives (the pictures, the sounds, the numbers).
The Second Floor (Hidden Units): This is where the "thinking" happens. These units look at the ground floor and try to figure out the hidden rules connecting the data points.

The big question this paper asks is: How does the "personality" of the second-floor units affect what the computer learns?

In technical terms, this "personality" is called the activation function. It's a rule that decides how strongly a unit reacts to the information it receives. The authors tested four different "personalities":

Linear: A gentle, straight-line reaction.
Step: An on/off switch (like a light switch).
ReLU: A "rectified" switch that ignores negative inputs but lets positive ones through.
Exponential: A unit whose reaction grows explosively as its input increases, so even a modest rise in input produces a much stronger response.
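As a rough sketch, the four "personalities" can be written as simple functions of a unit's input. The function names and exact forms below are illustrative assumptions rather than definitions taken from the paper; in particular, the hard threshold at zero is a simplification of the Step unit, which the technical summary pairs with a sigmoid.

```python
import numpy as np

def linear(x):
    """Straight-line response: the output equals the input."""
    return np.asarray(x, dtype=float)

def step(x):
    """On/off switch: 1 for positive input, 0 otherwise (assumed threshold at 0)."""
    return (np.asarray(x, dtype=float) > 0).astype(float)

def relu(x):
    """Ignores negative input and passes positive input through unchanged."""
    return np.maximum(np.asarray(x, dtype=float), 0.0)

def exponential(x):
    """Each extra unit of input multiplies the response by e."""
    return np.exp(np.asarray(x, dtype=float))

inputs = np.array([-2.0, 0.0, 1.0, 3.0])
for f in (linear, step, relu, exponential):
    print(f.__name__, f(inputs))
```

Printing the four responses side by side shows the qualitative difference: the first three grow at most in proportion to the input, while the exponential response grows multiplicatively.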

The Core Discovery: Simple vs. Complex Relationships

The paper reveals that the choice of this "personality" changes the kinds of relationships the computer can easily understand.

The "Simple" Personalities (Linear, Step, ReLU):
Imagine these units are like people who only care about pairs. If you have a group of friends, a "Step" or "ReLU" unit is great at noticing that "Alice and Bob always hang out together." It's good at finding simple, two-person connections. However, it struggles to understand complex group dynamics, like "Alice, Bob, and Charlie only hang out together if Dave is also there." These complex, multi-person rules (called higher-order interactions) tend to get lost or become very weak in the computer's memory.

The "Explosive" Personality (Exponential):
Now, imagine a unit that reacts wildly to input. The authors found that if you use this Exponential function, the computer becomes much better at understanding those complex group dynamics. It can readily learn that "Alice, Bob, and Charlie" share a special bond that exists only when all three are present.

The "Sea of Simplicity" vs. The "Island of Complexity"

One way to picture the findings is as a vast ocean:

The Ocean of Simple Models: For most activation functions (like ReLU or Step), the computer's "natural state" is a sea of simple, decaying relationships. If you pick the connections (the weights) at random, the resulting model is almost always dominated by simple pairs. Complex rules are like rare islands in this ocean; they are so hard to find that the computer rarely stumbles upon them by accident.
The Island of Complexity: However, with the Exponential function, the landscape changes. There is a specific region of parameters (a particular range for the average strength and spread of the connections) where complex, non-decaying relationships become typical rather than rare. In this zone, complex group rules are just as common as simple pairs.

What Happens When You Train the Computer?

The researchers then simulated training these computers on different types of data to see what happened.

Learning Simple Data: When they trained the computer on data with simple rules (just pairs), all types of activation functions worked well. They all learned the simple rules effectively.
Learning Complex Data: When they trained the computer on data with complex, multi-person rules:
- Linear, Step, and ReLU: The computer failed to learn the complex rules. Instead, it forced a simple explanation onto the complex data: it approximated the group dynamics with pairs and other low-order pieces, missing the big picture.
- Exponential: The computer succeeded. Because its natural state allowed for complex rules, it was able to learn and reproduce the intricate group dynamics of the data.

The "Simplicity Bias"

The paper concludes that neural networks have a built-in "simplicity bias." They naturally prefer to learn simple, low-level connections first. This is usually a good thing, but it means they struggle with data that is fundamentally complex.

The key takeaway is that by choosing the Exponential activation function, you can break this bias. You can tune the computer so that it is naturally open to learning complex, high-order patterns that other types of networks would simply ignore or fail to represent.

In short: If you want your AI to understand simple pairs, almost any "personality" works. But if you want it to understand complex group dynamics, you need to give it the "Exponential" personality, which makes the computer naturally capable of seeing the whole picture, not just the pieces.

Technical Summary: Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines

Problem Statement
While neural networks are widely recognized for their ability to detect hidden patterns by combining numerous parameters with nonlinear activation functions, the specific impact of the form of the hidden-unit activation function on network performance and representational capacity remains under-explored theoretically. Although empirical evidence suggests that nonlinearities like ReLU improve convergence and performance compared to sigmoidal units, a systematic theoretical assessment of how different activation functions influence the statistical regularities an RBM can represent is lacking. Specifically, it is unclear how the choice of activation function affects the RBM's ability to learn and represent data structures characterized by strong higher-order interactions (interactions beyond pairwise).

Methodology
The authors exploit the duality between Restricted Boltzmann Machines (RBMs) and models of interacting binary variables. By marginalizing over the hidden units, an RBM can be mapped exactly to a model in which visible units interact directly through terms of arbitrary order $s$. The interaction terms $I_{i_1, \dots, i_s}$ are expressed analytically as functions of the hidden layer's nonlinearity and the weights connecting hidden and visible units.
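The mapping can be illustrated numerically for a tiny RBM. The sketch below assumes visible units take values in {0, 1} and that, after marginalization, each hidden unit contributes a term K(x) to the log-probability of a visible configuration, where x is the unit's summed input. Under that assumption the interaction among any set of visible units follows from inclusion-exclusion (Möbius inversion) over its subsets. The function name `interaction` and the three choices of K (a unit-variance Gaussian hidden unit, a binary hidden unit, and a unit-rate Poisson hidden unit) are illustrative and may not match the paper's exact parameterization.

```python
import itertools
import numpy as np

def interaction(K, W, subset):
    """Interaction I_S among the visible units listed in `subset`.

    W has shape (n_visible, n_hidden). Inclusion-exclusion over subsets A of S:
    I_S = sum_A (-1)^(|S| - |A|) * sum_mu K(sum_{i in A} W[i, mu]).
    """
    s = len(subset)
    total = 0.0
    for r in range(s + 1):
        for A in itertools.combinations(subset, r):
            idx = np.array(A, dtype=int)
            # Input reaching every hidden unit when exactly the units in A are on.
            x = W[idx, :].sum(axis=0)
            total += (-1) ** (s - r) * K(x).sum()
    return total

def K_linear(x):
    """Gaussian hidden unit (linear activation): quadratic term."""
    return 0.5 * x ** 2

def K_binary(x):
    """Binary hidden unit: softplus, whose derivative is the sigmoid."""
    return np.logaddexp(0.0, x)

def K_exponential(x):
    """Poisson hidden unit (exponential activation): exp(x) - 1."""
    return np.expm1(x)

rng = np.random.default_rng(0)
W = rng.normal(0.5, 0.2, size=(4, 3))  # 4 visible units, 3 hidden units
for name, K in [("linear", K_linear), ("binary", K_binary), ("exponential", K_exponential)]:
    by_order = [interaction(K, W, tuple(range(s))) for s in (1, 2, 3, 4)]
    print(name, np.round(by_order, 4))
```

With the quadratic K, every interaction of order three or higher cancels exactly, which matches the statement that Linear RBMs carry only fields and pairwise terms. With K(x) = exp(x) - 1, the sum collapses to a product of (exp(w) - 1) factors per hidden unit, so the strength of a term depends directly on how many such factors exceed 1.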

The study proceeds in two main analytical phases:

Exact Statistical Analysis: For Linear and Exponential (Poisson) activation functions, the authors derive exact analytical expressions for the expected values and correlations (moments) of the induced interaction terms when weights are drawn from a Gaussian distribution.
Small Fluctuation Expansion: For Step (Sigmoid) and ReLU activation functions, where exact solutions are more complex, the authors employ a second-order expansion of the interaction terms around the mean weight $w_0$. This approximation allows for the computation of expectations and variances for these nonlinearities.
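Both phases can be checked on a toy case. The sketch below assumes a single hidden unit, visible units in {0, 1}, and independent Gaussian weights with mean `w0` and standard deviation `sigma`. For K(x) = exp(x) - 1, the mean interaction of order s then has the closed form (exp(w0 + sigma^2/2) - 1)^s, which follows from the mean of a log-normal variable; this is a derivation under the stated assumptions, not the paper's own expression. For the softplus K (used here as a stand-in for the sigmoid case), the mean is approximated by a generic second-order expansion around `w0`, with curvatures estimated by finite differences. All function names are hypothetical.

```python
import numpy as np

def interaction(K, w):
    """s-body interaction induced by one hidden unit; w has shape (..., s)."""
    s = w.shape[-1]
    total = 0.0
    for mask in range(2 ** s):
        bits = np.array([(mask >> i) & 1 for i in range(s)], dtype=float)
        sign = (-1) ** (s - int(bits.sum()))
        total = total + sign * K((w * bits).sum(axis=-1))
    return total

def exact_mean_exponential(w0, sigma, s):
    """E[I_s] for K(x) = exp(x) - 1 and i.i.d. N(w0, sigma^2) weights."""
    return np.expm1(w0 + 0.5 * sigma ** 2) ** s

def second_order_mean(K, w0, sigma, s, h=1e-3):
    """I(w0) + (sigma^2 / 2) * sum_i d^2 I / d w_i^2, via central differences.

    Cross terms drop out because the weight fluctuations are independent.
    """
    base = np.full(s, w0)
    centre = interaction(K, base)
    curvature = 0.0
    for i in range(s):
        shift = np.zeros(s)
        shift[i] = h
        curvature += (interaction(K, base + shift) - 2.0 * centre
                      + interaction(K, base - shift)) / h ** 2
    return centre + 0.5 * sigma ** 2 * curvature

def softplus(x):
    return np.logaddexp(0.0, x)

rng = np.random.default_rng(0)

# Phase 1: closed form versus Monte Carlo for the exponential case (order 3).
w_exp = rng.normal(0.5, 0.3, size=(200_000, 3))
mc_exponential = interaction(np.expm1, w_exp).mean()
exact_exponential = exact_mean_exponential(0.5, 0.3, 3)

# Phase 2: second-order expansion versus Monte Carlo for softplus (order 2).
w_soft = rng.normal(1.0, 0.1, size=(200_000, 2))
mc_softplus = interaction(softplus, w_soft).mean()
approx_softplus = second_order_mean(softplus, 1.0, 0.1, 2)

print("exponential:", mc_exponential, "vs exact", exact_exponential)
print("softplus:   ", mc_softplus, "vs expansion", approx_softplus)
```

In this toy setting the exponential mean grows with the order s whenever exp(w0 + sigma^2/2) - 1 exceeds 1, which is at least reminiscent of the threshold condition on $\gamma_1$ reported in the results; the paper's exact definition of $\gamma_1$ is not reproduced here.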

These analytical predictions are validated against numerical simulations of training processes on specific ground-truth distributions, including decaying interaction models (where interaction strength decreases with order) and non-decaying models (where higher-order interactions are significant).
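For a handful of visible units, a comparison against a ground truth can be done exactly by enumerating every configuration. The sketch below builds one decaying and one pure three-body ground truth over three binary units and measures the Kullback-Leibler divergence to a pairwise-only candidate. The coupling strengths, the helper names, and the hand-picked (not fitted) candidate model are all illustrative assumptions, not the paper's experimental setup.

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n configurations of n binary (0/1) visible units."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def distribution(states, interactions):
    """p(v) proportional to exp(sum_S I_S * prod_{i in S} v_i).

    `interactions` maps tuples of unit indices to interaction strengths.
    """
    log_p = np.zeros(len(states))
    for subset, strength in interactions.items():
        log_p += strength * states[:, list(subset)].prod(axis=1)
    log_p -= np.logaddexp.reduce(log_p)  # normalise in log space
    return np.exp(log_p)

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions on the same states."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

states = all_states(3)
pairs = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}

decaying_truth = distribution(states, {**pairs, (0, 1, 2): 0.1})  # weak 3-body term
three_body_truth = distribution(states, {(0, 1, 2): 2.0})         # pure 3-body term
pairwise_model = distribution(states, pairs)                      # hand-picked candidate

kl_decaying = kl_divergence(decaying_truth, pairwise_model)
kl_three_body = kl_divergence(three_body_truth, pairwise_model)
print("KL(decaying truth   || pairwise model):", kl_decaying)
print("KL(three-body truth || pairwise model):", kl_three_body)
```

The pairwise candidate sits much closer to the decaying ground truth than to the three-body one, which is the kind of gap a trained model would have to close by representing the higher-order term.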

Key Contributions and Results

Characterization of Interaction Spaces: The paper analytically characterizes the space of representable models for four activation functions: Linear, Step, ReLU, and Exponential.
- Linear RBMs: Produce only fields and pairwise interactions; all higher-order interactions are zero.
- Exponential RBMs: Exhibit a rich interaction structure where higher-order terms are non-zero. Crucially, the expected value of interaction terms can increase exponentially with the interaction order $s$ if the parameter $\gamma_1 > 1$ (a condition determined by the mean and variance of the weights).
- Step and ReLU RBMs: While they produce higher-order interactions, the analysis shows that lower-order interactions generally dominate, and the magnitude of interactions typically decays with order.
Fluctuation Analysis: The study identifies regimes where fluctuations in interaction terms exceed their expected values. For Exponential activation, there exists a parameter region where fluctuations for higher-order interactions are larger than those for lower-order interactions, a phenomenon not observed in Linear, Step, or ReLU cases.
Learning Dynamics and "Decaying" vs. "Non-Decaying" Models:
- The authors define decaying models as those where the magnitude of interactions decreases with order, and non-decaying models where this is not true.
- General Finding: In the weak coupling regime, RBMs trained on various data tend to converge to decaying interaction models, regardless of the activation function. This suggests a "simplicity bias" where the learning process favors lower-order features.
- Exponential Exception: In specific parameter regimes (large mean weight $w_0$ or large weight variance), RBMs with Exponential activation functions enter a non-decaying regime. In this regime, the ensemble contains a significant fraction of models where higher-order interactions are comparable to or larger than lower-order ones.
- Training Performance: When trained on ground-truth data with strong non-decaying (e.g., pure three-body) interactions:
  - RBMs with Step, ReLU, or Linear activations fail to reconstruct the non-decaying structure, effectively learning the data as a decaying model (approximating higher-order terms with lower-order ones).
  - RBMs with Exponential activation successfully reconstruct the non-decaying interaction structure and achieve significantly lower Kullback-Leibler (KL) divergence, provided the parameters are within the analytically determined non-decaying regime.
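The decaying versus non-decaying distinction in the list above can be probed on a small random ensemble. The sketch below assumes visible units in {0, 1}, Gaussian weights, and a deliberately crude criterion: a sampled model counts as non-decaying when its average three-body magnitude is at least its average pairwise magnitude. That criterion, the ensemble sizes, the weight parameters, and the softplus stand-in for the Step (Sigmoid) case are illustrative assumptions rather than the paper's definitions.

```python
import itertools
import numpy as np

def interaction(K, W, subset):
    """Interaction among the visible units in `subset` (inclusion-exclusion)."""
    s = len(subset)
    total = 0.0
    for r in range(s + 1):
        for A in itertools.combinations(subset, r):
            idx = np.array(A, dtype=int)
            x = W[idx, :].sum(axis=0)  # hidden-unit inputs when only A is on
            total += (-1) ** (s - r) * K(x).sum()
    return total

def mean_abs_interaction(K, W, order):
    """Average |I_S| over all subsets S of the given order."""
    n = W.shape[0]
    values = [abs(interaction(K, W, S))
              for S in itertools.combinations(range(n), order)]
    return float(np.mean(values))

def fraction_non_decaying(K, w0, sigma, n_visible=4, n_hidden=5,
                          n_models=200, seed=0):
    """Share of random models whose 3-body terms are not weaker than pairwise."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_models):
        W = rng.normal(w0, sigma, size=(n_visible, n_hidden))
        if mean_abs_interaction(K, W, 3) >= mean_abs_interaction(K, W, 2):
            count += 1
    return count / n_models

def softplus(x):
    return np.logaddexp(0.0, x)

exp_small = fraction_non_decaying(np.expm1, w0=0.1, sigma=0.1)
exp_large = fraction_non_decaying(np.expm1, w0=1.5, sigma=0.1)
soft_large = fraction_non_decaying(softplus, w0=1.5, sigma=0.1)

print("exponential, small mean weight:", exp_small)
print("exponential, large mean weight:", exp_large)
print("softplus,    large mean weight:", soft_large)
```

Under these assumptions the exponential ensemble switches from almost entirely decaying to almost entirely non-decaying as the mean weight grows, while the softplus ensemble stays decaying at the same mean weight. This toy check only mirrors the qualitative picture; it does not reproduce the training experiments.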

Significance and Claims
The paper claims that the choice of activation function is a critical design parameter that dictates the "representational bias" of an RBM.

Theoretical Insight: The work provides a theoretical framework showing that rapidly increasing nonlinearities, specifically the Exponential function, can facilitate the representation and learning of data structures with large higher-order interaction terms. This is achieved by shifting the statistical ensemble of the RBM from a decaying to a non-decaying regime.
Simplicity Bias: The results suggest that the "simplicity bias" observed in neural networks (the tendency to learn low-order features first) may arise not only from the learning algorithm (e.g., stochastic gradient descent) but also from the inherent representational bias introduced by the activation function. Most standard activation functions (ReLU, Step) inherently favor low-order interactions.
Practical Implication: For tasks involving data with complex, high-order correlations, the Exponential activation function offers a theoretical advantage over standard nonlinearities, provided the model parameters are tuned to the specific regime where non-decaying interactions are stable.

The authors conclude that while their analysis relies on random ensembles and specific ground truths, it offers a principled basis for understanding how activation functions shape the representational landscape of RBMs, potentially guiding the design of architectures for tasks requiring the capture of high-order statistical regularities.
