Imagine you are hiring a team of detectives to solve a mystery. Before they even look at a single clue, you have to decide how to set them up. Do you give them a blank notebook and tell them to start with a completely neutral mind? Or do you give them a hunch, a strong gut feeling about who the culprit might be?
For a long time, scientists building Artificial Intelligence (AI) believed that the best way to start a neural network (the "detective team") was to be completely neutral. They thought that if you initialized the network's weights (its internal settings) randomly but fairly, it would have no bias toward any answer, giving it the best chance to learn the truth from the data.
This new paper, presented at ICLR 2026, flips that idea on its head. It argues that the best starting point is actually to be heavily biased.
Here is the breakdown of the paper's discoveries using simple analogies:
1. The Two Ways to Look at a Network
The paper connects two different ways scientists study these networks:
- The "Signal" View (Mean-Field Theory): This looks at how information flows through the network. If the signal is too weak, it dies out (vanishing gradients). If it's too strong, it explodes (exploding gradients). The "Goldilocks" zone where the signal is just right is called the Edge of Chaos (EOC).
- The "Guessing" View (Initial Guessing Bias - IGB): This looks at what the network thinks before it sees any data. Does it guess "Cat" for every picture? Or does it guess "Dog" for every picture? Or is it truly undecided?
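The "guessing" view is easy to poke at in a few lines of code. Below is a minimal numpy sketch (not the paper's code; the layer sizes and the `sigma_w` gain are illustrative choices): an untrained ReLU network classifies 1000 random inputs, and the tally of its guesses is often noticeably skewed toward one class rather than spread evenly.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, sigma_w):
    # Random weights scaled by 1/sqrt(fan_in), with an overall gain sigma_w
    # (a hypothetical knob controlling how "loud" the initialization is)
    return [rng.normal(0.0, sigma_w / np.sqrt(n_in), (n_in, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(ws, x):
    for w in ws[:-1]:
        x = np.maximum(x @ w, 0.0)   # ReLU hidden layers
    return x @ ws[-1]                 # linear readout

# Untrained 4-layer ReLU net with 10 output classes, fed pure noise
ws = init_mlp([100, 256, 256, 10], sigma_w=2.0)
x = rng.normal(size=(1000, 100))      # 1000 random "images"
preds = forward(ws, x).argmax(axis=1) # the network's guess per input
counts = np.bincount(preds, minlength=10)
print(counts)  # often far from the uniform 100-per-class split
```

Re-running with a fresh seed redraws the weights, so a different class typically wins the skew each time; the bias is a property of the random draw, not of any particular class.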
2. The Big Discovery: Bias is the Key to Speed
The authors proved mathematically that these two views are actually the same thing. They found that the "Goldilocks" zone (where the network learns best) is exactly the same place where the network is most biased.
The Analogy of the Overconfident Detective:
Imagine a detective who, before seeing any evidence, is 99% sure the butler did it.
- The Old View: "That's bad! They are biased. They won't learn the truth."
- The New View: "Actually, that's perfect!"
Why? Because if the detective starts with a strong hunch (bias), they are already "in motion." When they see the first clue, they can quickly adjust their theory.
- If they start with zero bias (total neutrality), they are like a detective staring at a blank wall, unsure of where to even begin. They move very slowly.
- If they start with extreme bias, they are sprinting in a direction. Even if they are wrong, they are moving fast enough to correct course quickly once the data arrives.
3. The "Deep Prejudice" Phase
The paper introduces a concept called Deep Prejudice. This is when the network is so biased at the start that it assigns almost every input to a single class (e.g., "This is a cat," "This is a cat," "This is a cat").
Surprisingly, the paper shows that the networks that learn the fastest are the ones in this "Deep Prejudice" state.
- The Ordered Phase (Too Calm): The network is too neutral. It's like a detective who is too afraid to make a guess. The signal dies out, and the network gets stuck.
- The Chaotic Phase (Too Wild): The signal is amplified so strongly that it explodes into noise. It's like a detective screaming "The butler did it!" so loudly that they can't hear the clues.
- The Edge of Chaos (The Sweet Spot): The network is biased enough to be moving fast, but stable enough to listen to the clues. It starts with a strong prejudice, but as soon as training begins, it absorbs that bias and learns the correct answer rapidly.
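The three phases above can be seen directly by tracking how much of a signal survives a deep untrained network. This is an illustrative numpy sketch, not the paper's code; it uses the standard mean-field result that, for bias-free tanh networks, the order-to-chaos transition sits at a weight gain of `sigma_w = 1`.

```python
import numpy as np

rng = np.random.default_rng(1)

def signal_strength(sigma_w, depth=50, width=500):
    # Push one random input through `depth` random tanh layers and
    # report the surviving activation variance at the output.
    x = rng.normal(size=width)
    for _ in range(depth):
        w = rng.normal(0.0, sigma_w / np.sqrt(width), (width, width))
        x = np.tanh(w @ x)
    return float(np.mean(x ** 2))

ordered = signal_strength(0.5)  # ordered phase: signal shrinks toward zero
edge    = signal_strength(1.0)  # Edge of Chaos: signal decays only slowly
chaotic = signal_strength(2.0)  # chaotic phase: signal saturates at a large value
print(ordered, edge, chaotic)
```

The ordered run collapses to essentially nothing after 50 layers, the chaotic run settles at a large nonzero level, and the critical run sits in between: the signal neither dies nor blows up, which is exactly the "just right" regime the section describes.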
4. Why This Matters for Real Life
This changes how we should build and tune AI:
- Don't Fear the Bias: If you are tuning a new AI model, don't try to make it perfectly neutral. You actually want it to start with a "prejudice."
- The "Warm-Up" Period: When you see an AI model start training, it might look like it's making a lot of mistakes because it's stuck on one answer (the bias). The paper says: Wait! This is normal. The model is just "warming up" its muscles. If you tune the settings to be in the "Edge of Chaos," it will quickly drop that bias and learn the real patterns.
- Gradient Imbalance: The paper also notes that because the network is biased, some "classes" (answers) get all the attention while others get ignored initially. This can make training tricky, but it's a sign that the network is in the right, active state.
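The gradient imbalance is easy to illustrate with a toy calculation (hypothetical numbers, not from the paper). With a cross-entropy loss, the gradient with respect to the logits is `softmax(logits) - one_hot(label)`; if the network starts out assigning nearly everything to one class, that class's output receives large gradients on almost every example while the others barely move.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Mimic a "Deep Prejudice" network: logits where class 0 dominates every input
logits = rng.normal(size=(1000, 10))
logits[:, 0] += 5.0                      # strong built-in preference for class 0
labels = rng.integers(0, 10, size=1000)  # the true labels are uniform

# Cross-entropy gradient w.r.t. the logits: softmax probabilities minus one-hot labels
probs = softmax(logits)
one_hot = np.eye(10)[labels]
grad = probs - one_hot

per_class = np.abs(grad).mean(axis=0)    # average gradient magnitude per output unit
print(per_class)                          # class 0 dwarfs the other nine
```

In other words, the favored class soaks up most of the gradient signal at first, which is the "one class gets all the attention" effect the bullet describes; as training burns off the initial bias, the gradients rebalance.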
Summary
Think of training a neural network like teaching a child to ride a bike.
- Old Theory: You should hold the bike perfectly still and let the child find their balance from zero.
- New Theory: You should give the bike a little push (a bias) so the child is already moving. The child might wobble or lean the wrong way at first, but that momentum allows them to find their balance much faster than if they were standing still.
The takeaway: The best way to start learning is not to be a blank slate, but to have a strong (even if slightly wrong) opinion, and then be ready to change it quickly.