Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

Imagine you run a massive, high-tech factory that produces the next word in a sentence. This factory is called a Transformer, and it's the engine behind AI like the one you're talking to right now.

Inside this factory, there are thousands of workers (layers) who process information. At every single station, there is a special team called the MLP (Multilayer Perceptron). These workers are famous for being incredibly complex. They take a piece of information, twist it, turn it, and apply a complicated, non-linear "magic spell" to it before passing it on.

The big assumption in the AI world has always been: "We need every single one of these complex magic spells. If we stop using them, the factory will collapse."

This paper, titled "Half the Nonlinearity Is Wasted," is like a detective story where the author walks into the factory, watches the workers, and says: "Actually, you're wasting about half your budget. Most of the time, these workers are just doing simple math, but they're dressed up in expensive suits."

Here is the breakdown of the findings using simple analogies:

1. The "Wanamaker" Problem

The author starts with a famous quote from an old advertising executive: "Half the money I spend on advertising is wasted; the trouble is I don't know which half."

The author argues that in AI, we do know which half is wasted. In many models (specifically the GPT-2 family), about 40% to 70% of the complex calculations are unnecessary. The workers are often just doing simple multiplication and addition, but the factory forces them to do a full, complicated dance every single time.

2. The "Gatekeeper" Experiment

To prove this, the author built a tiny, cheap Gatekeeper (a simple decision-maker) for every station in the factory.

The Job: Before a worker does their complex dance, the Gatekeeper looks at the incoming information and asks: "Do you really need the full complex dance, or is a simple calculation enough?"
The Result: The Gatekeeper could send about 40% of the work to a "Simple Mode" (just a basic math formula) without the factory making any mistakes. In fact, at some stations, forcing the workers to do the simple math actually made the factory better because it stopped them from overthinking and making up nonsense.

3. The Big Misconception: "It's Not About the Word"

The most surprising part of the story is how the Gatekeeper decides.

The Wrong Guess: The author first thought the Gatekeeper was looking at what the word was.
- Analogy: "Oh, if the word is 'the' (a common word), we can use simple math. If it's 'elephant' (a complex word), we need the full dance."
- The Reality: This was wrong. The Gatekeeper failed miserably when tested on new books or different topics. The word "the" sometimes needed a complex dance and sometimes didn't, depending entirely on the sentence.
The Right Answer: The Gatekeeper was actually looking at the context (the story so far).
- Analogy: Imagine you are reading a mystery novel. The word "bank" could mean a river bank or a money bank.
  - If the sentence is "He sat on the river bank," the context is clear. The factory can use "Simple Mode."
  - If the sentence is "He robbed the bank," the context is tricky. The factory needs "Complex Mode."
- The Gatekeeper isn't checking the word itself; it's checking what the previous workers have already figured out. It's a contextual judgment, not a dictionary lookup.

4. The "Factory Layout" Matters

The author tested two different factory designs: GPT-2 and Pythia.

GPT-2 (The Efficient Factory): This design is very "linear." It's like a conveyor belt where the workers mostly just pass things along. You can replace half the complex workers with simple ones and the factory runs smoother.
Pythia (The Chaotic Factory): This design is more "parallel." The workers are trying to do more independent, complex work. It's harder to simplify them. However, even here, the middle of the factory was found to be mostly doing simple work, while the very beginning and very end of the line needed the full complexity.

5. The "Surgery" Experiment

To prove this wasn't just a trick, the author performed surgery on the factory.

They took a trained model and permanently replaced the middle workers with simple, frozen math formulas.
Result: The factory didn't just survive; it got better. By removing the "overthinking" workers in the middle, the remaining workers had to focus harder, and the whole system became more efficient and accurate.
With a bit of extra training, the "simplified" model beat the original complex model by a significant margin (17% better).

The Takeaway

The paper concludes that we have been overpaying for complexity.

The Myth: Every word needs a complex, nonlinear brain.
The Truth: Most words, in most contexts, just need a simple nudge. The complex "brain" is only needed for the rare, tricky moments where the context is confusing.

The Future: Instead of building every factory station with the same expensive, complex machinery, we should build smart factories.

Put the super-complex, expensive machinery at the entrance (to understand the start of the sentence) and the exit (to make the final decision).
Use cheap, simple, linear machinery for the middle of the sentence, where things are usually straightforward.

This would save massive amounts of computing power (money and energy) and could lead to smarter, faster AI in the future.

Here is a detailed technical summary of the paper "Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget" by Peter Balogh.

1. Problem Statement

The paper challenges the fundamental assumption in Transformer architecture design: that the nonlinearity provided by Multi-Layer Perceptrons (MLPs) at every layer and token position is essential for language modeling.

The Hypothesis: It is widely believed that removing MLP nonlinearity collapses the network into a linear map incapable of learning complex language functions.
The Question: Is this nonlinearity actually necessary for every computation? Can a significant portion of MLP computations be replaced by linear approximations without degrading performance, or even by improving it?
The Goal: To quantify the "wasted" nonlinearity budget, determine if it can be dynamically routed (gated), and understand what drives the need for nonlinearity (token identity vs. context).

2. Methodology

The authors employed a systematic approach across six models (ranging from 162M to 2.8B parameters) spanning two architectures (GPT-2 and Pythia/GPT-NeoX) and three corpora.

A. Linear Approximation & All-Linear Evaluation

Surrogate Fitting: For each layer, they fitted a closed-form linear surrogate ( $\hat{f}(x) = Wx + b$ ) to the original MLP using ridge regression on 10,000 activation vectors.
Cost Measurement: They replaced the MLP with this linear surrogate and measured the perplexity (PPL) increase. This established the baseline "cost" of full linearization per layer.

B. Adaptive Gating

Mechanism: Instead of wholesale replacement, they trained a lightweight gate ( $g_\theta$ ) to decide per-token whether to use the full MLP or the linear surrogate.
Gate Architecture: The most effective gate was a simple logistic classifier with $d+1$ parameters (a single hyperplane in activation space).
Training: The gate was trained as a binary classifier to predict the "delta" ( $\delta = L_{lin} - L_{full}$ ), labeling positions where the linear path was sufficient as "linear OK."

C. Probing the Routing Signal

To understand why the gate makes decisions, the authors decomposed the MLP input into:

Token Identity ( $e_i$ ): The embedding of the current token.
Contextual Contribution ( $c_i$ ): The residual stream contribution from attention and previous layers ( $x_i - e_i$ ).
They trained separate gates on these components and tested cross-corpus stability (e.g., training on WikiText, testing on LAMBADA) to see if nonlinearity need is a property of the token or the context.

D. Progressive Linearization & Fine-Tuning

Progressive Replacement: They progressively replaced middle-layer MLPs with frozen linear matrices, fine-tuning the remaining layers after each step.
Two-Phase Gating: A final experiment combined linearization with learned gating during fine-tuning to optimize the allocation of nonlinear capacity.

3. Key Contributions

Quantification of MLP Linearity: Systematic measurement showing that a substantial fraction of MLP computations are near-linear. In GPT-2 Medium, 70% of layers can be fully linearized with <3% PPL cost.
Minimal Adaptive Gating: Demonstrated that a simple $(d+1)$ -parameter linear classifier can route 25–56% of activations to a linear path with negligible cost. Larger, complex gates provided no additional benefit.
Strong Negative Result on Token-Based Routing: Proved that nonlinearity need cannot be predicted from token identity. Cross-corpus correlation for "No-Fly" lists (tokens that always need nonlinearity) was effectively zero ( $r < 0.05$ ).
Context Dominance: Showed that the routing decision is entirely contextual. The "context-only" gate matched the full gate's performance, while the "token-only" gate performed near random chance.
Negative Regularization: Documented that at 4 of 23 GPT-2 Medium layers, the linear approximation improved perplexity over the full MLP, suggesting the original MLP was overfitting (acting as a negative regularizer).
Architecture Dependence: Revealed a stark divide: GPT-2 models are highly linearizable, while Pythia models (GPT-NeoX) have higher linearization costs, though this gap narrows at larger scales (2.8B).

4. Key Results

Linearization Costs:
- GPT-2 Medium: Layers 2–15 can be fully linearized with a cost of 1.6–2.5%. Four layers actually improved performance when linearized.
- GPT-2 Large: 11 of 36 layers beat the baseline when linearized; no layer exceeded a 3.7% cost.
- Pythia Models: Generally higher costs. However, at 2.8B scale, Layer 3 narrowly beat the baseline (−0.13%), and middle layers (L7–L15) were cheap to linearize (<4%).
- Critical Layers: The first layer (L0) in Pythia-2.8B is catastrophic to linearize (+513% PPL), whereas GPT-2 L0 is cheap (+1.77%).
Gating Performance:
- Achieved 25–56% linear routing at <1% PPL cost in GPT-2.
- In GPT-2 Medium, gating at Layer 6 improved PPL by 0.06%.
- Compound Gating: Replacing all layers simultaneously with the best gate saved ~21% of total forward-pass FLOPs with minimal PPL impact.
Progressive Linearization (Proof of Concept):
- Replacing 5 middle layers (L10–L14) with frozen linear matrices resulted in zero PPL cost after minimal fine-tuning.
- With a full training budget (117.9M tokens) on 4 linearized layers, the model achieved a 10.2% PPL improvement over the baseline.
- A two-phase gated approach (linearization + learned gating) achieved a 17.3% PPL improvement (19.00 PPL), beating a vanilla fine-tuning control.
Routing Mechanism:
- The gate works because the distribution of "nonlinearity need" ( $\delta$ ) is heavily skewed; most tokens are near-linear. The gate only needs to catch the thin tail of outliers.
- The decision is based on the contextual contribution to the residual stream, not the token embedding.

5. Significance and Implications

Architectural Redesign: The paper suggests that future Transformers should not allocate uniform nonlinear capacity. Instead, they should concentrate nonlinear parameters at boundary layers (input/output) and use linear or hybrid layers in the middle.
Efficiency: Token-based routing (e.g., "skip MLP for function words") is a dead end. Efficient inference must rely on context-based gating or architectural changes that inherently reduce nonlinearity in the middle layers.
Negative Regularization: The finding that removing nonlinearity can improve performance suggests that standard MLPs in middle layers often overfit. Reducing capacity in these layers acts as a regularizer.
Hybrid Architectures: The authors propose "Hybrid Linear-Nonlinear Layers" where a small nonlinear MLP handles the "tail" of difficult cases, while a linear projection handles the bulk of the computation.
Sequential vs. Parallel: The difference in linearizability between GPT-2 (sequential) and Pythia (parallel) suggests that sequential architectures allow MLPs to be more redundant (linear), while parallel architectures force MLPs to do more independent nonlinear work.

Conclusion: The paper demonstrates that the "nonlinearity budget" in Transformers is vastly over-allocated. By measuring, gating, and reallocating this budget, it is possible to create models that are not only more efficient but, in some cases, more accurate than their original counterparts.