Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

This paper argues that a significant portion of the nonlinearity in transformer MLP layers is redundant and context-dependent. A lightweight gating mechanism can dynamically replace these computations with linear surrogates to cut computational waste, or, when applied strategically with full retraining, actively improve model performance by eliminating harmful nonlinearities.

Peter Balogh

Published 2026-03-05

Imagine you run a massive, high-tech factory that produces the next word in a sentence. This factory is called a Transformer, and it's the engine behind AI like the one you're talking to right now.

Inside this factory, there are dozens of workers (layers) who process information. At every single station, there is a special team called the MLP (Multilayer Perceptron). These workers are famous for being incredibly complex. They take a piece of information, twist it, turn it, and apply a complicated, non-linear "magic spell" to it before passing it on.

The big assumption in the AI world has always been: "We need every single one of these complex magic spells. If we stop using them, the factory will collapse."

This paper, titled "Half the Nonlinearity Is Wasted," is like a detective story where the author walks into the factory, watches the workers, and says: "Actually, you're wasting about half your budget. Most of the time, these workers are just doing simple math, but they're dressed up in expensive suits."

Here is the breakdown of the findings using simple analogies:

1. The "Wanamaker" Problem

The author starts with a quote famously attributed to the department-store magnate John Wanamaker: "Half the money I spend on advertising is wasted; the trouble is I don't know which half."

The author argues that in AI, we do know which half is wasted. In many models (specifically the GPT-2 family), about 40% to 70% of the complex calculations are unnecessary. The workers are often just doing simple multiplication and addition, but the factory forces them to do a full, complicated dance every single time.

2. The "Gatekeeper" Experiment

To prove this, the author built a tiny, cheap Gatekeeper (a simple decision-maker) for every station in the factory.

  • The Job: Before a worker does their complex dance, the Gatekeeper looks at the incoming information and asks: "Do you really need the full complex dance, or is a simple calculation enough?"
  • The Result: The Gatekeeper could send about 40% of the work to a "Simple Mode" (just a basic math formula) without the factory making any mistakes. In fact, at some stations, forcing the workers to do the simple math actually made the factory better because it stopped them from overthinking and making up nonsense.
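In code, the Gatekeeper is little more than one dot product per token. Below is a minimal NumPy sketch of the routing idea; the shapes, the untrained gate weights, and the threshold are all illustrative assumptions, not the paper's actual gate or surrogate.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the MLP's "complex dance" nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gated_mlp(h, W_in, W_out, W_lin, gate_w, threshold=0.0):
    """Per-token routing between a full nonlinear MLP and a linear surrogate.

    h:      (tokens, d)   hidden states entering the MLP station
    W_in:   (d, 4d)       up-projection of the full MLP
    W_out:  (4d, d)       down-projection of the full MLP
    W_lin:  (d, d)        cheap linear surrogate ("Simple Mode")
    gate_w: (d,)          tiny linear gate scoring each token's context
    """
    scores = h @ gate_w                       # one dot product per token
    use_full = scores > threshold             # the Gatekeeper's yes/no decision
    out = h @ W_lin                           # default: Simple Mode for everyone
    out[use_full] = gelu(h[use_full] @ W_in) @ W_out  # Complex Mode where needed
    return out, use_full
```

Note that the gate reads the hidden state `h`, the running summary of the sentence so far, which is exactly why the decision ends up being contextual rather than a per-word lookup.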

3. The Big Misconception: "It's Not About the Word"

The most surprising part of the story is how the Gatekeeper decides.

  • The Wrong Guess: The author first thought the Gatekeeper was looking at what the word was.

    • Analogy: "Oh, if the word is 'the' (a common word), we can use simple math. If it's 'elephant' (a complex word), we need the full dance."
    • The Reality: This was wrong. The Gatekeeper failed miserably when tested on new books or different topics. The word "the" sometimes needed a complex dance and sometimes didn't, depending entirely on the sentence.
  • The Right Answer: The Gatekeeper was actually looking at the context (the story so far).

    • Analogy: Imagine you are reading a mystery novel. The word "bank" could mean a river bank or a money bank.
      • If the sentence is "He sat on the river bank," the context is clear. The factory can use "Simple Mode."
      • If the sentence is "He robbed the bank," the context is tricky. The factory needs "Complex Mode."
    • The Gatekeeper isn't checking the word itself; it's checking what the previous workers have already figured out. It's a contextual judgment, not a dictionary lookup.

4. The "Factory Layout" Matters

The author tested two different factory designs: GPT-2 and Pythia.

  • GPT-2 (The Efficient Factory): This design is very "linear." It's like a conveyor belt where the workers mostly just pass things along. You can replace half the complex workers with simple ones and the factory runs just as smoothly.
  • Pythia (The Chaotic Factory): This design is more "parallel." The workers are trying to do more independent, complex work. It's harder to simplify them. However, even here, the middle of the factory was found to be mostly doing simple work, while the very beginning and very end of the line needed the full complexity.
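One simple way to put a number on "how much of the dance is just simple math" is to fit the best possible linear map from a block's inputs to its outputs and measure how much of the output that fit explains. The sketch below is a hypothetical R²-style probe in that spirit, not the paper's exact metric.

```python
import numpy as np

def linearity_score(X, Y):
    """Fraction of a block's output variance explained by the best linear map.

    X: (n, d) inputs the block saw; Y: (n, d) its actual outputs.
    Returns an R^2-like value; near 1 means the block is mostly doing linear work.
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # best-fit linear surrogate
    resid = Y - X @ W                           # what the linear map cannot mimic
    return 1 - resid.var() / Y.var()
```

Run over every station in the factory, a probe like this is what lets you say which layers are wearing expensive suits to do simple arithmetic.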

5. The "Surgery" Experiment

To prove this wasn't just a trick, the author performed surgery on the factory.

  • They took a trained model and permanently replaced the middle workers with simple, frozen math formulas.
  • Result: The factory didn't just survive; it got better. By removing the "overthinking" workers in the middle, the remaining workers had to focus harder, and the whole system became more efficient and accurate.
  • With a bit of extra training, the "simplified" model beat the original complex model by a significant margin (17% better).
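The surgery can be imitated on a toy model: record what a middle block actually does on sample inputs, fit a frozen linear formula to that behavior by least squares, and bolt the formula in permanently. Everything below (the ReLU stand-in for the nonlinearity, the sizes, the three-block residual stack) is an illustrative assumption, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 500
X = rng.normal(size=(n, d))

# A toy three-block residual stack standing in for the factory.
blocks = [rng.normal(scale=0.2, size=(d, d)) for _ in range(3)]

def mlp(h, W):
    # A ReLU block as a stand-in for the MLP's nonlinearity.
    return np.maximum(h @ W, 0.0)

def forward(h, surrogate=None):
    for i, W in enumerate(blocks):
        if surrogate is not None and i == 1:
            h = h + h @ surrogate        # surgery: frozen linear middle block
        else:
            h = h + mlp(h, W)            # normal: nonlinear block on the stream
    return h

# Fit the frozen surrogate on the middle block's actual input/output pairs.
h1 = X + mlp(X, blocks[0])               # activations entering the middle block
target = mlp(h1, blocks[1])              # what the nonlinear block contributes
W_lin, *_ = np.linalg.lstsq(h1, target, rcond=None)

# How much does the factory's final output drift after the operation?
drift = np.abs(forward(X, surrogate=W_lin) - forward(X)).mean()
```

In the paper's experiments the swapped-in blocks are then fine-tuned with the rest of the network, which is where the reported gains come from; this sketch only shows the transplant itself.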

The Takeaway

The paper concludes that we have been overpaying for complexity.

  • The Myth: Every word needs a complex, nonlinear brain.
  • The Truth: Most words, in most contexts, just need a simple nudge. The complex "brain" is only needed for the rare, tricky moments where the context is confusing.

The Future: Instead of building every factory station with the same expensive, complex machinery, we should build smart factories.

  • Put the super-complex, expensive machinery at the entrance (to understand the start of the sentence) and the exit (to make the final decision).
  • Use cheap, simple, linear machinery for the middle of the sentence, where things are usually straightforward.

This would save massive amounts of computing power (money and energy) and could lead to smarter, faster AI in the future.