The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

This paper demonstrates that MLP layers in transformer models function as binary routing switches, directing continuous signals through distinct computational paths via a consensus-and-exception neuron architecture. This mechanism explains why smooth polynomial approximations fail to capture MLP behavior, and it is validated by significant causal performance differences under ablation.

Peter Balogh

Published 2026-03-12

Here is an explanation of the paper "The Discrete Charm of the MLP" using simple language and everyday analogies.

The Big Idea: It's a Switch, Not a Smooth Curve

Imagine you are trying to describe a complex shape, like a figure-eight (the infinity symbol ∞), using only a straight ruler.

  • The Old View (Smooth Function): Most scientists thought the AI's internal "brain" (specifically the MLP layer) was like a master artist trying to draw that figure-eight by smoothing out thousands of tiny, straight lines. They believed the AI was constantly calculating a smooth, continuous curve to approximate the answer.
  • The New View (Binary Routing): This paper argues that the AI isn't drawing a smooth curve at all. Instead, it's acting like a traffic controller. It looks at a word (a token) and asks a simple Yes/No question: "Does this word need special, complicated help, or can it just pass through?"

If the answer is No, the word goes through a "fast lane" (linear processing). If the answer is Yes, it gets diverted to a "slow lane" (complex nonlinear processing).
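In code, the fast-lane/slow-lane idea is just an if/else around the MLP. Here is a minimal sketch in Python with NumPy; the gate vector, dimensions, and MLP weights are all made up for illustration and are not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def route_token(x, gate_w, mlp):
    """Toy sketch of binary routing (illustrative, not the paper's code).

    A hard gate makes a per-token Yes/No decision: if it fires, the token
    takes the "slow lane" (full nonlinear MLP update); otherwise it takes
    the "fast lane" and passes through essentially unchanged.
    """
    needs_help = float(gate_w @ x) > 0.0   # the Yes/No question
    if needs_help:
        return x + mlp(x)                  # slow lane: nonlinear update
    return x                               # fast lane: pass-through

# Hypothetical 8-dimensional token signal and a tiny one-layer ReLU MLP.
d = 8
W_in = rng.normal(size=(32, d))
W_out = rng.normal(size=(d, 32))
mlp = lambda x: W_out @ np.maximum(W_in @ x, 0.0)
gate_w = rng.normal(size=d)

x = rng.normal(size=d)
y = route_token(x, gate_w, mlp)
```

The key design point the paper argues for: the decision (`needs_help`) is discrete, even though `x` itself is a continuous vector.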

The Cast of Characters: The Committee and the Manager

To understand how this works, imagine the AI's brain layer as a large office with 3,000 employees (neurons).

  1. The "Default-On" Committee (7 Employees):
    In the later layers of the AI, there is a small group of 7 specific employees who are usually awake and working. They represent the "standard way" of thinking. For most sentences (like "The cat sat on the mat"), these 7 employees agree on what to do. When they agree, the office runs on autopilot. The complex machinery barely turns on because the answer is obvious.

  2. The "Exception Handler" (1 Employee):
    There is one special employee, let's call him N2123. He is usually asleep. He only wakes up when the 7 committee members disagree or when the situation is confusing.

    • The Magic: The paper found that N2123 and the 7 committee members are almost never awake at the same time (they are mutually exclusive on 93–98% of tokens). It's a near-perfect IF/ELSE switch.
    • The Switch: If the 7 agree → N2123 sleeps (Fast Lane). If the 7 disagree → N2123 wakes up and triggers the full, expensive machinery (Slow Lane).
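Measuring that mutual exclusivity is straightforward: for each token, check whether exactly one of "committee agrees" and "exception handler fires" is true. The sketch below simulates hypothetical boolean activations with made-up rates; in the real experiment these would be read off a transformer's MLP layer, and only the 93–98% range comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-token boolean activations (the rates here are invented).
n_tokens = 10_000
committee_on = rng.random(n_tokens) < 0.9             # committee agrees
flip = rng.random(n_tokens) < 0.05                    # ~5% overlap noise
exception_on = np.where(flip, committee_on, ~committee_on)

# Mutual exclusivity: on what fraction of tokens is exactly one of the
# two "employees" awake? XOR is true exactly when one side is on.
mutual_exclusivity = np.mean(committee_on ^ exception_on)
print(f"mutually exclusive on {mutual_exclusivity:.1%} of tokens")
```

A value near 1.0 is what an IF/ELSE switch looks like in activation statistics; a value near 0.5 would mean the two fire independently.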

The Analogy: The Airport Security Check

Think of the AI processing a sentence like an airport security line.

  • The "Smooth" Theory: Imagine security guards trying to calculate a perfect, smooth curve to decide how much time every passenger needs. They are constantly adjusting a dial from 0 to 100.
  • The "Binary" Reality (This Paper): The guards actually have a simple checklist.
    • Passenger A: "I'm just a tourist with a backpack." → Guards agree: "Easy." → Action: Walk right through the gate. (The heavy machinery stays off).
    • Passenger B: "I'm carrying a suspicious package and my ID is blurry." → Guards disagree: "Wait, is this dangerous?" → Action: The "Exception Handler" (N2123) hits the red button. The passenger is sent to the full-body scanner and a detailed interrogation.

The paper proves that the AI does this exact thing. It doesn't gradually increase the "scanning intensity"; it flips a binary switch.

Why Does This Matter? (The "Why" and "How")

1. The "Smooth" Math Failed
The researchers tried to fit the AI's behavior to smooth mathematical curves (polynomials), like trying to fit a square peg in a round hole. It failed miserably. You can't describe a traffic light with a smooth curve; it's either Red or Green. The AI's "nonlinearity" is just a bunch of traffic lights, not a smooth road.
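You can see why smooth fits fail with a toy experiment: fit the same low-degree polynomial to a hard on/off switch and to a genuinely smooth function, then compare the worst-case error. This is only an illustration of the square-peg/round-hole point, not the paper's actual fitting procedure.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 401)
switch = (x > 0).astype(float)        # traffic light: either 0 or 1
smooth = x ** 2                       # a genuinely smooth target

def max_fit_error(y, degree=5):
    """Worst-case residual of a degree-5 least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return np.max(np.abs(np.polyval(coeffs, x) - y))

err_switch = max_fit_error(switch)    # stays large near the jump
err_smooth = max_fit_error(smooth)    # essentially zero
print(f"switch: {err_switch:.3f}, smooth: {err_smooth:.2e}")
```

No matter how you tune the polynomial, it cannot track the discontinuity: the residual near the jump stays on the order of the jump itself, while the smooth target is matched almost exactly.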

2. The "Fast Lane" vs. "Slow Lane" Cost
The researchers tested what happens if they turn off the "complex machinery" (the MLP) for different types of words.

  • When the Committee Agrees (Fast Lane): Turning off the complex machinery barely hurts the AI's performance. It's like turning off the engine of a car while it's coasting downhill; it doesn't matter.
  • When the Committee Disagrees (Slow Lane): Turning off the machinery causes the AI to crash (perplexity jumps by 43%). This proves that the "Exception Handler" is doing the heavy lifting only when absolutely necessary.
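To put the 43% figure in perspective: perplexity is the exponential of the mean cross-entropy loss, so a 43% perplexity jump corresponds to a fixed loss increase of ln(1.43) nats per token. The baseline loss below is made up for illustration; only the 43% ratio comes from the paper.

```python
import math

baseline_loss = 3.00                       # hypothetical mean NLL (nats)
baseline_ppl = math.exp(baseline_loss)     # perplexity = exp(loss)

ablated_ppl = baseline_ppl * 1.43          # the reported 43% jump
ablated_loss = math.log(ablated_ppl)

extra_nats = ablated_loss - baseline_loss  # = ln(1.43), about 0.358
print(f"ablating the slow lane costs {extra_nats:.3f} nats per token")
```

Note that the extra loss is independent of the baseline: a 43% perplexity jump always means roughly 0.36 extra nats per token, which is why it is a meaningful "crash" regardless of model size.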

3. The "Developmental Arc"
The paper also looked at how this system grows as the AI gets deeper (from Layer 1 to Layer 12):

  • Layers 1–3 (The Scaffold): Simple. Just one "gatekeeper" decides if a word needs help.
  • Layers 4–6 (The Diffuse Zone): A bit messy. No clear gatekeepers; everyone is working a bit.
  • Layers 7–11 (The Decision Zone): The system crystallizes into the perfect "Committee vs. Exception Handler" switch described above. The deeper the AI goes, the more it relies on this binary voting system to handle complex decisions.

The Takeaway: A Hybrid Machine

The paper concludes that the AI is a hybrid system:

  • The Signal is Continuous: The information flowing through the wires is still a smooth, analog signal (like electricity).
  • The Decision is Discrete: The logic of what to do with that signal is digital (0 or 1, Yes or No).

It's like a digital watch: the quartz crystal inside vibrates continuously, but the logic telling you "It's 3:00 PM" is a discrete, binary decision. The AI uses this "Binary Routing of Continuous Signals" to be efficient: it saves energy by doing the easy math for simple words and only wakes up the supercomputer for the hard ones.

In short: The AI isn't trying to be a smooth mathematician. It's a smart traffic cop that knows exactly when to let traffic flow freely and when to stop it for a detailed inspection.