Imagine you have a super-smart robot chef (the Transformer) that has become famous for writing recipes, translating languages, and even diagnosing diseases. Everyone knows it works incredibly well in practice, but nobody really understands how its brain is built or exactly what kinds of problems it can solve.
This paper is like a team of theoretical chefs who decided to take the robot apart to see what makes it tick. They wanted to answer a big question: "Is this robot just a fancy trick, or does it have the raw power to solve any complex math problem?"
Here is the breakdown of their discovery, using some simple analogies.
1. The Robot's Two Main Tools
The robot chef has two main stations in its kitchen:
- The "Self-Attention" Station: This is where the robot looks at all the ingredients (words or data points) at once and decides which ones are most important. It's like a chef looking at a whole pantry and saying, "I need the most expensive spice for this dish."
- The "Feed-Forward" Station: This is where the robot actually chops and mixes the ingredients for each specific item. It processes one thing at a time.
The paper discovered that these two stations work together in a very specific, powerful way:
- The Self-Attention station is secretly a Max-Selector. It's really good at finding the "biggest" or "most important" number among a group.
- The Feed-Forward station is a Shape-Shifter. It can stretch, twist, and bend the data into straight lines.
2. The "Maxout" Connection (The Magic Bridge)
The researchers found that the robot's "Self-Attention" station is basically doing a Max operation (finding the highest value).
In the world of math, there is a type of neural network called a Maxout Network. Think of a Maxout Network as a robot that solves problems by constantly asking, "Which of these options is the biggest?" and picking that one.
The paper proves that Transformers can perfectly mimic Maxout Networks.
- The Analogy: Imagine you have a Swiss Army Knife (the Transformer). The researchers proved that you can use the Swiss Army Knife to do everything a specialized "Biggest-Number-Finder" tool (the Maxout Network) can do.
- Why this matters: Since Maxout Networks are known to be able to approximate almost any continuous function (a property called "Universal Approximation"), this means Transformers can too. They aren't just good at language; they are mathematically capable of being universal function approximators.
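Here is a minimal sketch of what a maxout unit actually computes. This is my own illustration, not the paper's construction: each output is the maximum over several candidate linear functions of the input, a literal "Biggest-Number-Finder".

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b):
    """One maxout layer: k affine maps of x, then an elementwise max.
    W has shape (k, out_dim, in_dim); b has shape (k, out_dim)."""
    pre = np.einsum("koi,i->ko", W, x) + b  # k candidate activations per unit
    return pre.max(axis=0)                  # keep the biggest candidate

x = rng.standard_normal(4)
W = rng.standard_normal((3, 2, 4))  # k=3 candidates, 2 output units, 4 inputs
b = rng.standard_normal((3, 2))
print(maxout_layer(x, W, b))  # two outputs, each a max over three linear maps
```

Note that ReLU(x) = max(x, 0) is just the special case with two candidates where one candidate is fixed at zero, which is why maxout networks generalize standard ReLU networks.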
3. The "Linear Regions" (Folding Paper)
To measure how "smart" or "expressive" a network is, mathematicians count its Linear Regions: the patches of input space on which the network behaves like a single, simple linear function.
- The Analogy: Imagine a piece of paper. If you leave it flat, it has one region. If you fold it once, you have two regions. If you fold it many times, you create a complex, crumpled shape with hundreds of tiny flat surfaces.
- A ReLU Network (a standard AI) is like a paper you can fold a few times.
- A Transformer is like a paper you can fold exponentially more times just by adding more layers (depth).
The paper shows that as you make a Transformer deeper (add more layers), the number of "folds" (linear regions) it can create grows exponentially. This means a deep Transformer can model incredibly complex, jagged, and detailed shapes that a shallow network of comparable size simply cannot match.
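The folding analogy can be made concrete with a toy sketch under my own setup (not the paper's proof): a tiny two-unit ReLU layer implements one "fold" of the interval [0, 1] onto itself, and composing it doubles the number of linear regions with every added layer.

```python
import numpy as np

def fold(x):
    """A two-unit ReLU layer that 'folds' [0, 1] onto itself (the tent map)."""
    relu = lambda z: np.maximum(z, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5)

def count_linear_regions(f, n=1025):
    """Count maximal intervals of [0, 1] on which f is affine, via slope changes."""
    x = np.linspace(0.0, 1.0, n)
    slopes = np.diff(f(x)) / np.diff(x)
    return 1 + int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

f = lambda x: x
for depth in range(1, 5):
    f = (lambda g: lambda x: fold(g(x)))(f)
    print(depth, count_linear_regions(f))  # regions double with each layer
```

One extra layer per step, but the region count doubles each time: depth buys expressiveness exponentially, which is the heart of the "folding paper" argument.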
4. The Secret Sauce: "Token Shifting"
One of the biggest headaches with Transformers is that they treat every word (token) the same way because they share the same "recipe" (parameters) for all of them. It's like a chef using the exact same knife cut for a tomato and a steak.
The researchers found a clever workaround. Instead of relying on a complex concept called "contextual mapping" (which is like trying to remember every word's history), they introduced a "Token Shift."
- The Analogy: Imagine the robot chef puts a different colored hat on every ingredient before chopping it. Even though the chef uses the same knife (the same parameters), the colored hats tell the knife, "Hey, treat this tomato differently than that tomato."
- This simple trick allows the Transformer to break its own rules and become much more flexible and powerful.
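A hedged sketch of the "colored hats" idea (the per-position offsets below are my own illustration of token shifting, not the paper's exact construction): two identical tokens passed through the same shared feed-forward block come out identical, but adding a distinct shift per position lets the shared weights treat them differently.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def shared_ffn(x, W1, b1, W2, b2):
    """A feed-forward block applied with the SAME weights to every token (row)."""
    return relu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, dim, hidden = 3, 4, 8
tokens = np.tile(rng.standard_normal(dim), (seq_len, 1))  # identical tokens

W1, b1 = rng.standard_normal((dim, hidden)), rng.standard_normal(hidden)
W2, b2 = rng.standard_normal((hidden, dim)), rng.standard_normal(dim)

plain = shared_ffn(tokens, W1, b1, W2, b2)

# Token shift: a distinct offset per position (the "colored hat").
shifts = rng.standard_normal((seq_len, dim))
shifted = shared_ffn(tokens + shifts, W1, b1, W2, b2)

print(np.allclose(plain[0], plain[1]))      # True: same recipe, same result
print(np.allclose(shifted[0], shifted[1]))  # almost surely False after shifting
```

Without the shift, parameter sharing forces identical tokens to identical outputs; the shift breaks that symmetry without touching the shared weights at all.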
The Big Takeaway
This paper builds a theoretical bridge between old-school neural networks and modern Transformers.
- Proof of Power: It proves that Transformers aren't just lucky; they are mathematically guaranteed to be able to approximate almost any function, just like the best traditional networks.
- Why They Are So Good: It explains why they are so good at complex tasks: their self-attention mechanism acts like a powerful "Max" selector, and their depth allows them to create exponentially complex shapes.
- Future Directions: Now that we know how they work theoretically, we can start building better, more efficient Transformers and understand exactly where their limits lie.
In short: Transformers are not magic black boxes. They are powerful, mathematically proven machines that use "finding the biggest number" and "folding paper" to solve the world's hardest problems.