Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations

This paper demonstrates that constant-depth neural networks equipped with smooth activation functions achieve smoothness adaptivity and minimax-optimal learning rates over Sobolev spaces by increasing width alone, whereas networks with non-smooth activations like ReLU require proportional depth growth to attain similar performance.

Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang

Published 2026-03-03

Imagine you are trying to teach a robot to draw a perfect, smooth curve, like the arc of a rainbow or the flow of a river. You have two types of "pens" (activation functions) to choose from:

  1. The "Lego Brick" Pen (ReLU): This is the most popular pen in the world right now. It's great at building sharp corners and straight lines. It's like stacking Lego bricks; you can build a staircase that looks like a curve if you make the steps tiny enough. But to get a really smooth curve, you need a lot of layers of bricks stacked on top of each other.
  2. The "Ink" Pen (Smooth Activations): This is the pen used in many modern, high-end AI models (like the ones writing this explanation). It draws with a continuous, flowing ink. It doesn't have sharp corners; it just flows.

The Big Question:
For a long time, scientists thought: "To draw a super-smooth curve, you need a very deep stack of Lego bricks (a deep neural network)." They believed that if you wanted to capture high levels of smoothness, you had to make the network deeper and deeper.

The Discovery:
This paper says: "Actually, you don't need the deep stack of bricks if you use the right ink."

Here is the breakdown of their findings using simple analogies:

1. The "Width vs. Depth" Trade-off

Think of a neural network as a factory assembly line.

  • Depth is the number of stations on the line.
  • Width is how many workers are at each station.

The Old Way (Lego/ReLU):
If you are using the "Lego" pen, your factory is limited. With a short assembly line (constant depth), no matter how many workers you add (width), you can only build a "staircase" that approximates the curve, and the approximation improves too slowly as you add workers. To reach the best possible accuracy on very smooth curves, you must build a taller factory (increase depth).

The New Way (Smooth/Ink):
The authors prove that if you use a "Smooth" pen (like GELU or SiLU), you can keep the assembly line short (constant depth). You don't need to build a skyscraper of a factory. Instead, you just need to hire more workers (increase the width). With enough workers, a short line can draw the curve to the best accuracy theory allows (the minimax-optimal rate), matching the smoothness of the target function.
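As a toy illustration of "constant depth, growing width" (my own sketch, not the paper's construction): fix a single hidden tanh layer, draw its inner weights at random, fit only the linear readout by least squares, and watch the error on a smooth target (sin) shrink as the layer gets wider.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)  # a very smooth function to learn

def shallow_fit_error(width, activation):
    # One hidden layer with random inner weights and biases; only the
    # linear readout is fitted (least squares). Depth stays fixed at 1.
    w = rng.normal(size=width)
    b = rng.uniform(-np.pi, np.pi, size=width)
    features = activation(np.outer(x, w) + b)    # shape (400, width)
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.max(np.abs(features @ coef - target))

err_narrow = shallow_fit_error(8, np.tanh)
err_wide = shallow_fit_error(128, np.tanh)
print(err_narrow, err_wide)  # the wider shallow net fits far better
```

Only the width changed between the two fits; the architecture stayed one hidden layer deep, which is the regime the paper analyzes.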

2. The "Adaptability" Superpower

Imagine you are a tailor making suits.

  • The Lego Tailor: If a customer wants a suit with a simple shape, a short line of tailors works. But if the customer wants a suit with incredibly complex, smooth folds, the Lego tailor says, "I need to hire more tailors, but I also need to build a whole new floor in my factory." The complexity forces the factory to grow vertically.
  • The Ink Tailor: This tailor uses a magical smooth pen. No matter how complex or smooth the customer's request is, the tailor just says, "I'll just add more workers to my current small workshop." The factory stays the same size; only the workforce grows.

This is called Smoothness Adaptivity. The smooth activation functions allow the network to automatically adjust to how "smooth" the data is without needing to change the architecture's depth.

3. Why This Matters (The "Why Should I Care?")

  • Efficiency: Building deeper networks is hard. It requires more memory, more computing power, and is harder to train. If we can get the same (or better) results with a shallow network that is just wider, we save a massive amount of energy and money.
  • No "Sparsity" Tricks: Previous theories said, "You can only get these results if you force the network to be 'sparse' (meaning most of the workers are actually sleeping and doing nothing)." This paper shows you don't need those tricks. You can use a fully active, wide network and still get the best results.
  • Real-World Proof: The authors didn't just do math; they ran experiments. They trained simple, shallow networks with smooth pens (like GELU and Tanh) and compared them to the standard Lego pen (ReLU). The smooth pens learned the smooth curves faster and with less data.

The Bottom Line

For years, the AI community believed that Depth was the magic key to learning complex, smooth patterns. This paper flips the script. It shows that Smoothness in the activation function is the real magic key.

If you want to learn a smooth function, don't build a deeper tower; switch to a smoother pen and widen your network. You get the same (or better) results with a simpler, shallower structure. It's like realizing you don't need a 10-story ladder to reach the sky; you just need a really good trampoline.
