Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations

This paper demonstrates that constant-depth neural networks equipped with smooth activation functions achieve smoothness adaptivity and minimax-optimal learning rates over Sobolev spaces by increasing width alone, whereas networks with non-smooth activations like ReLU require proportional depth growth to attain similar performance.

Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang

Published 2026-03-03

Imagine you are trying to teach a robot to draw a perfect, smooth curve, like the arc of a rainbow or the flow of a river. You have two types of "pens" (activation functions) to choose from:

  1. The "Lego Brick" Pen (ReLU): This is the most popular pen in the world right now. It's great at building sharp corners and straight lines. It's like stacking Lego bricks; you can build a staircase that looks like a curve if you make the steps tiny enough. But to get a really smooth curve, you need a lot of layers of bricks stacked on top of each other.
  2. The "Ink" Pen (Smooth Activations): This is the pen used in many modern, high-end AI models (like the ones writing this explanation). It draws with a continuous, flowing ink. It doesn't have sharp corners; it just flows.

The Big Question:
For a long time, scientists thought: "To draw a super-smooth curve, you need a very deep stack of Lego bricks (a deep neural network)." They believed that if you wanted to capture high levels of smoothness, you had to make the network deeper and deeper.

The Discovery:
This paper says: "Actually, you don't need the deep stack of bricks if you use the right ink."

Here is the breakdown of their findings using simple analogies:

1. The "Width vs. Depth" Trade-off

Think of a neural network as a factory assembly line.

  • Depth is the number of stations on the line.
  • Width is how many workers are at each station.

The Old Way (Lego/ReLU):
If you are using the "Lego" pen, your factory is limited. With a short assembly line (constant depth), no matter how many workers you add (width), you can only build a "staircase" that approximates the curve, and the approximation improves too slowly as you add workers. To reach the best possible accuracy on very smooth curves, you must build a taller factory (increase depth).

The New Way (Smooth/Ink):
The authors prove that if you use a "Smooth" pen (like GELU or SiLU), you can keep the assembly line short (constant depth). You don't need to build a skyscraper of a factory. Instead, you just need to hire more workers (increase the width). With enough workers, a short line can draw the curve to the best accuracy theory allows (the minimax-optimal rate), matching the smoothness of the target function.
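As a toy illustration of "constant depth, growing width" (my own sketch, not the paper's construction): fix a single hidden tanh layer, draw its inner weights at random, fit only the linear readout by least squares, and watch the error on a smooth target (sin) shrink as the layer gets wider.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)  # a very smooth function to learn

def shallow_fit_error(width, activation):
    # One hidden layer with random inner weights and biases; only the
    # linear readout is fitted (least squares). Depth stays fixed at 1.
    w = rng.normal(size=width)
    b = rng.uniform(-np.pi, np.pi, size=width)
    features = activation(np.outer(x, w) + b)    # shape (400, width)
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.max(np.abs(features @ coef - target))

err_narrow = shallow_fit_error(8, np.tanh)
err_wide = shallow_fit_error(128, np.tanh)
print(err_narrow, err_wide)  # the wider shallow net fits far better
```

Only the width changed between the two fits; the architecture stayed one hidden layer deep, which is the regime the paper analyzes.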

2. The "Adaptability" Superpower

Imagine you are a tailor making suits.

  • The Lego Tailor: If a customer wants a suit with a simple shape, a short line of tailors works. But if the customer wants a suit with incredibly complex, smooth folds, the Lego tailor says, "I need to hire more tailors, but I also need to build a whole new floor in my factory." The complexity forces the factory to grow vertically.
  • The Ink Tailor: This tailor uses a magical smooth pen. No matter how complex or smooth the customer's request is, the tailor just says, "I'll just add more workers to my current small workshop." The factory stays the same size; only the workforce grows.

This is called Smoothness Adaptivity. The smooth activation functions allow the network to automatically adjust to how "smooth" the data is without needing to change the architecture's depth.

3. Why This Matters (The "Why Should I Care?")

  • Efficiency: Building deeper networks is hard. It requires more memory, more computing power, and is harder to train. If we can get the same (or better) results with a shallow network that is just wider, we save a massive amount of energy and money.
  • No "Sparsity" Tricks: Previous theories said, "You can only get these results if you force the network to be 'sparse' (meaning most of the workers are actually sleeping and doing nothing)." This paper shows you don't need those tricks. You can use a fully active, wide network and still get the best results.
  • Real-World Proof: The authors didn't just do math; they ran experiments. They trained simple, shallow networks with smooth pens (like GELU and Tanh) and compared them to the standard Lego pen (ReLU). The smooth pens learned the smooth curves faster and with less data.

The Bottom Line

For years, the AI community believed that Depth was the magic key to learning complex, smooth patterns. This paper flips the script. It shows that Smoothness in the activation function is the real magic key.

If you want to learn a smooth function, don't build a deeper tower; switch to a smoother pen and widen your network. You get the same (or better) results with a simpler, shallower structure. It's like realizing you don't need a 10-story ladder to reach the sky; you just need a really good trampoline.
