Imagine you are trying to teach a robot to draw a very complex, multi-dimensional picture. This picture isn't just a simple sketch; it's a high-definition, 3D (or even 100D) landscape whose smoothness is controlled not just along each axis but across combinations of directions at once. In math terms, a function with this kind of bounded "mixed" smoothness is called a Korobov function.
The paper you shared is about teaching a specific type of robot brain—a ReLU Neural Network (the kind used in most modern AI)—how to copy these complex pictures with incredible speed and precision.
Here is the breakdown of what the authors discovered, using simple analogies:
1. The Problem: The "Curse of Dimensionality"
Usually, when you try to approximate a complex shape in a high-dimensional space (like a room with 100 walls instead of just 4), it gets exponentially harder. It's like trying to paint a giant mural by guessing the color of every single pixel. As the number of dimensions grows, the number of pixels explodes, and standard methods become hopelessly slow. This is the "Curse of Dimensionality."
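To make the explosion concrete, here is a tiny Python sketch (illustrative only, not from the paper) counting how many samples a regular grid needs as the dimension grows:

```python
# Illustrative only: a regular tensor-product grid with n points per
# axis in d dimensions needs n**d samples -- exponential in d.
def full_grid_points(n: int, d: int) -> int:
    """Sample count for a full grid: this is the curse of dimensionality."""
    return n ** d

for d in (1, 2, 10, 100):
    print(f"d={d}: {full_grid_points(10, d)} grid points")
```

With just 10 points per axis, 100 dimensions already demands 10^100 samples, far beyond anything computable.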
2. The Solution: The "Sparse Grid" Strategy
Instead of painting every single pixel, the authors use a strategy called Sparse Grids.
- The Analogy: Imagine you are trying to map a city. A standard map draws every single street. A Sparse Grid is like a map that only draws the major highways and the specific intersections where the action happens, ignoring the tiny alleyways that don't matter much.
- Why it works: For these specific types of "Korobov" functions, the most important details are concentrated in specific patterns. By focusing only on these "highways," the neural network can ignore the noise and learn the shape much faster.
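A rough sketch of the savings, assuming the standard Smolyak-style sparse-grid construction (the paper's exact construction may differ): instead of taking the full tensor product of grid levels, keep only the hierarchical subspaces whose levels sum below a threshold.

```python
from itertools import product

def full_grid_size(level: int, d: int) -> int:
    # Full grid: 2**level - 1 interior points per axis, tensorized over d axes.
    return (2 ** level - 1) ** d

def sparse_grid_size(level: int, d: int) -> int:
    # Keep only hierarchical subspaces whose level vector (l_1, ..., l_d)
    # satisfies l_1 + ... + l_d <= level + d - 1; each such subspace
    # contributes prod_i 2**(l_i - 1) points.
    total = 0
    for levels in product(range(1, level + 1), repeat=d):
        if sum(levels) <= level + d - 1:
            size = 1
            for l in levels:
                size *= 2 ** (l - 1)
            total += size
    return total

print(full_grid_size(6, 3), sparse_grid_size(6, 3))
```

For level 6 in three dimensions this gives 1,023 sparse-grid points versus 250,047 full-grid points, and the gap widens rapidly with the dimension.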
3. The Secret Weapon: "Bit Extraction"
The paper uses a clever trick called Bit Extraction.
- The Analogy: Think of the neural network as a master lock-picker. The "target function" (the picture you want to copy) is a complex combination lock. The bit extraction technique allows the network to "pick" the lock by reading the binary code (the 1s and 0s) of the input numbers.
- How it helps: By reading the "bits" of the input, the network can construct a very precise approximation of the function, almost like a digital zoom that gets sharper and sharper the more layers (depth) and width (size) you give it.
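Here is a plain-Python sketch of what "reading the bits" means (the paper realizes this step with ReLU layers; this is just the underlying arithmetic):

```python
def extract_bits(x: float, k: int) -> list:
    """Return the first k binary digits of x in [0, 1).
    For example, 0.625 is 0.101 in binary."""
    bits = []
    for _ in range(k):
        x *= 2
        b = int(x)      # next binary digit, 0 or 1
        bits.append(b)
        x -= b          # keep only the fractional remainder
    return bits

print(extract_bits(0.625, 3))  # [1, 0, 1]
```

Each extra bit doubles the resolution of the approximation, which is why depth buys precision so quickly.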
4. The Result: "Super-Approximation"
The authors proved that these neural networks don't just do a "good job"; they do a super job.
- The Analogy: Imagine two students taking a test.
- Student A (Old Methods): If you double the time they study (network size), their grade improves by a little bit.
- Student B (This Paper's Method): If you double the time they study, their grade improves exponentially. They get "super-approximation."
- The Math: They showed that for Korobov functions of a given smoothness level, the error (the mistake the network makes) shrinks far faster, as a function of the network's size, than the classical approximation rates for these function classes predicted.
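The difference in decay rates can be seen in a hedged numerical toy (the exponents a and 2*a below are placeholders for illustration, not the paper's exact bounds): compare a classical polynomial error rate with a "super-approximation" rate whose exponent is doubled.

```python
def classical_error(W: int, a: float = 2.0) -> float:
    # Placeholder classical rate: error ~ W**(-a) in the parameter count W.
    return W ** -a

def super_error(W: int, a: float = 2.0) -> float:
    # Placeholder super-approximation rate: the exponent is doubled.
    return W ** (-2 * a)

for W in (10, 100, 1000):
    print(W, classical_error(W), super_error(W))
```

Doubling the exponent means that every time the network grows, the gap between the two error curves widens multiplicatively, which is the "Student B" effect in the analogy above.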
5. Why This Matters
- Beating the Curse: The most exciting part is that this "super-speed" happens regardless of how many dimensions the problem has. Usually, adding dimensions slows you down. Here, the "Sparse Grid" + "Bit Extraction" combo means the network stays efficient even in massive, complex spaces.
- Real World Impact: This helps explain why Deep Learning works so well in real life (like recognizing faces or driving cars). It suggests that for many real-world problems, neural networks are naturally equipped to find the "sparse" patterns that matter, ignoring the rest.
Summary
The paper is essentially a blueprint showing that if you give a neural network the right tools (Sparse Grids) and the right trick (Bit Extraction), it can learn complex, multi-dimensional shapes with superhuman efficiency, avoiding the usual slowdowns that happen when things get too complicated. It's like giving the robot a cheat sheet that lets it skip the boring parts and go straight to the answer.