Covering Numbers for Deep ReLU Networks with Applications to Function Approximation and Nonparametric Regression

This paper establishes tight lower and upper bounds on the covering numbers of various classes of deep ReLU networks, filling a gap in the literature. These bounds provide fundamental insights into network capacity, improve nonparametric regression rates by removing a log⁶(n) factor, and unify the relationship between function approximation and statistical estimation.

Weigutian Ou, Helmut Bölcskei

Published 2026-03-04

Imagine you are trying to build a machine that can learn to draw any picture, recognize any face, or predict the weather. You have a giant box of Lego bricks (neural networks) to build this machine. But in the real world, you can't use an infinite number of bricks. You have limits:

  • Size: You can only use a certain number of layers (depth) and only so many bricks per layer (width).
  • Precision: Your bricks can only be cut to specific sizes (quantization).
  • Connectivity: You can't connect every brick to every other brick; some connections must be cut (sparsity).
  • Strength: The force you can apply with a brick is limited (bounded weights).

This paper is like a master architect's guidebook. It answers a very specific question: "Given these strict limits, how many different 'shapes' (functions) can my machine actually make?"

To answer this, the authors use a concept called Covering Numbers.

The "Blanket" Analogy: What is a Covering Number?

Imagine you have a giant, messy pile of sand (all the possible things your neural network could try to do). You want to cover this pile with a finite number of small, identical blankets (representing specific, pre-defined network settings).

  • The Goal: You want to use as few blankets as possible, but they must be big enough that no part of the sand is left uncovered.
  • The Count: The number of blankets you need is the Covering Number.
  • The Insight: If you need a million blankets, your network is very complex and can do many different things. If you only need ten, it's very simple and limited.
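The blanket picture can be made concrete on the simplest possible "pile of sand": the interval [0, 1] of numbers. This is a minimal sketch (not the paper's construction); each "blanket" is an interval of radius eps, and each one covers a stretch of length 2·eps, so you need about 1/(2·eps) of them.

```python
import math

def cover_interval(eps: float) -> int:
    """Minimum number of eps-balls (intervals of radius eps)
    needed to cover [0, 1]: each ball covers length 2*eps."""
    return math.ceil(1 / (2 * eps))

for eps in [0.25, 0.1, 0.01]:
    print(f"eps = {eps}: need {cover_interval(eps)} blankets")
```

Note how the count explodes as the required precision eps shrinks; for rich function classes like neural networks, the same quantity grows much faster, and the paper's contribution is to nail down exactly how fast.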

The paper's main achievement is pinning down the number of blankets needed for different types of Lego sets (networks). Before this, people only knew the "maximum" (upper bound) but didn't know the "minimum" (lower bound). It was like knowing you might need at most a million blankets, but not knowing whether you could get away with just ten. The authors proved lower bounds that match the known upper bounds, so the count is now tight: the minimum and the maximum essentially coincide.

The Three Big Discoveries

The paper explores three specific scenarios, using some clever metaphors:

1. The "Precision" Game (Quantization)

The Scenario: Imagine you are painting a picture, but you can only use 8 specific shades of blue instead of a full rainbow.
The Finding: The authors found a "tipping point."

  • If you want a very rough sketch (large error allowed), the fact that you have limited colors doesn't matter much; you can still make almost anything.
  • But if you want a hyper-realistic painting (tiny error allowed), the limited colors suddenly become a huge bottleneck. The network hits a "wall" where it simply cannot create more detail, no matter how many layers you add.
  • The Metaphor: It's like trying to write a novel using only the letters A, B, and C. For a short note, it's fine. For a 500-page book, you run out of combinations very quickly.
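The "letters A, B, and C" metaphor is really a counting argument. As an illustrative sketch (not the paper's exact bound): if a network has a fixed number of weights and each weight is stored in a fixed number of bits, there are only finitely many distinct networks you can possibly build, no matter how cleverly you train.

```python
def max_distinct_networks(num_weights: int, bits_per_weight: int) -> int:
    """Upper bound on the number of distinct networks when each of
    `num_weights` weights takes one of 2**bits_per_weight quantized
    values. This caps the covering number: you cannot cover more
    functions than you can represent."""
    return (2 ** bits_per_weight) ** num_weights

# A toy network with 3 weights at 2 bits each: at most 4^3 = 64 networks.
print(max_distinct_networks(3, 2))
```

The logarithm of this count, bits_per_weight × num_weights, is the "wall" described above: once the target precision demands more distinct functions than this budget allows, adding layers cannot help.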

2. The "Sparse" Game (Connectivity)

The Scenario: Imagine a social network where everyone could talk to everyone, but to save money, you cut 90% of the phone lines.
The Finding: The authors showed that cutting connections (sparsity) is a very efficient way to shrink the network without losing too much ability.

  • They proved that the "complexity" of the network scales with the number of active connections, not the total possible connections.
  • The Metaphor: It's like a city with a massive grid of roads. If you close most of the side streets but keep the main highways open, the city still functions almost as well as before. The "traffic" (information) can still flow where it needs to.
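The "active connections, not total connections" claim can also be sketched with a counting argument. In this illustrative model (a simplification of the paper's setting), a sparse configuration chooses which connections are active and then assigns each a quantized value; the log of the resulting count grows with the number of active connections, not the total.

```python
import math

def log2_sparse_configs(total: int, active: int, levels: int) -> float:
    """log2 of the number of weight configurations with exactly
    `active` nonzero connections out of `total` possible ones,
    each active weight taking one of `levels` quantized values:
    C(total, active) * levels**active configurations."""
    choose = math.log2(math.comb(total, active))
    assign = active * math.log2(levels)
    return choose + assign

sparse = log2_sparse_configs(total=1000, active=100, levels=16)
dense = 1000 * math.log2(16)  # every connection active
print(f"sparse: ~{sparse:.0f} bits vs dense: ~{dense:.0f} bits")
```

Cutting 90% of the connections shrinks the complexity budget dramatically, which is exactly why sparsity is such an efficient compression lever.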

3. The "Prediction" Game (Regression)

The Scenario: You are trying to predict the stock market based on past data. You want to know how many data points (samples) you need to make a good guess.
The Finding: This is the paper's "cherry on top." By knowing the exact complexity of the network (the blanket count), they could prove that deep neural networks are optimal at learning.

  • Previous research said, "You need n data points, plus a huge penalty factor involving logarithms (like log⁶ n)." It was like saying, "To learn this, you need 100 apples, plus a tax of 50 extra apples."
  • This paper removed the "tax." They proved you only need the 100 apples.
  • The Metaphor: They found a shortcut. They showed that deep networks are the most efficient learners possible, stripping away the unnecessary "overhead" that previous theories thought was required.
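To see how heavy the removed "tax" actually is, here is a sketch comparing the two kinds of bounds. The rate n^(-2β/(2β+d)) is the standard minimax shape in nonparametric regression (β is the smoothness of the target function, d the input dimension); the log_power knob adds the log⁶(n)-style inflation that earlier results carried. The specific numbers below are illustrative, not from the paper.

```python
import math

def excess_risk_bound(n: int, beta: float, d: int, log_power: int = 0) -> float:
    """Minimax-style rate n^(-2*beta/(2*beta+d)), optionally inflated
    by a log(n)**log_power factor (the 'tax' this paper removes)."""
    rate = n ** (-2 * beta / (2 * beta + d))
    return rate * math.log(n) ** log_power

n = 10 ** 6
clean = excess_risk_bound(n, beta=2.0, d=1)            # this paper
taxed = excess_risk_bound(n, beta=2.0, d=1, log_power=6)  # earlier bounds
print(f"inflation factor at n = {n}: {taxed / clean:.0f}x")
```

Even at a million samples, a log⁶(n) factor multiplies the bound by millions, so removing it is far more than cosmetic.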

Why Does This Matter?

Think of this paper as the Physics of AI.

  • For Engineers: It tells you exactly how much you can compress a model (make it smaller/faster) before it starts breaking. It answers: "Can I run this AI on a tiny smartwatch?" The math says, "Yes, but here is the exact limit."
  • For Scientists: It unifies different theories. It shows that the ability to approximate a function (draw a curve) and the ability to learn from data (predict the future) are two sides of the same coin. If you understand one, you automatically understand the other.

The Bottom Line

This paper took a messy, complicated field of mathematics and cleaned it up. It replaced vague estimates with precise, tight boundaries.

  • Before: "We think the network is complex, but we aren't sure how complex."
  • After: "We know exactly how complex it is, how much we can compress it, and exactly how much data it needs to learn."

It's the difference between guessing how many grains of sand are on a beach and having a formula that tells you the exact number, down to the grain. This allows us to build better, faster, and more reliable AI systems.
