Memorization capacity of deep ReLU neural networks characterized by width and depth

This paper establishes the optimal trade-off between width and depth for deep ReLU neural networks to memorize N separated data points, proving that the product of the squared width and the squared depth must scale as Θ(N log(δ⁻¹)), where δ is the separation of the data.

Xin Yang, Yunfei Yang

Published Wed, 11 Ma

Imagine you are trying to teach a robot to memorize a list of N specific items, like a phone book with N names and numbers. But there's a catch: the "names" (the data points) are very similar to each other, almost like twins standing very close together in a crowded room. The closer they are, the harder it is for the robot to tell them apart without getting confused.

This paper is about figuring out the smallest possible robot (a neural network) needed to memorize these N items perfectly, depending on how "tightly packed" the items are and how many different "numbers" (labels) there are.

Here is the breakdown using simple analogies:

1. The Problem: The "Crowded Room"

Imagine you have N people standing in a room.

  • The Data: These people are the input.
  • The Labels: Each person has a specific hat color (the output).
  • The Separation (δ): How far apart the people are standing.
    • If they are standing far apart, it's easy to point to "Person A" and say "Red Hat."
    • If they are standing shoulder-to-shoulder (very small δ), it's a nightmare. You need a very sharp eye (a complex network) to distinguish them.
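
The separation δ above has a precise meaning: it is the smallest distance between any two distinct data points. Here is a minimal sketch of how one might measure it (my own illustration, not code from the paper):

```python
import numpy as np

def separation(points: np.ndarray) -> float:
    """Smallest pairwise distance between distinct rows of `points`."""
    diffs = points[:, None, :] - points[None, :, :]  # all pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)           # N x N distance matrix
    np.fill_diagonal(dists, np.inf)                  # ignore each point vs itself
    return float(dists.min())

spread_out = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
crowded = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 1.0]])

print(separation(spread_out))  # 1.0 -> the "easy" room
print(separation(crowded))     # ~0.01 -> the "nightmare" room
```

The smaller this number, the more capacity the memorizing network needs.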

2. The Robot's Anatomy: Width vs. Depth

The "robot" is a Deep Neural Network. Think of it as a factory assembly line:

  • Width (W): How many workers are working side-by-side at each station. (Broad, shallow factory).
  • Depth (L): How many stations (layers) the product has to go through. (Narrow, deep factory).

For a long time, scientists argued: "Do we need a wide factory or a deep one?" This paper answers that question by showing you can trade one for the other, like trading money for time.
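
In code, a "factory" of width W and depth L is just L stacked layers of W ReLU units each. A minimal numpy sketch of that architecture (random weights, purely to show the shape; this is not the paper's trained memorizer):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_network(x: np.ndarray, width: int, depth: int) -> float:
    """Forward pass through `depth` hidden layers of `width` ReLU units each."""
    h = x
    for _ in range(depth):
        W = rng.standard_normal((width, h.shape[0])) / np.sqrt(h.shape[0])
        b = rng.standard_normal(width)
        h = np.maximum(W @ h + b, 0.0)  # ReLU: the simple "on/off switch"
    w_out = rng.standard_normal(width)
    return float(w_out @ h)             # one scalar output (the "hat color")

y = relu_network(np.array([0.5, -0.3]), width=4, depth=3)
print(y)
```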

3. The Big Discovery: The "Goldilocks" Formula

The authors built a specific type of robot (using a "ReLU" activation, which is like a simple on/off switch) that can memorize any N people, no matter how crowded the room is.

They found a magic formula that links the size of the factory to the difficulty of the task:
Width² × Depth² ≈ Number of People × log(How crowded it is)

In plain English:

  • If the people are far apart (easy task), you can get away with a tiny robot.
  • If the people are very close together (hard task), you need a bigger robot.
  • The Trade-off: You can make the robot wider (more workers) and shorter (fewer stations), OR narrower (fewer workers) and taller (more stations). As long as the product of the squared width and the squared depth stays at the level the formula demands, the robot will work.
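
To see the trade-off concretely, here is a back-of-the-envelope sketch (my own, with constants and logarithmic factors dropped, so purely illustrative): solving Width² × Depth² ≈ N · log(1/δ) for the depth gives Depth ≈ √(N · log(1/δ)) / Width.

```python
import math

def required_depth(n_points: int, delta: float, width: int) -> int:
    """Rough depth for a given width, from Width^2 * Depth^2 ~ N * log(1/delta)."""
    budget = n_points * math.log(1.0 / delta)      # N * log(1/delta)
    return max(1, math.ceil(math.sqrt(budget) / width))

N, delta = 10_000, 1e-3
for width in (4, 16, 64, 256):
    print(width, required_depth(N, delta, width))  # wider factory -> fewer stations
```

Doubling the width roughly halves the depth you need, which is exactly the money-for-time trade described above.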

4. How the Robot Works: The "Zip Code" Trick

How does this robot actually memorize the data without getting confused? The authors used a clever three-step trick:

  1. The Projector (Step 1): Imagine taking a 3D photo of the crowded room and flattening it onto a single long line. The robot stretches the line out so that even though the people were close, they are now spaced out enough to have their own unique "address" (like a unique zip code).
  2. The Encoder (Step 2): The robot takes these unique addresses and the hat colors, and writes them down into a giant, organized ledger. It groups people into blocks and writes their "zip codes" and "hat colors" as long strings of binary numbers (0s and 1s).
  3. The Decoder (Step 3): When the robot sees a new person, it looks up their address in the ledger. It uses a "bit-extraction" technique (like a librarian pulling a specific page from a book) to find the exact hat color associated with that specific address.
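
A drastically simplified toy version of the three steps (my own illustration; the paper's actual construction does all of this inside ReLU layers with a bit-extraction network, whereas here a plain Python dictionary stands in for the ledger):

```python
import numpy as np

rng = np.random.default_rng(42)

points = rng.random((8, 3))           # 8 "people" in a 3-D room
labels = rng.integers(0, 4, size=8)   # 4 possible "hat colors"

# Step 1 (projector): a random direction flattens the room onto a line;
# with probability 1, distinct people land on distinct 1-D "addresses".
direction = rng.standard_normal(3)
addresses = [float(p @ direction) for p in points]

# Step 2 (encoder): write address -> hat color into an organized ledger.
ledger = {addr: int(lab) for addr, lab in zip(addresses, labels)}

# Step 3 (decoder): given a person, recompute their address and look it up.
query = points[5]
recovered = ledger[float(query @ direction)]
print("recovered hat color:", recovered)  # matches labels[5]
```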

5. Why This Matters: The "Optimality" Proof

The authors didn't just build a robot; they proved it's the best possible robot (almost).

  • They showed that you cannot build a significantly smaller robot to do this job. If you try to make the factory too small, it simply cannot distinguish between the crowded people.
  • They proved that their "Goldilocks" formula is the limit. You can't beat it by much, only by a tiny bit (logarithmic factors), which is like saying, "You can't build a car that gets 1000 miles per gallon; the best physics allows is 990."

Summary

This paper solves a puzzle about efficiency. It tells us exactly how much "brain power" (width and depth) a neural network needs to memorize a list of items, depending on how similar those items are.

  • If data is messy and crowded: You need a big, complex network.
  • If data is clean and spaced out: You can use a tiny, efficient network.
  • The Takeaway: You have the freedom to design your network to be wide or deep, as long as you respect the mathematical balance the authors discovered. This helps engineers build AI that is powerful but doesn't waste money on unnecessary hardware.