Neural Networks Generalize on Low Complexity Data

This paper demonstrates that feedforward ReLU neural networks, when selected via minimum description length (MDL) to interpolate data generated by a simple programming language, generalize to unseen examples with high probability. In effect, they discover complex patterns like primality testing without explicit design.

Sourav Chatterjee, Timothy Sudijono

Published 2026-03-03

Imagine you are trying to teach a robot to recognize prime numbers (like 2, 3, 5, 7, 11). You show it a list of numbers and tell it "Yes" or "No" for each one.

In the real world, we often use massive, super-complex robots (neural networks) with millions of gears and levers to do this. Surprisingly, even if you give these robots a huge list of numbers and let them memorize the answers perfectly, they often still get new, unseen numbers right. But sometimes, if you give them a list of random gibberish, they memorize that too and fail completely.

The big mystery: Why do they work on "real" data but fail on "random" data?

This paper, written by Sourav Chatterjee and Timothy Sudijono, offers a simple answer: It's all about the complexity of the rules.

Here is the breakdown of their discovery, using some everyday analogies.

1. The "Simple Recipe" vs. The "Maze"

The authors invented a very simple, restricted programming language they call Simple Neural Programs (SNPs). Think of this like a very basic recipe book. You can only use simple steps: "Add this number," "Multiply by that," "If this is true, do that."

  • Low Complexity Data: This is data generated by a short, simple recipe. For example, the rule for "Prime Numbers" is actually quite simple if you write it out as a loop: "Check if any number divides evenly into the target."
  • High Complexity Data: This is data generated by a recipe that is a million pages long, or just random noise.

The authors proved that if the data comes from a short, simple recipe, a specific type of robot (a neural network) can learn it perfectly.
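To make "short, simple recipe" concrete, here is a hedged sketch (in ordinary Python, not the authors' actual SNP syntax) of the primality rule as a few-line loop. The point is that its description length is tiny compared to a lookup table of a million labels.

```python
# A short "recipe" for primality -- a low-complexity rule in the paper's
# sense. This is an illustrative sketch, not the paper's SNP language.

def is_prime(n: int) -> bool:
    """Return True if n is prime, using the simple 'check divisors' rule."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:          # only need to check divisors up to sqrt(n)
        if n % d == 0:         # found a divisor -> not prime
            return False
        d += 1
    return True

print([n for n in range(2, 20) if is_prime(n)])
# -> [2, 3, 5, 7, 11, 13, 17, 19]
```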

2. The "Minimalist Chef" (MDL)

The paper focuses on a specific strategy for training these robots called Minimum Description Length (MDL).

Imagine you are a chef trying to recreate a dish.

  • Approach A: You write a 1,000-page cookbook that lists every single ingredient and step for every single meal you've ever eaten. This is "overfitting." You memorized the specific meals, but you don't understand the concept of cooking.
  • Approach B (MDL): You try to write the shortest possible recipe that explains all the meals you've seen. You look for patterns. "Oh, every time I see a tomato, I add basil."

The paper shows that if the "true" rule behind the data is simple (like the Prime Number rule), the shortest possible recipe (the MDL network) will almost certainly be the correct one. Because it's the shortest, it hasn't wasted space memorizing random noise; it has found the underlying pattern.
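The "Minimalist Chef" strategy can be sketched as a toy selection rule: among all candidate rules that fit the training data exactly, pick the one with the shortest description. (The names and the crude bit-counting below are ours, not the paper's.)

```python
# Toy illustration of MDL selection: shortest description among perfect fits.

def description_length(rule_source: str) -> int:
    # crude proxy: bits needed to write the rule down
    return len(rule_source.encode("utf-8")) * 8

def mdl_select(candidates, data):
    """candidates: list of (source_text, predict_fn); data: (x, y) pairs."""
    interpolators = [
        (src, f) for src, f in candidates
        if all(f(x) == y for x, y in data)      # must fit every example
    ]
    # among rules that fit perfectly, the shortest description wins
    return min(interpolators, key=lambda c: description_length(c[0]))

data = [(2, 1), (3, 1), (4, 0), (5, 1), (9, 0)]
candidates = [
    # Approach A: memorize every example (long description)
    ("lookup table " + str(dict(data)), lambda x, t=dict(data): t[x]),
    # Approach B: a short rule that explains the pattern
    ("x is prime", lambda x: int(x > 1 and all(x % d for d in range(2, x)))),
]
src, rule = mdl_select(candidates, data)
print(src)   # the short "x is prime" rule beats the long lookup table
```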

3. The Magic Translation

One of the coolest parts of the paper is the "translation."
The authors showed that any simple recipe (SNP) can be perfectly translated into a neural network.

  • The Recipe: "If x is divisible by 2, return 0."
  • The Network: A specific arrangement of mathematical layers that does exactly the same thing.

They proved that if a simple recipe exists, there is a "simple" neural network that does the same job. And because the network is "simple" (it has a short description), it generalizes well.
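To give a flavor of this translation, here is a toy ReLU gadget of our own (not the paper's exact construction): for integer inputs, the program step "is x equal to 0?" compiles into two ReLU layers, because ReLU(x) + ReLU(-x) = |x|, and |x| ≥ 1 whenever an integer x ≠ 0.

```python
# A tiny ReLU network computing the indicator 1{x == 0} on integers.
# Illustrative gadget only; the paper's translation is more general.

def relu(z: float) -> float:
    return max(0.0, z)

def is_zero(x: int) -> float:
    """Two-layer ReLU network: 1.0 if x == 0, else 0.0 (integer x)."""
    hidden = [relu(x), relu(-x)]            # layer 1: |x| split into pieces
    return relu(1 - hidden[0] - hidden[1])  # layer 2: fires only when |x| < 1

print([is_zero(x) for x in (-2, -1, 0, 1, 2)])
# -> [0.0, 0.0, 1.0, 0.0, 0.0]
```

Chaining gadgets like this one lets a network execute program steps such as divisibility checks, which is the flavor of the paper's recipe-to-network translation.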

4. The Prime Number Example

Let's look at their prime number example again.

  • The Task: Tell if a number between 1 and 1,000,000 is prime.
  • The Data: You show the robot 1,000 random numbers and their answers.
  • The Result: The "Minimalist Chef" (MDL network) looks at those 1,000 examples. Instead of memorizing them, it finds the shortest code that fits. It discovers the logic of prime numbers.
  • The Prediction: When you show it a new number it has never seen, it guesses correctly with very high probability.

The paper calculates exactly how many examples you need. For prime numbers, you need roughly $(\ln N)^2$ examples. Since the density of primes is low, this is a manageable number. The robot learns the rule, not the list.
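A quick back-of-envelope check of that sample count (our arithmetic, taking the rough $(\ln N)^2$ scaling quoted above at face value):

```python
import math

# How many examples does (ln N)^2 suggest for N = 1,000,000?
N = 1_000_000
samples = math.log(N) ** 2
print(round(samples))   # roughly 191 examples
```

A few hundred examples is indeed well within the 1,000 shown to the robot in the setup above.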

5. What About Noise? (The "Messy Kitchen")

What if you give the robot a list where some answers are wrong? (e.g., you accidentally tell it that 4 is a prime number).

  • The Bad News: If the noise is everywhere, the robot gets confused.
  • The Good News: The paper shows that if the noise is sparse (only a few mistakes), the "Minimalist Chef" is smart enough to ignore the mistakes. It realizes, "Hey, the rule 'check for divisors' fits 99% of the data perfectly. The 1% that doesn't fit must be typos."
  • This is called "Tempered Overfitting." The robot doesn't fail completely (catastrophic overfitting), but it doesn't get 100% perfect either. It finds a "good enough" balance.
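The description-length argument behind ignoring sparse noise can be sketched numerically. The bit costs below are invented for illustration (they are our assumptions, not the paper's numbers); the point is only the comparison: "rule + short exception list" stays far cheaper than memorizing every label.

```python
# Toy accounting of why sparse label noise doesn't break MDL.
# All bit costs are assumed for illustration only.

RULE_BITS = 200          # assumed cost of writing down the true rule
BITS_PER_EXCEPTION = 30  # assumed cost of recording one corrupted label
BITS_PER_LABEL = 21      # assumed cost of one raw (input, label) entry

def dl_rule_plus_exceptions(n_errors: int) -> int:
    return RULE_BITS + n_errors * BITS_PER_EXCEPTION

def dl_memorize(n_examples: int) -> int:
    return n_examples * BITS_PER_LABEL

n_examples, n_errors = 1000, 10   # 1% of the labels are "typos"
print(dl_rule_plus_exceptions(n_errors))  # 500 bits
print(dl_memorize(n_examples))            # 21000 bits: memorizing loses
```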

The Big Takeaway

This paper solves a piece of the "AI mystery" by saying:
Neural networks generalize well not because they are magic, but because the world (or at least the data we care about) is often governed by simple, short rules.

If the truth is a simple sentence, the shortest neural network that fits the data will likely be that sentence. If the truth is a chaotic mess of random noise, no amount of training will help. The "Minimum Description Length" principle acts as a filter, forcing the AI to find the simplest explanation, which usually turns out to be the right one.

In short: If you teach a robot with the goal of finding the simplest explanation for what it sees, and the world actually is simple, the robot will become a genius.
