Neural Networks Generalize on Low Complexity Data

This paper demonstrates that feedforward ReLU neural networks, when selected via minimum description length (MDL) to interpolate data generated by a simple programming language, generalize to unseen examples with high probability. In effect, they discover complex patterns like primality testing without explicit design.

Sourav Chatterjee, Timothy Sudijono

Published 2026-03-03

Imagine you are trying to teach a robot to recognize prime numbers (like 2, 3, 5, 7, 11). You show it a list of numbers and tell it "Yes" or "No" for each one.

In the real world, we often use massive, super-complex robots (neural networks) with millions of gears and levers to do this. Surprisingly, even if you give these robots a huge list of numbers and let them memorize the answers perfectly, they often still get new, unseen numbers right. But sometimes, if you give them a list of random gibberish, they memorize that too and fail completely.

The big mystery: Why do they work on "real" data but fail on "random" data?

This paper, written by Sourav Chatterjee and Timothy Sudijono, offers a simple answer: It's all about the complexity of the rules.

Here is the breakdown of their discovery, using some everyday analogies.

1. The "Simple Recipe" vs. The "Maze"

The authors invented a very simple, restricted programming language they call Simple Neural Programs (SNPs). Think of this like a very basic recipe book. You can only use simple steps: "Add this number," "Multiply by that," "If this is true, do that."

  • Low Complexity Data: This is data generated by a short, simple recipe. For example, the rule for "Prime Numbers" is actually quite simple if you write it out as a loop: "Check if any number divides evenly into the target."
  • High Complexity Data: This is data generated by a recipe that is a million pages long, or just random noise.

The authors proved that if the data comes from a short, simple recipe, a specific type of robot (a neural network) can learn it perfectly.
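To make "short, simple recipe" concrete, here is a hedged sketch (in ordinary Python, not the authors' actual SNP syntax) of the primality rule as a few-line loop. The point is that its description length is tiny compared to a lookup table of a million labels.

```python
# A short "recipe" for primality -- a low-complexity rule in the paper's
# sense. This is an illustrative sketch, not the paper's SNP language.

def is_prime(n: int) -> bool:
    """Return True if n is prime, using the simple 'check divisors' rule."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:          # only need to check divisors up to sqrt(n)
        if n % d == 0:         # found a divisor -> not prime
            return False
        d += 1
    return True

print([n for n in range(2, 20) if is_prime(n)])
# -> [2, 3, 5, 7, 11, 13, 17, 19]
```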

2. The "Minimalist Chef" (MDL)

The paper focuses on a specific strategy for training these robots called Minimum Description Length (MDL).

Imagine you are a chef trying to recreate a dish.

  • Approach A: You write a 1,000-page cookbook that lists every single ingredient and step for every single meal you've ever eaten. This is "overfitting." You memorized the specific meals, but you don't understand the concept of cooking.
  • Approach B (MDL): You try to write the shortest possible recipe that explains all the meals you've seen. You look for patterns. "Oh, every time I see a tomato, I add basil."

The paper shows that if the "true" rule behind the data is simple (like the Prime Number rule), the shortest possible recipe (the MDL network) will almost certainly be the correct one. Because it's the shortest, it hasn't wasted space memorizing random noise; it has found the underlying pattern.
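The "Minimalist Chef" strategy can be sketched as a toy selection rule: among all candidate rules that fit the training data exactly, pick the one with the shortest description. (The names and the crude bit-counting below are ours, not the paper's.)

```python
# Toy illustration of MDL selection: shortest description among perfect fits.

def description_length(rule_source: str) -> int:
    # crude proxy: bits needed to write the rule down
    return len(rule_source.encode("utf-8")) * 8

def mdl_select(candidates, data):
    """candidates: list of (source_text, predict_fn); data: (x, y) pairs."""
    interpolators = [
        (src, f) for src, f in candidates
        if all(f(x) == y for x, y in data)      # must fit every example
    ]
    # among rules that fit perfectly, the shortest description wins
    return min(interpolators, key=lambda c: description_length(c[0]))

data = [(2, 1), (3, 1), (4, 0), (5, 1), (9, 0)]
candidates = [
    # Approach A: memorize every example (long description)
    ("lookup table " + str(dict(data)), lambda x, t=dict(data): t[x]),
    # Approach B: a short rule that explains the pattern
    ("x is prime", lambda x: int(x > 1 and all(x % d for d in range(2, x)))),
]
src, rule = mdl_select(candidates, data)
print(src)   # the short "x is prime" rule beats the long lookup table
```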

3. The Magic Translation

One of the coolest parts of the paper is the "translation."
The authors showed that any simple recipe (SNP) can be perfectly translated into a neural network.

  • The Recipe: "If x is divisible by 2, return 0."
  • The Network: A specific arrangement of mathematical layers that does exactly the same thing.

They proved that if a simple recipe exists, there is a "simple" neural network that does the same job. And because the network is "simple" (it has a short description), it generalizes well.
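To give a flavor of this translation, here is a toy ReLU gadget of our own (not the paper's exact construction): for integer inputs, the program step "is x equal to 0?" compiles into two ReLU layers, because ReLU(x) + ReLU(-x) = |x|, and |x| ≥ 1 whenever an integer x ≠ 0.

```python
# A tiny ReLU network computing the indicator 1{x == 0} on integers.
# Illustrative gadget only; the paper's translation is more general.

def relu(z: float) -> float:
    return max(0.0, z)

def is_zero(x: int) -> float:
    """Two-layer ReLU network: 1.0 if x == 0, else 0.0 (integer x)."""
    hidden = [relu(x), relu(-x)]            # layer 1: |x| split into pieces
    return relu(1 - hidden[0] - hidden[1])  # layer 2: fires only when |x| < 1

print([is_zero(x) for x in (-2, -1, 0, 1, 2)])
# -> [0.0, 0.0, 1.0, 0.0, 0.0]
```

Chaining gadgets like this one lets a network execute program steps such as divisibility checks, which is the flavor of the paper's recipe-to-network translation.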

4. The Prime Number Example

Let's look at their prime number example again.

  • The Task: Tell if a number between 1 and 1,000,000 is prime.
  • The Data: You show the robot 1,000 random numbers and their answers.
  • The Result: The "Minimalist Chef" (MDL network) looks at those 1,000 examples. Instead of memorizing them, it finds the shortest code that fits. It discovers the logic of prime numbers.
  • The Prediction: When you show it a new number it has never seen, it guesses correctly with very high probability.

The paper calculates exactly how many examples you need. For prime numbers, you need roughly $(\ln N)^2$ examples. Since the density of primes is low, this is a manageable number. The robot learns the rule, not the list.
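A quick back-of-envelope check of that sample count (our arithmetic, taking the rough $(\ln N)^2$ scaling quoted above at face value):

```python
import math

# How many examples does (ln N)^2 suggest for N = 1,000,000?
N = 1_000_000
samples = math.log(N) ** 2
print(round(samples))   # roughly 191 examples
```

A few hundred examples is indeed well within the 1,000 shown to the robot in the setup above.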

5. What About Noise? (The "Messy Kitchen")

What if you give the robot a list where some answers are wrong? (e.g., you accidentally tell it that 4 is a prime number).

  • The Bad News: If the noise is everywhere, the robot gets confused.
  • The Good News: The paper shows that if the noise is sparse (only a few mistakes), the "Minimalist Chef" is smart enough to ignore the mistakes. It realizes, "Hey, the rule 'check for divisors' fits 99% of the data perfectly. The 1% that doesn't fit must be typos."
  • This is called "Tempered Overfitting." The robot doesn't fail completely (catastrophic overfitting), but it doesn't get 100% perfect either. It finds a "good enough" balance.
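The description-length argument behind ignoring sparse noise can be sketched numerically. The bit costs below are invented for illustration (they are our assumptions, not the paper's numbers); the point is only the comparison: "rule + short exception list" stays far cheaper than memorizing every label.

```python
# Toy accounting of why sparse label noise doesn't break MDL.
# All bit costs are assumed for illustration only.

RULE_BITS = 200          # assumed cost of writing down the true rule
BITS_PER_EXCEPTION = 30  # assumed cost of recording one corrupted label
BITS_PER_LABEL = 21      # assumed cost of one raw (input, label) entry

def dl_rule_plus_exceptions(n_errors: int) -> int:
    return RULE_BITS + n_errors * BITS_PER_EXCEPTION

def dl_memorize(n_examples: int) -> int:
    return n_examples * BITS_PER_LABEL

n_examples, n_errors = 1000, 10   # 1% of the labels are "typos"
print(dl_rule_plus_exceptions(n_errors))  # 500 bits
print(dl_memorize(n_examples))            # 21000 bits: memorizing loses
```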

The Big Takeaway

This paper solves a piece of the "AI mystery" by saying:
Neural networks generalize well not because they are magic, but because the world (or at least the data we care about) is often governed by simple, short rules.

If the truth is a simple sentence, the shortest neural network that fits the data will likely be that sentence. If the truth is a chaotic mess of random noise, no amount of training will help. The "Minimum Description Length" principle acts as a filter, forcing the AI to find the simplest explanation, which usually turns out to be the right one.

In short: If you teach a robot with the goal of finding the simplest explanation for what it sees, and the world actually is simple, the robot will become a genius.
