The Inductive Bias of Convolutional Neural Networks: Locality and Weight Sharing Reshape Implicit Regularization

This paper demonstrates that the architectural inductive biases of locality and weight sharing in convolutional neural networks fundamentally alter implicit regularization by coupling learned filters to low-dimensional patch manifolds, thereby enabling generalization on high-dimensional spherical data where fully connected networks provably fail.

Tongtong Liang, Esha Singh, Rahul Parhi, Alexander Cloninger, Yu-Xiang Wang

Published 2026-03-06

Here is an explanation of the paper "The Inductive Bias of Convolutional Neural Networks" using simple language and creative analogies.

The Big Picture: Why Do CNNs Win?

Imagine you are trying to teach a robot to recognize cats in photos. You have two types of robots:

  1. The "All-Seeing" Robot (Fully Connected Network): This robot looks at the entire photo as one giant, messy pile of pixels. It tries to memorize every single pixel's relationship to every other pixel.
  2. The "Local Detective" Robot (Convolutional Network/CNN): This robot uses a magnifying glass. It only looks at small, local patches of the image (like a cat's ear or a whisker) at a time. It uses the same magnifying glass (filter) to scan the whole picture.

The Mystery: Both robots are incredibly smart (they have millions of parameters) and can memorize a library of random noise perfectly. Yet, when you show them a new photo, the "Local Detective" (CNN) usually figures out it's a cat, while the "All-Seeing" Robot (FCN) gets confused and fails.

The Paper's Answer: This paper explains why the Local Detective is better. It's not just about the data; it's about how the robot's brain is built. The paper proves that the "Local Detective" has a built-in superpower called Implicit Regularization, which acts like a natural filter against overfitting, but only because of two specific design choices: Locality (looking at small patches) and Weight Sharing (using the same tool everywhere).


The Core Concept: The "Edge of Stability"

To understand the paper, we need to understand how these robots learn. They learn by taking steps down a hill (gradient descent) to find the lowest point (the best answer).

Usually, if you take steps that are too big, you overshoot the bottom and bounce around wildly. But recently, scientists noticed something weird: if you take steps that are just the right size (large but not too large), the robot settles into a special zone called the "Edge of Stability."

Think of this like a tightrope walker.

  • If they take tiny, cautious steps, they stay perfectly stable but crawl along, taking forever to cross.
  • If they charge ahead too fast, they fly off the rope entirely.
  • But at a specific "edge" speed, they wobble with every step yet never quite fall, and they still cross quickly.

The paper argues that when a robot learns at this "Edge," it is forced to find solutions that are stable. If a solution is too "jittery" or sensitive to tiny changes, the robot can't stay on the tightrope. It gets kicked off.
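The stability threshold behind this picture can be made concrete. Here is a minimal sketch (illustrative toy code, not from the paper) of gradient descent on a one-dimensional quadratic: classical analysis says a step size is stable on curvature λ only when step_size × λ < 2, and the "Edge of Stability" is the regime where training hovers right at that boundary.

```python
def gradient_descent(sharpness, step_size, steps=50, x0=1.0):
    """Run gradient descent on the 1-D quadratic f(x) = sharpness * x^2 / 2.

    The update x <- x - step_size * sharpness * x converges only when
    step_size * sharpness < 2; at exactly 2 it oscillates forever
    ("the edge"), and beyond 2 it diverges.
    """
    x = x0
    for _ in range(steps):
        x -= step_size * sharpness * x  # gradient of f is sharpness * x
    return x

# Stable: step size well below the 2/sharpness threshold -> x shrinks toward 0.
assert abs(gradient_descent(sharpness=4.0, step_size=0.1)) < 1e-3
# Unstable: step size above 2/sharpness -> x blows up.
assert abs(gradient_descent(sharpness=4.0, step_size=0.6)) > 1e3
```

A solution that is too "jittery" (too sharp, high curvature) simply cannot be held onto at a large step size; the iterates get kicked away, which is the mechanism behind the tightrope intuition.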

The Problem: The "Curse of Dimensionality"

Here is where the "All-Seeing" Robot (FCN) fails.
Imagine every photo as a point on a high-dimensional sphere (a ball with thousands of independent directions to move in). The paper shows that for the "All-Seeing" Robot, the geometry of this sphere is a trap.

  • The Trap: In high dimensions, data points are so far apart that the robot can easily find a "jittery" solution that memorizes the training data perfectly but fails on new data.
  • The Result: Even at the "Edge of Stability," the "All-Seeing" Robot can't find a good general answer. It's like trying to find a needle in a haystack that keeps moving. The math says it cannot generalize well on spherical data.
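The "data points are far apart" claim can be checked numerically. This small sketch (an illustration of distance concentration, not code from the paper) samples random points on a high-dimensional unit sphere and shows that every pair ends up roughly the same distance apart, about √2:

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a point uniformly on the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_pairwise_distance(dim, n_points=30, seed=0):
    """Average Euclidean distance over all pairs of random sphere points."""
    rng = random.Random(seed)
    pts = [random_unit_vector(dim, rng) for _ in range(n_points)]
    dists = []
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    return sum(dists) / len(dists)

# In high dimensions, random points on the sphere are all roughly
# sqrt(2) apart -- everyone is far from everyone else.
assert abs(mean_pairwise_distance(dim=1000) - math.sqrt(2)) < 0.05
```

With every training point equally far from every other, a memorizing "jittery" fit is easy to construct and tells the robot nothing about new points.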

The Solution: How CNNs Break the Trap

This is where the "Local Detective" (CNN) shines. The paper proves that Locality and Weight Sharing change the rules of the game.

1. Locality: The "Patch" Strategy

Instead of looking at the whole image, the CNN breaks the image into small patches (like a puzzle).

  • Analogy: Imagine you are trying to guess the weather.
    • FCN: Looks at the entire globe at once. It's too much data; it gets confused by the sheer size.
    • CNN: Looks at a 3x3 inch square of the sky. It sees a cloud. It looks at another square. It sees a cloud.
  • The Magic: Because the patches are small, the "Local Detective" doesn't see the scary, high-dimensional geometry of the whole image. It sees a simple, low-dimensional world. The math shows that as the image gets bigger (higher dimensions), the CNN actually gets better at generalizing. This is called the "Blessing of Dimensionality."

2. Weight Sharing: The "Same Tool Everywhere"

The CNN uses the same filter (the same magnifying glass) to scan every part of the image.

  • Analogy: Imagine a teacher grading 100 essays.
    • FCN: The teacher uses a different grading rubric for every single essay. They can easily cheat by memorizing the specific quirks of each essay.
    • CNN: The teacher uses one single rubric for all 100 essays.
  • The Magic: Because the teacher must use the same rubric, they can't cheat. If they try to memorize one essay, they break the rules for the others. This forces the teacher to learn the true rules of grammar (the underlying pattern) rather than memorizing specific words.
  • In the Paper: This "coupling" forces the robot to learn features that work for the entire distribution of patches, not just the specific training examples. It prevents the robot from finding those "jittery" solutions that only work for the training data.
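Here is a toy illustration of weight sharing (the 1-D signal and hand-picked kernel are invented for illustration, not from the paper): one shared 3-number filter stands in for over a million dense weights, and the same filter detects the same pattern wherever it appears.

```python
def conv1d(signal, kernel):
    """Apply the SAME kernel at every position (weight sharing)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

d, k = 1024, 3
fc_params = d * d    # dense layer: a separate weight for every input-output pair
conv_params = k      # conv layer: one shared filter, reused at every position
assert fc_params == 1_048_576
assert conv_params == 3

# The shared "edge detector" fires wherever the pattern occurs,
# positive on rising edges and negative on falling ones.
signal = [0, 0, 1, 1, 1, 0, 0]
assert conv1d(signal, kernel=[-1, 0, 1]) == [1, 1, 0, -1, -1]
```

Because every position is scored by the same few numbers, tweaking the filter to memorize one location immediately changes the answer at every other location, which is the "one rubric for all essays" constraint in code.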

The "Natural Image" Connection

The paper also looked at real photos (like CIFAR-10). They found that natural images have a special structure:

  • If you take a random patch from a photo, it usually looks like "grass," "sky," or "skin."
  • These patches are clustered. They aren't scattered randomly in space.
  • Because the patches are clustered, the "Local Detective" can easily find a stable solution that fits these clusters. The "All-Seeing" Robot, looking at the whole mess, can't see these clusters and gets lost.
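A hypothetical sketch of this clustering (the "sky" and "grass" prototypes are invented stand-ins, not data from the paper): if natural-image patches are roughly a prototype plus small noise, then every patch sits close to one of a handful of cluster centers, so a small set of filters can cover them all.

```python
import math
import random

def nearest_center_distance(patch, centers):
    """Distance from a patch to its closest cluster center."""
    return min(math.sqrt(sum((a - b) ** 2 for a, b in zip(patch, c)))
               for c in centers)

rng = random.Random(42)
# Two prototype 3x3 patches (flattened), standing in for recurring
# textures like "sky" and "grass" -- hypothetical, for illustration.
sky = [0.9] * 9
grass = [0.2] * 9
centers = [sky, grass]

# Natural-image-like patches: a prototype plus a little noise.
patches = [[v + rng.gauss(0, 0.05) for v in rng.choice(centers)]
           for _ in range(200)]

avg = sum(nearest_center_distance(p, centers) for p in patches) / len(patches)
# Every patch hugs one of the few centers, so the average distance
# to the nearest center stays small compared to the patch scale.
assert avg < 0.5
```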

Summary: The Takeaway

The paper solves a long-standing mystery: Why do Convolutional Neural Networks (CNNs) generalize so well while other networks struggle?

  1. The Environment: Learning happens at the "Edge of Stability," which forces models to avoid "jittery" solutions.
  2. The Failure: For standard networks (FCNs), the high-dimensional nature of data makes it easy to find "jittery" solutions that cheat the system.
  3. The Fix: CNNs use Locality (looking at small pieces) and Weight Sharing (using the same tool everywhere).
  4. The Result: These two features force the network to ignore the scary, high-dimensional complexity of the whole image and focus on the simple, clustered patterns of small patches. This allows them to generalize well, even when the data is huge and complex.

In short: CNNs don't just "learn better"; their architecture forces them to learn in a way that is naturally resistant to memorization, turning the "Curse of Dimensionality" into a "Blessing."