Imagine you are trying to teach a robot to predict the weather. You give it a set of rules (the laws of physics) and some historical data (temperature, wind speed, etc.). However, your historical data is messy—it's full of static, like a radio signal with a lot of noise.
This paper is about a specific type of AI called a Physics-Informed Neural Network (PINN). Think of a PINN as a very smart student trying to learn a subject (solving complex math problems called partial differential equations, or PDEs) by doing two things at once:
- Studying the textbook: Following the strict laws of physics.
- Memorizing the homework: Looking at the noisy data points you gave them.
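The "two things at once" can be written as a single training objective: a physics loss (how badly the network violates the equation) plus a data loss (how far it is from the noisy measurements). Here is a minimal NumPy sketch, using a tiny untrained toy network and the illustrative equation u''(x) = -sin(x); the network size, point counts, and noise level are all assumptions for illustration, not the paper's setup:

```python
import numpy as np

# Hedged sketch of a two-part PINN loss. We use a tiny random MLP and
# estimate the PDE residual u''(x) - f(x) with finite differences
# instead of automatic differentiation, to keep the example self-contained.
rng = np.random.default_rng(0)

# Tiny MLP u_theta: R -> R (weights are random, i.e. untrained)
W1, b1 = rng.normal(size=(16, 1)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

def u(x):
    # Network prediction at a batch of 1-D points x
    h = np.tanh(W1 @ x[None, :] + b1[:, None])
    return (W2 @ h + b2[:, None]).ravel()

def pinn_loss(x_col, x_obs, y_obs, h=1e-3):
    # "Studying the textbook": residual of the PDE u'' = -sin(x)
    u_xx = (u(x_col + h) - 2 * u(x_col) + u(x_col - h)) / h**2
    physics = np.mean((u_xx - (-np.sin(x_col))) ** 2)
    # "Memorizing the homework": misfit to the noisy observations
    data = np.mean((u(x_obs) - y_obs) ** 2)
    return physics + data

x_col = rng.uniform(0, np.pi, 64)                  # collocation points
x_obs = rng.uniform(0, np.pi, 16)                  # measurement locations
y_obs = np.sin(x_obs) + 0.1 * rng.normal(size=16)  # noisy data
print(f"total loss: {pinn_loss(x_col, x_obs, y_obs):.3f}")
```

A real PINN would now minimize this loss by gradient descent; the paper's question is about what that minimum looks like when the data term is noisy.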
The big question the authors asked is: "If the homework data is really messy (noisy), does it help to just give the student more homework pages, or do we need to make the student smarter?"
The Big Discovery: "Bigger Brains" are Needed for "Messy Data"
The authors found a surprising rule: If your data is noisy, simply adding more data points doesn't help much unless you also make the AI model much bigger.
Here is the analogy to explain why:
The Analogy: The Noisy Concert Hall
Imagine you are in a huge concert hall trying to hear a single violinist (the true solution to the math problem).
- The Noise: The crowd is shouting, coughing, and clapping (this is your noisy data).
- The Small AI: A small AI is like a person with average hearing. If the crowd is loud, they can't hear the violinist, no matter how many times you tell them, "Listen to the violin!" They get overwhelmed by the noise.
- The Large AI: A large AI is like a super-sensitive hearing aid with a massive processor. It can filter out the crowd noise and isolate the violin.
The Paper's Finding:
If you have a small AI (a small neural network) and you give it 1,000 noisy data points, it will fail. It will just memorize the crowd noise.
However, if you grow that same network to, say, 10,000 parameters (make it "bigger" and more complex), it suddenly becomes capable of filtering out that same noise and finding the violin.
The "Free Lunch" Myth:
In machine learning, people often hope that "more data = better results" automatically. This paper says: No. If the data is dirty, more data is just more dirt. You cannot get a clean answer from dirty data unless your "filter" (the model) is big enough to handle the mess.
The "Threshold" Concept
The authors discovered a critical threshold.
- Below the threshold: If your AI is too small, adding more noisy data is useless. The error stays high.
- Above the threshold: Once you cross a certain size (make the network wide enough), the AI suddenly "clicks." It can start ignoring the noise and learning the true pattern.
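This threshold behavior can be mimicked in a few lines with ordinary regression — a deliberately simplified stand-in for PINNs, not the paper's experiment. The feature counts, noise level, and ridge penalty below are all assumed for illustration: a model with too few features cannot represent the true signal at all, while a much larger one can, even though both see the same noisy data.

```python
import numpy as np

# Illustrative sketch: fit noisy samples of a smooth "true solution" with
# random-feature ridge regression, once with a tiny model and once with a
# large one, then measure each model's error against the CLEAN truth.
rng = np.random.default_rng(0)

def fit_and_test(n_features, n_train=400, noise=0.3, lam=1e-3):
    x_train = rng.uniform(0, 1, n_train)
    y_train = np.sin(2 * np.pi * x_train) + noise * rng.normal(size=n_train)
    # Random ReLU features: phi_j(x) = max(0, w_j * x + b_j)
    w = rng.normal(size=n_features)
    b = rng.uniform(-1, 1, n_features)
    phi = lambda x: np.maximum(0.0, np.outer(x, w) + b)
    A = phi(x_train)
    # Ridge regression: theta = (A^T A + lam * I)^{-1} A^T y
    theta = np.linalg.solve(A.T @ A + lam * np.eye(n_features), A.T @ y_train)
    # Error vs the noiseless truth: did the model filter out the noise?
    x_test = np.linspace(0, 1, 500)
    resid = phi(x_test) @ theta - np.sin(2 * np.pi * x_test)
    return float(np.sqrt(np.mean(resid ** 2)))

err_small = fit_and_test(n_features=4)    # "below the threshold"
err_large = fit_and_test(n_features=400)  # "above the threshold"
print(f"small model, error vs truth: {err_small:.3f}")
print(f"large model, error vs truth: {err_large:.3f}")
```

With typical draws, the small model's error against the clean signal stays high no matter how much noisy data it sees, while the large model's error drops well below the noise level.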
It's like trying to see a star on a foggy night.
- If you have a tiny telescope (small model), no matter how long you stare, you just see fog.
- If you switch to a giant, high-powered telescope (large model), suddenly the star becomes visible, even though the fog (noise) is still there.
What They Tested
The researchers didn't just talk about this; they tested it on three very difficult real-world problems:
- Navier-Stokes: Modeling how fluids (like water or air) move.
- Poisson: Modeling things like heat distribution or electric fields.
- Hamilton-Jacobi-Bellman (HJB): A complex equation used in robotics and finance to make optimal decisions.
In all these tests, they found the same pattern: The small networks failed to learn anything useful from the noisy data. Only the "bigger" networks could successfully learn the solution.
The Takeaway for Everyone
If you are building an AI to solve real-world problems (where data is never perfect), don't just throw more data at a small model.
Instead, you need to scale up your model. You need a "bigger brain" to handle the "messy world." If you want your AI to be accurate in a noisy environment, you must pay the price of making the model larger. There is no free lunch; to filter out the noise, you need a bigger filter.