Lattice-based Deep Neural Networks: Regularity and Tailored Regularization

This survey reviews the application of lattice rules as tailored training points and regularization for Deep Neural Networks, demonstrating that such an approach yields dimension-independent generalization error bounds and superior numerical performance compared to standard ℓ₂ regularization.

Alexander Keller, Frances Y. Kuo, Dirk Nuyens, Ian H. Sloan

Published 2026-03-04

Imagine you are trying to teach a very smart, but very hungry, student (a Deep Neural Network or DNN) how to predict the weather. The student has access to millions of variables: temperature, humidity, wind speed, barometric pressure, cloud cover, etc.

The problem is that the student is easily confused. If you feed them random data points (like asking them to guess the weather based on a random Tuesday in 1998, then a random Tuesday in 2005), they might memorize those specific days but fail to understand the patterns of weather. This is called overfitting. They get an A on the test but fail the real world.

This paper proposes a new way to feed the student data, using a mathematical tool called Lattice Rules, and a new way to discipline the student called Tailored Regularization.

Here is the breakdown in simple terms:

1. The Problem: Randomness vs. Order

Usually, when training AI, we pick data points randomly, like throwing darts at a map.

  • The Dartboard Analogy: If you throw 100 darts randomly at a dartboard, you might end up with a big cluster in the top left and a huge empty hole in the bottom right. You miss parts of the picture.
  • The Lattice Solution: The authors suggest using Lattice Rules. Instead of throwing darts randomly, a simple formula places the points so they cover the board evenly, like a carefully constructed grid of dots on graph paper, with no clusters and no gaps.
    • Why it works: This ensures the student sees every part of the map evenly. They don't miss any "corners" of the problem. In math, this family of methods is called Quasi-Monte Carlo, and it converges faster than random sampling.
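The simplest kind of lattice rule, a rank-1 lattice, really is just one formula: point i is the fractional part of i·z/n for a fixed "generating vector" z. Here is a minimal NumPy sketch; the generating vector below is purely illustrative (real lattice rules use carefully constructed vectors, not this one):

```python
import numpy as np

def rank1_lattice(n, z):
    """Generate n rank-1 lattice points in the unit cube [0,1)^d.

    n : number of points
    z : generating vector of length d (one integer per dimension)
    """
    z = np.asarray(z, dtype=float)
    i = np.arange(n).reshape(-1, 1)   # point indices 0, 1, ..., n-1
    return (i * z / n) % 1.0          # fractional part spreads points evenly

# Illustrative only -- NOT a vector from the paper:
pts = rank1_lattice(8, [1, 3])        # 8 points in 2 dimensions
```

Note that the cost of generating the points is the same in 2 dimensions or 50: just one extra entry in z per dimension.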

2. The Student's "Brain" (The Neural Network)

The neural network is a complex machine with many layers (like a multi-story building).

  • The Smoothness Issue: The authors noticed that for the student to learn the "weather patterns" perfectly, the function they are learning needs to be "smooth" (no sudden, jagged jumps).
  • The Activation Function: This is the "brain switch" inside the network. The paper looks at different switches:
    • Sigmoid: A gentle, S-shaped curve.
    • ReLU: A switch that is off below zero and passes the signal straight through above it (very common, but it has a sharp kink at zero).
    • Swish: A new, flexible switch that can be smooth or sharp depending on a setting.
    • The Discovery: The authors found that if you use a "smooth" switch (like Sigmoid or Swish with the right settings), the student learns better if you also teach them to be disciplined.
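The three switches above are one-liners in NumPy. This is a hedged sketch (the paper may use a different parameterization of swish), but it shows the key point: swish with a finite setting β is smooth, and as β grows it approaches the jagged ReLU:

```python
import numpy as np

def sigmoid(x):
    """Gentle S-shaped curve, smooth everywhere."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Off below zero, pass-through above; sharp kink at 0."""
    return np.maximum(0.0, x)

def swish(x, beta=1.0):
    """Smooth for finite beta; approaches relu as beta -> infinity."""
    return x * sigmoid(beta * x)
```

For example, swish(2.0, beta=50.0) is already almost exactly relu(2.0) = 2.0, while swish(2.0, beta=1.0) stays slightly below it because the curve bends smoothly through zero.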

3. The New Discipline: "Tailored Regularization"

In school, you might have a rule: "Don't write too much." In AI, this is called Regularization. It stops the student from memorizing the training data too perfectly (which leads to bad generalization).

  • Standard Discipline (Old Way): The standard rule is "Don't get too excited." (Mathematically, keep all numbers small). It's a generic "one-size-fits-all" rule.
  • Tailored Discipline (New Way): The authors created a custom rulebook based on the specific "weather" they are trying to predict.
    • The Analogy: Imagine the weather in your city is mostly driven by the wind, but the humidity barely matters.
    • The Old Rule: "Keep all your guesses small."
    • The New Rule: "You can guess big numbers for the wind, but keep the humidity guesses tiny."
    • How they did it: They looked at the mathematical "shape" of the problem they wanted to solve. They then forced the AI's internal numbers (weights) to match that shape. They told the AI: "Your brain is allowed to be complex in the areas where the problem is complex, but simple where the problem is simple."
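One plausible way to encode the "big guesses for wind, tiny guesses for humidity" rule is to give each input coordinate its own importance weight and penalize the network's first-layer weights accordingly. The sketch below is an illustration of this idea, not the paper's exact penalty; the importance vector `gamma` is hypothetical:

```python
import numpy as np

def l2_penalty(W, lam):
    """Standard discipline: one global strength lam for every weight."""
    return lam * np.sum(W ** 2)

def tailored_penalty(W, gamma):
    """Tailored discipline (sketch): per-coordinate strengths.

    W     : (d, width) first-layer weight matrix, row j for input coordinate j
    gamma : (d,) importance of each coordinate -- large gamma[j] means
            coordinate j matters ("wind"), so its weights are penalized less;
            small gamma[j] ("humidity") makes big weights expensive.
    """
    gamma = np.asarray(gamma, dtype=float)
    return np.sum((W ** 2) / gamma[:, None])   # broadcast gamma over the width

# Two inputs, hidden width 3: coordinate 0 is "wind", coordinate 1 is "humidity".
W = np.ones((2, 3))
standard = l2_penalty(W, 0.5)                      # 3.0: treats both the same
tailored = tailored_penalty(W, [10.0, 0.1])        # 30.3: humidity dominates
```

In the tailored penalty the humidity row contributes 100x more per unit weight than the wind row, which is exactly the custom rulebook described above: the optimizer is free to use big wind weights but is pushed hard toward tiny humidity weights.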

4. The Results: Why This Matters

The authors tested this numerically on a problem with 50 input variables (dimensions).

  • The Test: They compared the "Old Discipline" (Standard) vs. the "New Discipline" (Tailored) using the "Grid" data (Lattice).
  • The Outcome:
    • With the Old Discipline, the student struggled to learn the pattern, even with a lot of data.
    • With the New Discipline, the student learned the pattern much faster and more accurately.
    • The "Magic" Part: Usually, as you add more variables (more dimensions), AI gets exponentially harder to train. Here the authors prove generalization error bounds that do not depend on the number of variables: the "difficulty" doesn't explode as dimensions grow. The student can handle high-dimensional problems just as well as low-dimensional ones, if you give them the right grid of data and the right custom rules.

Summary of the "Magic"

  1. Don't throw darts randomly. Use a perfect grid (Lattice Rules) to show the AI the whole picture.
  2. Don't use a generic rulebook. Create a custom rulebook (Tailored Regularization) that matches the specific shape of the problem you are solving.
  3. The Result: You get a smarter AI that learns faster, makes fewer mistakes, and doesn't get overwhelmed by having too many variables to consider.

In a nutshell: The paper teaches us that to train a super-smart AI, you shouldn't just throw random data at it and hope for the best. You should give it a perfectly organized library of data and a custom-made syllabus that fits the subject matter perfectly.
