Original authors: Jose Marie Antonio Miñoza, Erika Fille T. Legara, Christopher P. Monterola

Published 2026-05-29

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Idea: What is a Neural Network Actually Doing?

Imagine you have a black box (a neural network) that takes an input (like a picture of a cat) and gives you an output (the word "cat"). Usually, we think of this box as a complex machine with millions of gears (weights) turning to solve a puzzle.

This paper argues that the machine isn't just solving a puzzle: under the hood, it is computing the solution of a specific physics equation. That equation is a Hamilton–Jacobi equation.

To understand this, the authors introduce a single "magic knob" called $\epsilon$ (epsilon). Turning this knob changes how the network behaves, revealing four different ways to look at the same object:

The Smooth Network ( $\epsilon > 0$ ): The network acts like a gentle, flowing river. It considers all possibilities at once, giving soft, probabilistic answers (like "90% cat, 10% dog").
The Tropical Network ( $\epsilon = 0$ ): If you turn the knob all the way down, the river freezes into a single, sharp path. The network stops guessing and picks the single "best" option, acting like a rigid decision tree.
The Physics Equation: The network is actually calculating the solution to a viscous Hamilton–Jacobi equation, which a change of variables (the Hopf–Cole transformation) turns into a heat equation (how heat spreads).
The Optimization Problem: The network is solving a math problem to find the shortest or cheapest path.

The paper claims these aren't just similar ideas; they are exactly the same thing viewed through different lenses.
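To make the "knob" concrete, here is a minimal sketch in plain Python of the Log-Sum-Exp function the paper builds on. The three scores are made-up numbers standing in for a layer's pre-activations, and the helper name `lse` is ours, not the paper's. The point it illustrates is a standard fact: the smooth answer never sits more than $\epsilon \log n$ above the hard maximum of $n$ scores, so it snaps to the sharp answer as $\epsilon$ shrinks.

```python
import math

def lse(values, eps):
    """Smooth maximum eps * log(sum(exp(v / eps))), computed stably."""
    m = max(values)
    return m + eps * math.log(sum(math.exp((v - m) / eps) for v in values))

# Hypothetical pre-activations W_j . x + b_j for three neurons.
scores = [1.0, 2.5, 2.0]

for eps in (1.0, 0.1, 0.01):
    gap = lse(scores, eps) - max(scores)
    # The gap always lies between 0 and eps * log(len(scores)).
    print(f"eps={eps}: smooth max={lse(scores, eps):.6f}, gap={gap:.3e}")
```

At large $\epsilon$ every score contributes to the output; at small $\epsilon$ only the winner matters, which is the "river freezing into a single path" picture above.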

The Core Analogy: The "Heat Map" of Decisions

Think of the neural network as a heat map on a landscape.

The Input: You drop a hot stone (your data point) onto the map.
The Weights: The shape of the landscape (hills and valleys) is determined by the network's weights.
The Viscosity ( $\epsilon$ ): This is the "thickness" of the air.
- High Viscosity (Thick Air): The heat spreads out smoothly. The network is "soft" and considers many paths. It's like walking through deep mud; you can't rush, so you take a smooth, averaged route.
- Zero Viscosity (Thin Air): The heat doesn't spread; it travels in a straight line to the lowest point. The network becomes "hard" and picks the absolute best path instantly.

The paper proves that the Log-Sum-Exp (LSE) activation function (a common building block in modern AI) is, after a logarithmic change of variables, the exact mathematical formula for how heat spreads in this specific type of physics problem.

How Different Architectures Fit In

The authors show that different types of neural networks are just different ways of simulating this same physics process:

Standard Feedforward Networks: These are like taking a snapshot of the heat spreading at a specific moment. Each layer is a step in time.
Residual Networks (ResNets): These are like a movie of the heat spreading. Instead of jumping from one snapshot to the next, they simulate the continuous flow of the "characteristics" (the paths the heat takes).
Transformers (like the ones powering chatbots): The "Attention" mechanism (how the model focuses on certain words) is actually calculating the average position of the heat based on a probability distribution. It's a "soft" version of picking the nearest neighbor.
Recurrent Networks (RNNs/LSTMs): These are like a river flowing over time, where the water's path depends on the current and the shape of the riverbed.
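The attention item above can be illustrated with a small sketch. This is generic scaled dot-product attention written in plain Python, not code from the paper; the keys, values, and query are invented toy numbers. It shows the "soft version of picking the best match" idea: at the usual temperature the output is a weighted average of all values, and at a near-zero temperature it collapses onto the value of the single best-matching key.

```python
import math

def softmax(scores, eps):
    """Gibbs weights exp(s / eps) / sum(exp(s / eps)), computed stably."""
    m = max(scores)
    w = [math.exp((s - m) / eps) for s in scores]
    total = sum(w)
    return [wi / total for wi in w]

def attend(query, keys, values, eps):
    """Average of the values, weighted by how well each key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores, eps)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical toy data: three 2D keys, each paired with a value vector.
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.2]

soft = attend(query, keys, values, eps=math.sqrt(2))  # usual sqrt(dimension) scaling
hard = attend(query, keys, values, eps=1e-3)          # temperature close to zero
print("soft:", soft)
print("hard:", hard)
```

The first key has the largest dot product with the query, so the near-zero-temperature output is essentially its value vector.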

Why Does This Matter? (The "So What?")

By realizing that a neural network is just a physics equation, the authors can use math from physics to predict how AI behaves without needing to run thousands of experiments.

1. The "Goldilocks" Temperature
The paper calculates the perfect setting for that "magic knob" ( $\epsilon$ ).

If the knob is too low (too sharp), the network is brittle and can be easily tricked by tiny changes (adversarial attacks).
If the knob is too high (too soft), the network is too fuzzy and can't learn details.
The Result: There is a specific "sweet spot" based on how wide the network is and how complex the data is. Setting the knob here gives the best balance between learning fast and being robust.

2. Why Big Models Work (Scaling Laws)
We know that making models bigger usually makes them smarter. This paper explains why, using a concept called "intrinsic dimension."

Imagine the data (like images of cats) lives on a crumpled piece of paper floating in a huge 3D room. Even though the room is big, the paper is only 2D.
The paper shows that the number of neurons needed to learn the data depends on the size of that "crumpled paper" (the intrinsic dimension), not the size of the room. This explains why we see specific mathematical patterns in how performance improves as we add more data or parameters.

3. "Hallucinations" are Predictable
When an AI makes things up (hallucinates), it's often because it's looking at data it hasn't seen before.

The paper shows that in these "unknown" areas, the network's behavior is mathematically predictable. It will essentially "slide" down the nearest hill it knows, extrapolating linearly. It's not magic; it's just the physics of the equation running out of data to guide it.
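A rough one-dimensional sketch of that "sliding" behaviour, under our own assumptions: the weights and biases below are invented, and the layer is a plain LSE layer rather than a trained model. Far from the region where the pieces compete, one affine piece dominates and the layer output becomes indistinguishable from a straight line.

```python
import math

def lse_layer(x, slopes, intercepts, eps=1.0):
    """One-dimensional LSE layer: eps * log(sum_j exp((w_j * x + b_j) / eps))."""
    pieces = [w * x + b for w, b in zip(slopes, intercepts)]
    m = max(pieces)
    return m + eps * math.log(sum(math.exp((p - m) / eps) for p in pieces))

# Hypothetical weights and biases for three neurons.
slopes = [-1.0, 0.5, 2.0]
intercepts = [0.0, 1.0, -3.0]

for x in (0.0, 5.0, 50.0):
    steepest = 2.0 * x - 3.0  # the affine piece with the largest slope
    print(f"x={x}: layer={lse_layer(x, slopes, intercepts):.6f}, affine piece={steepest:.6f}")
```

Near the origin the two numbers differ noticeably; by $x = 50$ they agree to many decimal places, and the slope between neighbouring points is just the slope of the dominant piece.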

4. Training is Like Backtracking
When we train a network (backpropagation), we are essentially running a physics simulation backward.

The paper proves that the algorithm we use to update the weights is mathematically identical to the backward "co-state" (adjoint) equation of the Pontryagin Maximum Principle, a classical result from optimal control theory. It's not a heuristic guess; it's the exact mathematical way to solve the "optimal control" problem of the network.
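The link between backpropagation and a backward co-state recursion can be sketched on a toy scalar residual network. Everything here is an illustrative assumption rather than the paper's construction: the update rule $x_{k+1} = x_k + h \tanh(w_k x_k)$, the squared-error loss, and the function names are ours. The co-state $\lambda_k$ is the sensitivity of the loss to the state at layer $k$, and it is propagated from the output back to the input.

```python
import math

def forward(x0, weights, h):
    """Residual updates x_{k+1} = x_k + h * tanh(w_k * x_k); returns all states."""
    xs = [x0]
    for w in weights:
        xs.append(xs[-1] + h * math.tanh(w * xs[-1]))
    return xs

def loss(x0, weights, h, target):
    return 0.5 * (forward(x0, weights, h)[-1] - target) ** 2

def adjoint_gradients(x0, weights, h, target):
    """Backward co-state recursion; returns dLoss/dw_k for every layer."""
    xs = forward(x0, weights, h)
    lam = xs[-1] - target                            # terminal co-state dLoss/dx_K
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        sech2 = 1.0 - math.tanh(weights[k] * xs[k]) ** 2
        grads[k] = lam * h * xs[k] * sech2           # uses lambda_{k+1}
        lam = lam * (1.0 + h * weights[k] * sech2)   # lambda_k from lambda_{k+1}
    return grads

x0, h, target = 0.5, 0.1, 1.0
weights = [0.3, -0.8, 1.2]                           # hypothetical layer weights
grads = adjoint_gradients(x0, weights, h, target)

# Independent check: central finite differences on the loss.
delta = 1e-6
for k, g in enumerate(grads):
    up = weights[:k] + [weights[k] + delta] + weights[k + 1:]
    down = weights[:k] + [weights[k] - delta] + weights[k + 1:]
    fd = (loss(x0, up, h, target) - loss(x0, down, h, target)) / (2 * delta)
    print(f"layer {k}: adjoint={g:.8f}, finite difference={fd:.8f}")
```

The two columns should agree closely, which is the sense in which the backward pass is a recursion for the co-state rather than a separate trick.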

The "Tropical" Limit: The Decision Tree

Finally, the paper connects deep learning to something much older: Tropical Algebra.

In normal math, you add and multiply.
In "Tropical" math (the limit where $\epsilon = 0$ ), you only use Max and Add.
The paper shows that if you turn the knob all the way down, a complex neural network collapses into a simple Decision Tree (a series of "If this, then that" rules).
This means a deep neural network is just a "smoothed out" version of a decision tree. The "soft" probabilities we see in AI are just the tree's way of hesitating before making a hard choice.
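The "Max and Add" arithmetic can be tried out directly. The sketch below is a generic illustration with invented numbers and function names of our own: `oplus` is the $\epsilon$-smoothed addition that turns into max as $\epsilon \to 0$, and `tropical_layer` is the $\epsilon = 0$ version of a one-dimensional layer, which simply reports which straight-line piece wins.

```python
import math

def oplus(a, b, eps):
    """Smoothed addition: eps * log(exp(a / eps) + exp(b / eps))."""
    m = max(a, b)
    return m + eps * math.log(math.exp((a - m) / eps) + math.exp((b - m) / eps))

def tropical_layer(x, slopes, intercepts):
    """eps = 0 limit of a 1D LSE layer: returns (value, index of the winning piece)."""
    pieces = [w * x + b for w, b in zip(slopes, intercepts)]
    j = max(range(len(pieces)), key=pieces.__getitem__)
    return pieces[j], j

# Ordinary + plays the role of multiplication and distributes over oplus.
a, b, c, eps = 1.0, 2.0, 0.7, 0.5
lhs = oplus(a, b, eps) + c
rhs = oplus(a + c, b + c, eps)
print("distributive law gap:", abs(lhs - rhs))
print("oplus at tiny eps:", oplus(a, b, 1e-3), "vs max:", max(a, b))

# Hypothetical weights: the winning piece changes as x crosses -2/3 and 8/3.
slopes, intercepts = [-1.0, 0.5, 2.0], [0.0, 1.0, -3.0]
for x in (-2.0, 0.0, 5.0):
    print(x, tropical_layer(x, slopes, intercepts))
```

Reading off the winning index is the "If this, then that" rule: for these numbers, piece 0 wins when $x < -2/3$, piece 1 wins up to $x = 8/3$, and piece 2 wins beyond that.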

Summary

This paper claims that deep learning isn't a mysterious black box. It is a physics engine.

The weights are the initial conditions of a heat equation.
The forward pass is the heat spreading out.
The backward pass is the heat flowing backward to find the source.
The knob ( $\epsilon$ ) controls whether the system acts like a smooth fluid (modern AI) or a rigid crystal (decision trees).

By understanding the network as a physics equation, we can predict its limits, its robustness, and exactly how much data and computing power we need to solve a problem.

Technical Summary: The Hamilton–Jacobi Theory of Deep Learning

Problem Statement

The paper addresses a fundamental theoretical gap in deep learning: while neural networks are often used to approximate solutions to partial differential equations (PDEs), the question of what specific equation a trained neural network solves has remained largely unanswered. Conventional approaches treat the PDE as an external constraint imposed via loss functions (e.g., Physics-Informed Neural Networks). This work posits that the architecture itself, specifically layers utilizing Log-Sum-Exp (LSE) activations, intrinsically encodes the solution to a viscous Hamilton–Jacobi (HJ) equation. The core challenge is to establish an exact, non-approximate correspondence between neural network operations and the mathematical structures of HJ PDEs, tropical algebra, and convex optimization, unified by a single deformation parameter $\epsilon$ .

Methodology

The authors employ a unified mathematical framework centered on Maslov dequantization and the Hopf–Cole transformation.

The Deformation Parameter ( $\epsilon$ ): The paper identifies $\epsilon$ (the softmax temperature) as a deformation parameter that interpolates between two algebraic worlds:
- $\epsilon > 0$ : The standard arithmetic semiring $(\mathbb{R}, +, \times)$ , where the network operates as a smooth, entropy-regularized system.
- $\epsilon \to 0$ : The tropical semiring $(\mathbb{R}, \max, +)$ , where the network collapses to a max-affine spline operator (MASO) or decision tree.
  This transition is an exact semiring homomorphism, not a numerical approximation.
The LSE Layer as a PDE Solver: The authors demonstrate that a single feedforward layer with LSE activation, defined as $f_\epsilon(x) = \epsilon \log \sum_j \exp((W_j \cdot x + b_j)/\epsilon)$ , is algebraically identical to the Hopf–Cole solution of a viscous Hamilton–Jacobi equation:
$\partial_t u + H(\nabla u) = \epsilon \Delta u$
Specifically, for a quadratic Hamiltonian $H(p) = |p|^2$ , the layer output is exactly related to the PDE solution $u_\epsilon(x,t)$ via a quadratic shift: $f_\epsilon(x) = |x|^2/(4t) - u_\epsilon(x,t)$ . The weights $W$ and biases $b$ encode the initial data $g(y)$ and support points $y_j$ of the PDE's initial condition.
Architectural Generalization: The framework extends beyond simple feedforward networks:
- ResNets: Interpreted as Euler discretizations of the characteristic ODEs of the HJ equation.
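- Transformers: Attention mechanisms are identified as vector-valued Hopf–Cole averages (Gibbs expectations) under a specific temperature scaling ( $\epsilon = \sqrt{d}$ , where $d$ here is the key/query dimension of scaled dot-product attention rather than the data dimension).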
- RNNs/SSMs: Viewed as discretizations of time-dependent characteristic equations.
Commutative Diagram: The paper constructs a commutative diagram linking four perspectives: Neural Networks, Tropical Algebra, Viscous/Inviscid PDEs, and Convex Optimization. The limits $\epsilon \to 0$ (ultradiscretization) and $N \to \infty$ (infinite width) commute under Lipschitz conditions.
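The LSE/Hopf–Cole correspondence stated above can be checked numerically in one dimension. The sketch below is our own reconstruction under stated assumptions, not the authors' code: the support points `ys`, initial values `gs`, and all function names are hypothetical; the weights are taken as $W_j = y_j/(2t)$ and the biases as $b_j = -(y_j^2/(4t) + g_j)$, minus a heat-kernel normalization constant $\tfrac{\epsilon}{2}\log(4\pi\epsilon t)$ that we absorb into the bias. With those choices, $f_\epsilon(x) = |x|^2/(4t) - u_\epsilon(x,t)$ should hold to rounding error, and $u_\epsilon$ should satisfy $\partial_t u + |\partial_x u|^2 = \epsilon\, \partial_{xx} u$ up to finite-difference error.

```python
import math

def lse(values, eps):
    m = max(values)
    return m + eps * math.log(sum(math.exp((v - m) / eps) for v in values))

def hopf_cole_u(x, t, eps, ys, gs):
    """u = -eps * log(sum_j K(x - y_j, t) * exp(-g_j / eps)), K the 1D heat kernel."""
    norm = 0.5 * eps * math.log(4.0 * math.pi * eps * t)
    return -lse([-(x - y) ** 2 / (4.0 * t) - g - norm for y, g in zip(ys, gs)], eps)

def lse_layer(x, t, eps, ys, gs):
    """LSE layer with W_j = y_j / (2t), b_j = -(y_j**2 / (4t) + g_j) - norm."""
    norm = 0.5 * eps * math.log(4.0 * math.pi * eps * t)
    return lse([y / (2.0 * t) * x - (y * y / (4.0 * t) + g) - norm
                for y, g in zip(ys, gs)], eps)

def pde_residual(x, t, eps, ys, gs, h=1e-4):
    """Finite-difference estimate of u_t + (u_x)**2 - eps * u_xx."""
    u = lambda xx, tt: hopf_cole_u(xx, tt, eps, ys, gs)
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h ** 2
    return u_t + u_x ** 2 - eps * u_xx

# Hypothetical initial data: support points y_j with values g(y_j).
ys, gs = [-1.0, 0.0, 1.5], [0.2, 0.0, 0.5]
x, t, eps = 0.3, 0.5, 0.5

identity_gap = lse_layer(x, t, eps, ys, gs) - (x * x / (4 * t) - hopf_cole_u(x, t, eps, ys, gs))
residual = pde_residual(x, t, eps, ys, gs)
print("identity gap:", identity_gap)
print("PDE residual:", residual)
```

The identity gap is pure rounding noise because the two expressions differ only by expanding $|x - y_j|^2$; the PDE residual is limited by the finite-difference step `h`, not by the formula.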

Key Contributions

The paper establishes five primary theoretical results:

Exact Algebraic Identity (Theorem 4.1): It proves that an LSE-activated layer is not merely an approximation but an exact discrete-measure instantiation of the Hopf–Cole solution to a viscous HJ equation. No residual loss is required; the PDE is satisfied by construction.
Tropical Limit and Convex Optimization (Theorem 5.1): It rigorously shows that as $\epsilon \to 0$ , the network converges to the Hopf–Lax formula, which is simultaneously the unique viscosity solution of the inviscid HJ equation, a tropical inner product, and a max-affine spline operator (MASO) that can be posed as a linear program.
Unified Commutative Diagram (Theorem 7.1): It unifies the four perspectives (NN, Tropical, PDE, Optimization) into a single framework where limits can be exchanged. This confirms that the network is a "universal classical HJ simulator" for quadratic Hamiltonians.
Quantitative Consequences:
- Generalization (Theorem 8.1): Derives a minimax optimal generalization rate of $O(n^{-1/(d+2)})$ by balancing approximation error (quadrature) and estimation error, linking the optimal viscosity $\epsilon^*$ to the network width $N$ and data dimension $d$ .
- Adversarial Robustness (Corollary 8.2): Provides a certified robustness bound where the Hessian norm is inversely proportional to $\epsilon$ , proving that viscosity controls the network's sensitivity to perturbations.
- Backpropagation (Theorem 8.4): Identifies backpropagation as the co-state equation (adjoint system) of the Hamiltonian system governing the network, formally linking training to the Pontryagin Maximum Principle (PMP).
- Scaling Laws (Proposition 8.8): Explains empirical scaling laws ( $L \propto N^{-\alpha}$ ) as a consequence of the intrinsic dimension $d_{eff}$ of the data manifold, predicting $\alpha = 1/d_{eff}$ .
Influence Functions and Bifurcation (Theorem 8.9): Derives a closed-form $O(N)$ influence function for softmax weights and characterizes the "attribution entropy landscape," showing that as $\epsilon$ increases, the landscape undergoes fold bifurcations where attribution basins merge.
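The $1/\epsilon$ dependence of the Hessian in the robustness item can be illustrated in one dimension. This is a sketch of a standard calculation, not the statement of Corollary 8.2: for $f_\epsilon(x) = \epsilon \log \sum_j \exp((w_j x + b_j)/\epsilon)$ the second derivative equals the variance of the slopes $w_j$ under the softmax weights, divided by $\epsilon$, and that variance is at most $(w_{\max} - w_{\min})^2/4$. The weights, biases, and function names below are hypothetical.

```python
import math

def softmax_weights(x, ws, bs, eps):
    scores = [(w * x + b) / eps for w, b in zip(ws, bs)]
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]

def curvature(x, ws, bs, eps):
    """f''(x) for the 1D LSE layer: variance of the slopes under softmax, over eps."""
    p = softmax_weights(x, ws, bs, eps)
    mean = sum(pi * w for pi, w in zip(p, ws))
    return sum(pi * (w - mean) ** 2 for pi, w in zip(p, ws)) / eps

ws, bs = [-1.0, 0.5, 2.0], [0.0, 1.0, -3.0]   # hypothetical 1D weights and biases
grid = [i / 10.0 for i in range(-100, 101)]

for eps in (0.1, 1.0, 10.0):
    worst = max(curvature(x, ws, bs, eps) for x in grid)
    bound = (max(ws) - min(ws)) ** 2 / (4.0 * eps)
    print(f"eps={eps}: max curvature={worst:.4f}, bound={bound:.4f}")
```

Larger $\epsilon$ tightens the curvature bound, which is the mechanism behind the claim that viscosity controls sensitivity to small input perturbations.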

Results

The paper validates its theoretical claims through both analytical proofs and numerical experiments:

Identity Verification: Numerical checks confirm the LSE-PDE identity holds to machine precision ( $\sim 10^{-16}$ ) across various $\epsilon$ values and dimensions.
Quadrature Convergence: Experiments on synthetic data demonstrate that the approximation error decays as $O(N^{-1/d})$ , confirming the theoretical quadrature bounds.
Scaling Laws: Trained networks exhibit scaling exponents consistent with the intrinsic dimension of the data, validating the link between PDE quadrature theory and empirical scaling laws.
Robustness: Experiments on MNIST and CIFAR-10 verify that increasing $\epsilon$ reduces the spectral norm of the Hessian and enlarges the certified adversarial radius, matching the theoretical bounds.
Bifurcation Analysis: Visualizations of the attribution entropy landscape confirm the predicted fold bifurcations as viscosity increases, showing the transition from "particle-like" (sharp, discrete attribution) to "wave-like" (diffusive, uniform attribution) regimes.
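For readers who want to test the scaling-law prediction on their own runs, a scaling exponent is usually read off as the slope of a log-log fit. The sketch below is a generic least-squares fit, not the paper's procedure, and the losses are synthetic, noise-free values generated from $L = 3\,N^{-1/d_{eff}}$ with an assumed $d_{eff} = 4$; they are not results from the paper.

```python
import math

def fit_power_law(ns, losses):
    """Least-squares line through (log N, log L); returns (alpha, c) for L = c * N**(-alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope, math.exp(my - slope * mx)

# Synthetic losses from an exact power law (assumed d_eff = 4); not data from the paper.
d_eff = 4
widths = [16, 64, 256, 1024, 4096]
losses = [3.0 * n ** (-1.0 / d_eff) for n in widths]

alpha, prefactor = fit_power_law(widths, losses)
implied_dimension = 1.0 / alpha
print(f"alpha={alpha:.4f}, prefactor={prefactor:.4f}, implied d_eff={implied_dimension:.2f}")
```

With real measurements the fit would carry noise, so the implied dimension $1/\alpha$ should be compared against an independent estimate of the data's intrinsic dimension rather than taken at face value.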

Significance and Claims

The paper claims to provide a unifying mathematical theory of deep learning that resolves the question "What equation does a neural network solve?" with an exact answer: a trained LSE network solves a viscous Hamilton–Jacobi initial-value problem.

Unification: It connects previously separate threads (Maslov dequantization, Hopf–Cole linearization, ResNet-as-ODE, and scaling laws) in a single commutative diagram.
Exactness: Unlike previous works that view networks as approximators of PDEs, this work asserts the network is the PDE solution operator.
Design Principles: The theory yields actionable prescriptions, such as setting the optimal temperature $\epsilon^* \approx N^{-1/d}$ to minimize generalization error and using $\epsilon$ to control the robustness-expressiveness trade-off.
Physical Analogue: The framework draws a precise parallel between neural computation and physics: the network is a "universal classical HJ simulator" (analogous to Feynman's universal quantum simulator), where the Gibbs measure is positive (classically tractable) unlike the Wigner function in quantum mechanics.

The authors emphasize that while the exact correspondence holds for quadratic Hamiltonians (LSE layers), the structural insights extend to broader architectures (ResNets, Transformers, RNNs) as discretizations of HJ characteristics, providing a rigorous foundation for understanding deep learning dynamics, generalization, and robustness through the lens of PDE theory.

The Hamilton-Jacobi Theory of Deep Learning