Training Deep Physics-Informed Kolmogorov-Arnold… — Plain-Language Explanation

Original authors: Spyros Rigas, Fotios Anagnostopoulos, Michalis Papachristou, Georgios Alexandridis

Published 2026-01-22

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a computer to solve complex physics puzzles, like predicting how heat spreads through a metal plate or how water flows around a boat. For years, the standard tool for this job has been a type of AI called a Neural Network (specifically, a Physics-Informed Neural Network, or PINN). Think of these networks as a team of workers trying to solve a maze.

Recently, a new, smarter type of worker called a KAN (Kolmogorov–Arnold Network) was introduced. KANs are like workers who can change their own tools as they work, making them incredibly flexible and accurate. However, there's a catch: when you try to build a very deep team of KANs (a "deep architecture" with many layers of workers), the team often falls apart. They get confused, their signals get lost, and they stop learning entirely. It's like trying to whisper a secret through a line of 20 people; by the time it reaches the end, it's just noise.

This paper introduces two major fixes to make deep KAN teams work reliably.

1. The "Glorot-like" Initialization: Setting the Right Volume

The Problem: When you start a new KAN team, you have to assign them their starting "volume" (mathematically, their initial weights). The old method was like guessing where to set the volume knob; sometimes it was too quiet (the signal dies), and sometimes it was too loud (the signal explodes). This made training deep teams unreliable and often impossible.

The Solution: The authors invented a new way to set that starting volume, called a "Glorot-like initialization."

The Analogy: Imagine tuning a radio before a broadcast. The old method was just turning the dial randomly. The new method is like using a precise scientific instrument to find the exact frequency where the signal is clearest, no matter what kind of music (basis function) the station is playing.
The Result: With this precise "tuning," the KANs stay stable, so deeper teams can tackle more complex puzzles without losing their way. In many tests, this simple fix cut the AI's errors by several orders of magnitude.

2. The RGA KAN: The "Residual-Gated" Safety Net

The Problem: Even with the perfect volume setting, some very deep teams (especially for tricky puzzles like the Allen-Cahn equation) still got stuck. They would start learning, but then hit a wall and stop improving.

The Solution: The authors built a new architecture called RGA KAN (Residual-Gated Adaptive KAN). They took inspiration from a previous design called "PirateNet" and added a special mechanism.

The Analogy: Imagine a relay race. In a standard deep network, the baton is passed from runner to runner in a straight line. If one runner drops it, the whole race is over.
The RGA KAN adds a "smart gate" at every step. This gate acts like a referee who can decide: "Do I pass the baton to the next runner, or do I let the current runner keep running for a bit longer?"
- The "Gate" (Alpha and Beta): These are adjustable dials. At the start, the gate might be closed, letting the team run as a shallow, simple group. As training progresses, the gate opens, allowing the team to grow deeper and tackle harder problems. If the team starts to get confused, the gate can close slightly to stabilize them.
The Result: This "safety net" allows the AI to go as deep as needed without falling apart. It successfully navigates the entire learning process, whereas the old methods would get stuck in the middle.

How They Proved It Worked

The researchers tested their new system on nine different physics puzzles (like the heat equation, fluid flow, and wave equations).

The Competition: They compared their new RGA KAN against the standard cPIKAN (the existing Chebyshev-based KAN method) and PirateNet (a state-of-the-art MLP-based method), giving each model a matching number of parameters.
The Outcome: The RGA KAN won almost every time.
- Accuracy: It was often orders of magnitude more accurate (meaning the errors were tiny fractions of what the others produced).
- Stability: When the other methods crashed (diverged) and gave up on the harder puzzles, the RGA KAN kept going and found the solution.
- Consistency: It didn't matter which random starting point they used; the new method was reliable.

The "Secret Sauce" of Training

The paper also tested different "training strategies" (like adjusting how much attention the AI pays to different parts of the puzzle). They found that while the new architecture was the main hero, combining it with specific adaptive techniques, such as Residual-based Attention (RBA) and Residual-based Adaptive Distribution (RAD), made it even stronger. However, even without these extra tricks, the new architecture was far superior to the old ones.

Summary

In simple terms, this paper says:

Old KANs were great but fragile when made too deep.
Fix #1: We found a better way to start them off (Initialization) so they don't get confused immediately.
Fix #2: We built a new "smart gate" system (RGA KAN) that lets the AI grow deeper safely, acting like a safety net that prevents it from falling off a cliff.
Result: This new system solves complex physics problems much better and more reliably than the current state-of-the-art methods, often by huge margins.

The authors conclude that while their system is slower per training step (because the gates and basis functions add extra math), the large gain in accuracy and stability makes it worth it, especially for difficult problems where other methods simply fail.

Technical Summary: Training Deep Physics-Informed Kolmogorov–Arnold Networks

Problem Statement
Kolmogorov–Arnold Networks (KANs) have emerged as a promising alternative to Multilayer Perceptrons (MLPs) in Physics-Informed Machine Learning (PIML), offering enhanced interpretability and robustness against spectral bias. Specifically, Chebyshev-based Physics-Informed KANs (cPIKANs) have become a standard due to their computational efficiency compared to B-spline variants. However, cPIKANs face significant challenges when scaled to deep architectures. Empirical studies indicate that as network depth increases, cPIKANs suffer from training instabilities and divergence, limiting their applicability to complex Partial Differential Equation (PDE) problems. Furthermore, existing weight initialization schemes for KANs remain largely ad hoc, lacking a theoretical foundation comparable to the Glorot initialization used for MLPs. Additionally, there is a lack of a unified training pipeline incorporating adaptive strategies for cPIKANs, and the mechanisms behind their failure in deep regimes are not fully understood.

Methodology
The authors propose a two-pronged approach to address depth-scaling limitations in cPIKANs: a novel initialization scheme and a new deep architecture.

Basis-Agnostic Glorot-like Initialization:
The authors derive a weight initialization scheme for KANs based on variance preservation during both forward and backward passes. Unlike previous heuristics specific to B-splines, this scheme is "basis-agnostic," meaning it does not assume a specific basis function family. By analyzing the variance of the output signal and its gradient with respect to the input, they derive a standard deviation for the basis coefficients ($w_{jim}$) that balances the contributions of the input dimension ($d_I$), output dimension ($d_O$), and the number of basis functions ($D$). This approach aims to prevent vanishing or exploding gradients, mirroring the success of Glorot initialization in MLPs.
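To make the variance argument concrete, here is a minimal sketch of the forward-pass half only. It assumes a Chebyshev layer of the form $y_j = \sum_i \sum_m w_{jim} T_m(x_i)$ with $D$ basis functions $T_0, \dots, T_{D-1}$, inputs drawn uniformly from $[-1, 1]$, and independent zero-mean Gaussian coefficients. The function names and the Monte Carlo moment estimate are illustrative and are not taken from the paper, whose rule also accounts for the backward pass and $d_O$; refer to the original for the exact formula.

```python
import math
import random


def chebyshev_basis(x, D):
    """Return [T_0(x), ..., T_{D-1}(x)] via T_{m+1} = 2x*T_m - T_{m-1}."""
    values = [1.0, x]
    for _ in range(2, D):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[:D]


def forward_preserving_std(d_in, D, rng, n_samples=20000):
    """Std of w_{jim} such that Var(y_j) matches Var(x_i) (forward pass only).

    Assumption: x_i ~ U(-1, 1) and independent zero-mean weights, so that
    Var(y_j) = sigma^2 * d_in * sum_m E[T_m(x)^2].
    """
    total = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        total += sum(t * t for t in chebyshev_basis(x, D))
    second_moment_sum = total / n_samples  # Monte Carlo estimate of sum_m E[T_m^2]
    input_var = 1.0 / 3.0                  # variance of U(-1, 1)
    return math.sqrt(input_var / (d_in * second_moment_sum))


def empirical_output_variance(d_in, D, sigma, rng, n_trials=4000):
    """Draw fresh weights and inputs on every trial and measure Var(y_j)."""
    outputs = []
    for _trial in range(n_trials):
        y = 0.0
        for _i in range(d_in):
            basis = chebyshev_basis(rng.uniform(-1.0, 1.0), D)
            y += sum(rng.gauss(0.0, sigma) * t for t in basis)
        outputs.append(y)
    mean = sum(outputs) / n_trials
    return sum((y - mean) ** 2 for y in outputs) / n_trials


rng = random.Random(0)
d_in, D = 8, 5
sigma = forward_preserving_std(d_in, D, rng)
var_y = empirical_output_variance(d_in, D, sigma, rng)
print(f"sigma = {sigma:.4f}, Var(y) = {var_y:.4f} (target 1/3)")
```

With this choice the measured output variance should land close to the input variance of 1/3, which is the sense in which the signal neither dies out nor blows up as it passes through a layer.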
Residual-Gated Adaptive KANs (RGA KANs):
Recognizing that initialization alone is insufficient for all deep PDE settings (e.g., the Allen–Cahn equation), the authors introduce the RGA KAN architecture, inspired by the PirateNet architecture for MLPs. Key components include:
- Embedding: Periodic boundary conditions are enforced via sine/cosine embeddings.
- Sine-based Input Layer: A sine-based KAN layer processes the embedded input, acting similarly to Random Fourier Feature (RFF) embeddings.
- Adaptive Skip Connections: The core innovation involves stacking "RGA blocks." Each block contains Chebyshev-based KAN layers and learnable gating parameters ($\alpha$ and $\beta$). These gates dynamically modulate the effective depth of the network during training. Specifically, $\alpha$ controls the skip connection for the entire block, while $\beta$ controls the skip connection after the first layer within the block. This allows the network to start shallow (if initialized with $\alpha=0$) and progressively deepen, or start deep and adaptively prune, stabilizing optimization.
- Physics-Informed Output: The final layer can be initialized to approximate the initial condition of the PDE via a least-squares fit.
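The gating idea in the list above can be sketched as follows. This is a hypothetical wiring, not the paper's exact block: it assumes two Chebyshev KAN layers per block, a `tanh` squashing of inputs into $[-1, 1]$, and simple convex blends for both gates. The names `kan_layer`, `random_weights`, and `rga_block`, and the weight scale 0.3, are made up for illustration.

```python
import math
import random


def chebyshev_basis(x, D):
    """Return [T_0(x), ..., T_{D-1}(x)] via T_{m+1} = 2x*T_m - T_{m-1}."""
    values = [1.0, x]
    for _ in range(2, D):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[:D]


def kan_layer(x, weights):
    """Chebyshev KAN layer: y_j = sum_i sum_m weights[j][i][m] * T_m(tanh(x_i))."""
    D = len(weights[0][0])
    feats = [chebyshev_basis(math.tanh(x_i), D) for x_i in x]
    return [
        sum(w * feats[i][m] for i, w_ji in enumerate(w_j) for m, w in enumerate(w_ji))
        for w_j in weights
    ]


def random_weights(d_out, d_in, D, sigma, rng):
    """Gaussian coefficients with an arbitrary illustrative scale."""
    return [[[rng.gauss(0.0, sigma) for _ in range(D)]
             for _ in range(d_in)] for _ in range(d_out)]


def rga_block(x, w1, w2, alpha, beta):
    """Hypothetical gated block (the paper's exact wiring may differ).

    beta blends the first layer's output with the block input;
    alpha blends the whole block's output with the block input.
    """
    h = kan_layer(x, w1)
    h = [beta * h_k + (1.0 - beta) * x_k for h_k, x_k in zip(h, x)]
    g = kan_layer(h, w2)
    return [alpha * g_k + (1.0 - alpha) * x_k for g_k, x_k in zip(g, x)]


rng = random.Random(0)
width, D = 4, 3
w1 = random_weights(width, width, D, 0.3, rng)
w2 = random_weights(width, width, D, 0.3, rng)
x = [0.1, -0.4, 0.7, 0.0]

identity_out = rga_block(x, w1, w2, alpha=0.0, beta=0.0)  # gate closed: block is skipped
deep_out = rga_block(x, w1, w2, alpha=1.0, beta=1.0)      # gate open: two stacked layers
```

With $\alpha = 0$ the block returns its input unchanged, so a stack of such blocks behaves like a shallow network at the start of training; as the optimizer moves $\alpha$ away from zero, each block contributes more and the effective depth grows.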
Information Bottleneck (IB) Analysis:
To understand the training dynamics, the authors apply Information Bottleneck theory. They monitor the Signal-to-Noise Ratio (SNR) of gradients and the geometric complexity of the network. They hypothesize that successful training requires traversing three phases: fitting, diffusion, and diffusion equilibrium.
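As a rough illustration of the gradient SNR being monitored, the sketch below uses a definition common in Information Bottleneck analyses: the norm of the mean gradient across mini-batches divided by the norm of its standard deviation. Whether the paper batches or normalizes in exactly this way is an assumption here, and `gradient_snr` is an illustrative name.

```python
import math
import statistics


def gradient_snr(batch_gradients):
    """SNR = ||mean gradient||_2 / ||std of gradient||_2 across mini-batches.

    batch_gradients: list of flattened gradient vectors, one per mini-batch.
    """
    per_param = list(zip(*batch_gradients))  # regroup values by parameter
    mean_norm = math.sqrt(sum(statistics.fmean(p) ** 2 for p in per_param))
    std_norm = math.sqrt(sum(statistics.pstdev(p) ** 2 for p in per_param))
    return math.inf if std_norm == 0.0 else mean_norm / std_norm


# Batches that agree on a direction give a high SNR ...
agreeing = [[1.0, 0.0], [3.0, 0.0]]
# ... while batches that cancel each other out give an SNR of zero.
cancelling = [[1.0, -1.0], [-1.0, 1.0]]

snr_agreeing = gradient_snr(agreeing)
snr_cancelling = gradient_snr(cancelling)
print(snr_agreeing, snr_cancelling)
```

In this picture, a high SNR corresponds to batches pulling in a consistent direction (fitting), while a low SNR corresponds to noise-dominated updates (diffusion).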
Unified Training Pipeline:
Experiments utilize a standardized pipeline incorporating adaptive techniques common in PINNs: Residual-based Attention (RBA), Residual-based Adaptive Distribution (RAD), causal training, and Learning Rate Annealing (LRA).
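Of these techniques, Residual-based Attention is the simplest to show in code. The sketch below follows the commonly used RBA update, in which each collocation point keeps a weight that decays by a factor $\gamma$ and grows in proportion to that point's normalized residual. The default values of `gamma` and `eta` and the function names are assumptions for illustration, not settings reported in the paper.

```python
def rba_update(weights, residuals, gamma=0.999, eta=0.01):
    """One RBA step: w_i <- gamma * w_i + eta * |r_i| / max_j |r_j|."""
    max_abs = max(abs(r) for r in residuals)
    if max_abs == 0.0:
        # No residual anywhere: the weights only decay.
        return [gamma * w for w in weights]
    return [gamma * w + eta * abs(r) / max_abs for w, r in zip(weights, residuals)]


def weighted_residual_loss(weights, residuals):
    """Mean squared residual with each point scaled by its attention weight."""
    return sum((w * r) ** 2 for w, r in zip(weights, residuals)) / len(residuals)


weights = [0.0, 0.0, 0.0]
residuals = [1.0, -2.0, 0.5]   # hypothetical PDE residuals at three collocation points
weights = rba_update(weights, residuals)
loss = weighted_residual_loss(weights, residuals)
print(weights, loss)
```

Because each step adds at most `eta` and multiplies by `gamma`, the weights stay bounded by `eta / (1 - gamma)`, so points with persistently large residuals get more attention without the weights growing unboundedly.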

Key Contributions

Derivation of a Glorot-like Initialization: A theoretical derivation of a basis-agnostic initialization rule that significantly improves the stability and accuracy of cPIKANs over default schemes.
Introduction of RGA KANs: A novel deep architecture designed to mitigate divergence in deep cPIKANs through adaptive skip connections and gating mechanisms.
Theoretical Insight via IB Theory: An analysis demonstrating that RGA KANs successfully traverse all three training phases (fitting, diffusion, diffusion equilibrium), whereas baseline cPIKANs often stagnate in the diffusion phase, failing to generalize.
Comprehensive Benchmarking: Extensive evaluation on nine standard forward PDE benchmarks (the Burgers', Allen–Cahn, Korteweg–de Vries, sine-Gordon, Advection, Helmholtz, Poisson, Heat, and Navier–Stokes equations), comparing RGA KANs against parameter-matched cPIKANs and PirateNets.

Results

Initialization Impact: The proposed Glorot-like initialization consistently outperforms the default cPIKAN initialization in function fitting and PDE tasks, often reducing relative $L_2$ errors by several orders of magnitude. In deep networks (e.g., Burgers' equation), the default initialization leads to divergence, while the proposed scheme maintains stability.
Architecture Performance: RGA KANs demonstrate superior stability and accuracy compared to both baseline cPIKANs and PirateNets. In benchmarks where cPIKANs and PirateNets diverge (e.g., Allen–Cahn, Advection, Korteweg–de Vries, sine-Gordon), RGA KANs converge to accurate solutions.
Error Reduction: Across nine PDE benchmarks, RGA KANs consistently outperform parameter-matched baselines, often by several orders of magnitude. For instance, on the Helmholtz equation, RGA KANs achieved errors in the $O(10^{-5})$ range, outperforming cPIKANs ($O(10^{-3})$) and PirateNets ($O(10^{-4})$).
Ablation Studies: The contribution of adaptive components (RBA, RAD, causal training, LRA) varies by PDE. While RGA KANs are robust, the removal of specific components (like LRA for sine-Gordon or RAD for Advection) can lead to divergence or significant error increases, highlighting the problem-dependent nature of these strategies.
Computational Cost: RGA KANs generally incur a higher computational cost per iteration than cPIKANs due to gating operations and basis function evaluations. However, in complex problems like Navier–Stokes, the cost gap narrows as the gating mechanisms become the primary bottleneck for both RGA KANs and PirateNets.
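For readers unfamiliar with the metric quoted in these results, the relative $L_2$ error is the norm of the difference between prediction and reference, divided by the norm of the reference. A minimal sketch follows; the sample values are made up for illustration and are not data from the paper.

```python
import math


def relative_l2_error(pred, ref):
    """||pred - ref||_2 / ||ref||_2 over matching evaluation points."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den


# Hypothetical reference solution sampled on a grid, and a prediction
# that is uniformly off by one part in 100,000.
ref = [math.sin(0.1 * k) + 2.0 for k in range(50)]
pred = [r * (1.0 + 1e-5) for r in ref]

err = relative_l2_error(pred, ref)
print(f"relative L2 error = {err:.2e}")
```

An error of order $10^{-5}$ therefore means the prediction deviates from the reference by roughly one part in 100,000 in this aggregate sense, two orders of magnitude tighter than an error of order $10^{-3}$.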

Significance and Claims
The paper claims that the proposed initialization and RGA KAN architecture jointly address the critical gap in deep physics-informed KANs. The authors assert that their work provides the first set of depth-scalable benchmarks for cPIKANs and demonstrates that deep KANs can be trained stably without diverging, a limitation previously observed in deep PINNs and cPIKANs. By successfully navigating the Information Bottleneck phases, RGA KANs achieve generalization capabilities that baseline architectures lack. The authors position their work not as a hyperparameter-tuned state-of-the-art for every specific PDE, but as a robust, unified framework that outperforms existing state-of-the-art architectures (PirateNets) and baseline KANs under a fixed, fair training pipeline. They suggest that their approach offers a strong foundation for future applications in operator learning and other KAN variants.
