Imagine you are trying to teach a robot to draw a perfect picture of a mountain range. The mountain has smooth slopes, but also jagged, rocky cliffs and sharp peaks.
The Problem: The "Smooth" Robot
Most standard AI models (called multilayer perceptrons, or MLPs) are like artists who only know how to use smooth, curved brushes. They are great at drawing gentle hills, but when they try to draw a jagged cliff, they get stuck. They keep trying to smooth out the sharp edges, making the picture blurry and inaccurate. To fix this, you usually have to give them a massive canvas with millions of tiny brushstrokes, which takes forever to paint.
The New Tool: The "Scalpel" Robot (KANs)
Enter Kolmogorov-Arnold Networks (KANs). Think of these as artists who use a set of specialized scalpel-like brushes (called splines). These brushes are naturally shaped to fit both smooth curves and sharp, jagged edges perfectly. They are much better at capturing the details of the mountain.
However, there was a catch: Teaching the "Scalpel" robot was slow and messy. It was like trying to organize a library where every book was written in a different, confusing dialect. The math behind the scenes was tangled, making it hard to train them efficiently.
The Breakthrough: The "Translator" and the "Ladder"
This paper introduces two clever tricks to make training these "Scalpel" robots fast and effective.
1. The Translator (The Change of Basis)
The authors realized that the "Scalpel" brushes (splines) and the "Smooth" brushes (standard AI) are actually speaking the same language, just with different accents.
- The Analogy: Imagine you have a recipe written in French (Splines) and you want to cook it, but your kitchen only understands English (Standard AI). Instead of rewriting the whole recipe from scratch, the authors found a simple translator (a mathematical matrix).
- The Result: This translator instantly converts the French recipe into English without changing the taste of the dish. But here's the magic: once converted, the cooking process becomes much faster. It turns a complex, recursive calculation (like a long, winding staircase) into a simple, straight-line calculation (like an elevator). This speeds up the initial training significantly.
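To make the "translator" idea concrete, here is a minimal, hypothetical sketch (not the paper's actual matrix). It expresses the same quadratic curve in two bases: a Bernstein basis, which is spline-like and evaluated by a recursion (the winding staircase), and the monomial basis, which a fast Horner loop evaluates directly (the elevator). A single change-of-basis matrix converts one set of coefficients into the other without changing the curve.

```python
import numpy as np

# Change-of-basis matrix M: Bernstein coefficients -> monomial coefficients.
# Derived from B0 = 1 - 2t + t^2,  B1 = 2t - 2t^2,  B2 = t^2.
M = np.array([[1.0,  0.0, 0.0],
              [-2.0, 2.0, 0.0],
              [1.0, -2.0, 1.0]])

def eval_bernstein(c, t):
    """Recursive evaluation (de Casteljau): the 'winding staircase'."""
    b = np.array(c, dtype=float)
    for r in range(1, len(b)):
        b[:len(b) - r] = (1 - t) * b[:len(b) - r] + t * b[1:len(b) - r + 1]
    return b[0]

def eval_monomial(m, t):
    """Straight-line evaluation (Horner's rule): the 'elevator'."""
    result = 0.0
    for coeff in reversed(m):
        result = result * t + coeff
    return result

c = np.array([1.0, 3.0, 2.0])  # coefficients in the spline ("French") basis
m = M @ c                      # one matrix multiply: the "translator"

# Both representations describe the exact same curve.
for t in (0.0, 0.25, 0.5, 1.0):
    assert abs(eval_bernstein(c, t) - eval_monomial(m, t)) < 1e-12
```

The dish tastes the same at every point, but the translated recipe can be cooked with a cheap, non-recursive loop.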
2. The Ladder (Multilevel Training)
This is the paper's biggest innovation. Usually, when you want to draw a more detailed mountain, you just start over on a bigger canvas. This is inefficient.
The authors built a Ladder approach, inspired by how engineers solve massive construction problems.
- The Coarse Step (The Rough Sketch): First, you train the robot on a tiny, low-resolution version of the mountain. You get the big shapes right (the general slope).
- The Refined Step (Adding Detail): Instead of starting from scratch, you take that rough sketch and "zoom in" to a higher resolution. You add more splines (more brushstrokes) to the specific areas that need detail.
- The Secret Sauce: The authors designed a special transfer mechanism (like a magic photocopier) that takes the progress made on the small sketch and perfectly maps it onto the big canvas.
- Why this matters: In standard AI, when you zoom in, the robot often forgets what it learned on the small scale and gets confused. In this new method, the robot keeps its progress. It starts the high-resolution training with a head start, knowing exactly where the big slopes are, so it can focus entirely on fixing the jagged rocks.
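The "magic photocopier" can be sketched in a few lines for the simplest spline family, piecewise-linear hat functions (an illustrative toy, not the paper's exact operator). Because hat-function coefficients are just the values at the grid nodes, refining the grid is lossless: old nodes keep their values, and new midpoints get the average of their neighbors. The function the robot has already learned is carried over exactly.

```python
import numpy as np

def prolongate(coarse):
    """Map coarse hat-function coefficients to a grid with twice the
    resolution. New midpoint coefficients are neighbor averages, so the
    represented piecewise-linear function is unchanged."""
    n = len(coarse)
    fine = np.zeros(2 * n - 1)
    fine[0::2] = coarse                            # keep the old nodes
    fine[1::2] = 0.5 * (coarse[:-1] + coarse[1:])  # insert midpoints
    return fine

coarse = np.array([0.0, 1.0, 0.5, 2.0])  # the rough sketch (4 nodes)
fine = prolongate(coarse)                # the head start (7 nodes)
# fine -> [0., 0.5, 1., 0.75, 0.5, 1.25, 2.]
```

High-resolution training then starts from `fine` rather than from scratch, so the optimizer only has to learn the corrections the coarse grid could not represent.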
Why It Works So Well (The "Relaxation" Concept)
The paper explains that different parts of the mountain require different tools.
- The Smooth Slopes: These are easy to learn. The "Scalpel" robot learns these quickly on the small scale.
- The Jagged Rocks: These are hard. They require the high-resolution detail.
In standard AI, the robot tries to learn the rocks and the slopes at the same time, often getting stuck. In this Multilevel approach:
- The robot learns the slopes on the small ladder rung.
- When it moves up the ladder, it doesn't waste time re-learning the slopes. It immediately starts focusing its energy on the rocks.
This is like a student who masters basic arithmetic before moving to algebra. They don't re-learn how to add numbers every time they try to solve a complex equation; they build on what they already know.
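A tiny numerical experiment (our illustration, not the paper's code) shows why the division of labor works: approximate a smooth "slope" and a kinked "rock" with piecewise-linear splines and watch where the error lives as the grid is refined.

```python
import numpy as np

def interp_error(f, n):
    """Max error of piecewise-linear interpolation of f on n equispaced nodes in [0, 1]."""
    nodes = np.linspace(0.0, 1.0, n)
    xs = np.linspace(0.0, 1.0, 2001)          # dense grid for measuring error
    approx = np.interp(xs, nodes, f(nodes))   # the spline "drawing"
    return np.max(np.abs(f(xs) - approx))

smooth = lambda x: np.sin(np.pi * x)   # gentle slope: easy at low resolution
jagged = lambda x: np.abs(x - 0.51)    # sharp kink: needs resolution near it

# The smooth error shrinks roughly 4x per refinement and is already tiny by
# n = 33; the kink's error barely moves, so nearly all of the remaining work
# on the fine grid is concentrated at the kink.
errors = {n: (interp_error(smooth, n), interp_error(jagged, n)) for n in (9, 17, 33)}
```

The coarse rungs of the ladder dispose of the slopes almost for free; the fine rungs inherit that progress and spend their capacity on the rocks.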
The Results
The authors tested this on complex physics problems (like predicting how heat moves through a material or how fluids flow).
- Standard AI: Struggled, took a long time, and the pictures were blurry.
- New Method: The robot learned 10 to 1,000 times faster and produced incredibly sharp, accurate results. It could capture the "jagged rocks" of the data that other models missed.
Summary
Think of this paper as inventing a smart construction crew for AI.
- They found a way to translate the crew's instructions so they work faster.
- They built a ladder that lets the crew build a skyscraper floor-by-floor, ensuring that the foundation laid on the first floor is perfectly preserved as they build the 50th floor.
This means we can now train powerful, detailed AI models much faster, especially for scientific tasks where precision is everything.