KANs need curvature: penalties for compositional… — Plain-Language Explanation

The Problem: The "Jagged" Solution

Imagine you are trying to teach a robot to draw a smooth, flowing curve, like a sine wave. You give the robot a special set of tools called KANs (Kolmogorov–Arnold Networks). These tools are great because, unlike standard AI that works like a black box, KANs let you see exactly how they are drawing the picture. Each "brushstroke" (activation function) is visible and understandable.

However, the paper found a glitch. When these robots try to fit the data perfectly, they often get "jittery." Instead of drawing a smooth line, they draw a line that looks like a jagged mountain range or a scribble. It fits the data points perfectly, but it looks nothing like the smooth curve you expected.

The authors call this "high-curvature oscillation." In plain English: the robot is overthinking and adding unnecessary wiggles and kinks to its drawing.

The Old Fix: The "Lazy" Penalty

Previously, scientists tried to stop this jitter by using a standard "penalty." Think of this like a teacher telling the robot, "Don't use too much ink."

The Problem: This penalty only checks how much ink is used (the magnitude), not how it is used.
The Result: A robot can use a tiny bit of ink to draw a smooth line, or a tiny bit of ink to draw a crazy, jagged scribble. The old penalty can't tell the difference. It's like a teacher who only counts the number of words in an essay but doesn't read the sentences to see if they make sense. The robot keeps drawing jagged lines because the penalty doesn't "see" the jaggedness.

The New Fix: The "Smoothness" Penalty

The authors invented a new, smarter penalty. Instead of just counting ink, this new penalty measures the "bending energy" of the lines.

The Analogy: Imagine you are bending a flexible ruler. If you bend it gently into a smooth arc, it takes very little effort. If you try to twist it into a sharp zig-zag, it takes a lot of effort and energy.
The Solution: The new penalty charges the robot a "fee" based on how much energy it takes to bend its lines. If the robot tries to draw a jagged zig-zag, the fee is huge. If it draws a smooth curve, the fee is low.
The Outcome: The robot learns that to keep its "fee" low, it must draw smooth lines. The paper shows that with this new penalty, the robots can still draw the picture about as accurately as before, but the lines are now smooth, readable, and look like the real function they are trying to mimic.

Why This Matters: The "Chain Reaction"

One might ask: "If we just smooth out the individual brushstrokes, does the whole picture stay smooth?"

The Concern: In a deep network, the output of one layer becomes the input for the next. It's like a chain reaction. If the first layer is a bit wobbly, the next layer might amplify that wobble into a huge mess.
The Discovery: The authors proved mathematically that, under a few structural assumptions, if you smooth out the individual edges (the brushstrokes), you automatically put a "ceiling" on how messy the whole picture can get. By controlling the small parts, you control the whole.
The Bonus: They also found a way to make this even better by weighting the penalty. Some brushstrokes are more important to the final picture than others. By paying extra attention to the "important" strokes, the robot's drawing became more accurate: in a 10-seed comparison, the average test error dropped by a factor of 2.2.

The Big Win: Stability and Simplicity

Before this, if a robot got too complex (overparameterized), it would become unstable and crash. To fix this, scientists had to use a complicated, multi-step training process: start with a coarse grid, train, then switch to a finer grid and train again. It was like building a small house first and then rebuilding it bigger, stage by stage.

With this new "smoothness penalty," the robot can handle complex, high-resolution grids right from the start. It stays stable without needing the complicated multi-step process.

Summary

The Issue: AI models (KANs) that are supposed to be interpretable often draw jagged, messy lines that are hard to understand.
The Old Way: Tried to stop this by limiting the "size" of the lines, which didn't work.
The New Way: Introduced a penalty that charges for "bending" or "wiggling." This forces the AI to draw smooth, clean lines.
The Result: The AI remains about as accurate, but the results are smooth, stable, and much easier for humans to interpret. It turns a jagged scribble into a clear, readable sketch.

Technical Summary: KANs Need Curvature: Penalties for Compositional Smoothness

Problem Statement
Kolmogorov–Arnold networks (KANs) offer a compelling alternative to traditional neural networks by replacing fixed nonlinearities with learnable univariate activation functions on edges, promising both high accuracy and interpretability. However, a critical flaw limits their practical utility in scientific machine learning: well-fitting KANs frequently develop "pathologically high-curvature oscillations" in their activation functions. While these models fit data accurately, the resulting "kink-like" oscillations render the learned functions unreadable and difficult to interpret. The authors argue that standard regularization penalties used in KANs (specifically the magnitude and entropy penalties proposed by Liu et al.) are structurally incapable of preventing this. These standard penalties depend only on the average magnitude of activations, carrying no derivative information; thus, a wildly oscillating function incurs the same penalty as a smooth one if their average magnitudes are identical.
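The blind spot described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the magnitude penalty behaves like a mean absolute value over uniformly spaced inputs, and it uses two toy activations (`smooth` and `wiggly`, both hypothetical) whose average magnitudes agree while their bending energies differ by several orders of magnitude.

```python
import numpy as np

# Uniform grid on [-1, 1]; stands in for the inputs seen by one edge (assumption).
z = np.linspace(-1.0, 1.0, 2001)
dz = z[1] - z[0]

# Two toy activations with the same amplitude but very different frequencies.
smooth = np.sin(np.pi * z)
wiggly = np.sin(25 * np.pi * z)


def mean_magnitude(phi):
    """Average absolute value: a stand-in for a magnitude-only penalty."""
    return float(np.mean(np.abs(phi)))


def bending_energy(phi, dz):
    """Finite-difference estimate of the integral of phi''(z)^2 over the grid."""
    second = np.diff(phi, n=2) / dz**2
    return float(np.sum(second**2) * dz)


print("mean |phi|     :", mean_magnitude(smooth), mean_magnitude(wiggly))
print("bending energy :", bending_energy(smooth, dz), bending_energy(wiggly, dz))
```

Both activations average close to $2/\pi$ in magnitude, so a magnitude-only penalty treats them as nearly equal, while the bending energy grows with the fourth power of the frequency.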

Methodology
To address the lack of smoothness, the authors propose a basis-agnostic curvature penalty derived from the theory of penalized splines (P-splines).

Derivation of the Edge-Wise Penalty:
The authors define the curvature of a univariate activation function $\phi_e$ as its $L_2$ bending energy, $\int (\phi_e''(z))^2 dz$. By substituting the KAN activation form (a linear combination of a base function, typically SiLU, and B-splines), they derive a closed-form penalty operating directly on the model coefficients:
$R(f) = \sum_{e} \left( \|D_2(\beta_e c_e)\|^2 + K_{\text{silu}} \alpha_e^2 \right)$
Here, $D_2$ is the second-difference matrix acting on the spline coefficients $c_e$, $\beta_e$ scales the spline, and $\alpha_e$ scales the base function. The term $K_{\text{silu}}$ is a constant derived from the second derivative of the SiLU function. This penalty is applied edge-wise and is independent of the training data distribution.
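As a rough illustration of the formula above, the sketch below evaluates the penalty for a list of edges with NumPy. Several details are assumptions rather than the paper's specification: $K_{\text{silu}}$ is taken to be the integral of the squared second derivative of SiLU over a wide interval, any knot-spacing normalization of the difference term is omitted, and the function names are hypothetical.

```python
import numpy as np


def second_difference_matrix(n):
    """Return the (n-2) x n matrix D2 whose rows are [1, -2, 1] stencils."""
    return np.diff(np.eye(n), n=2, axis=0)


def silu_second_derivative(z):
    """Second derivative of silu(z) = z * sigmoid(z)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s) * (2.0 + z * (1.0 - 2.0 * s))


def k_silu(lo=-30.0, hi=30.0, num=60001):
    """Trapezoid estimate of the integral of silu''(z)^2 (assumed definition)."""
    z = np.linspace(lo, hi, num)
    vals = silu_second_derivative(z) ** 2
    dz = z[1] - z[0]
    return float(dz * (vals.sum() - 0.5 * (vals[0] + vals[-1])))


def edge_curvature_penalty(c, beta, alpha, k):
    """||D2 (beta * c)||^2 + k * alpha^2 for a single edge."""
    c = np.asarray(c, dtype=float)
    d2 = second_difference_matrix(len(c))
    return float(np.sum((d2 @ (beta * c)) ** 2) + k * alpha**2)


def total_penalty(edges, k):
    """Sum the per-edge penalty over (c, beta, alpha) triples."""
    return sum(edge_curvature_penalty(c, b, a, k) for c, b, a in edges)


k = k_silu()
edges = [
    (np.arange(6.0), 1.0, 0.0),                        # linear coefficients
    (np.array([1.0, -1.0, 1.0, -1.0, 1.0]), 0.5, 0.0),  # zig-zag coefficients
]
print(total_penalty(edges, k))
```

Coefficients that lie on a straight line have zero second differences and incur no spline penalty, while zig-zag coefficients are charged heavily, which matches the intent of penalizing kinks rather than magnitude.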
Theoretical Analysis of Compositional Curvature:
Recognizing that edge-wise smoothness does not automatically guarantee the smoothness of the full composed function, the authors perform a compositional analysis. They derive the Hessian of the full network function using the chain rule, leveraging the specific structure of KANs where layer Hessians are diagonal (due to univariate edges).
They prove Theorem 1, which establishes that the proposed edge-wise penalty $R(f)$ serves as a rigorous upper bound on the true composition-level curvature $\mathcal{R}(f)$ (defined as the expected squared Frobenius norm of the input Hessian). This proof relies on three structural assumptions regarding path weights, activation density, and knot spacing, showing that minimizing the edge-wise penalty effectively minimizes a bound on the global curvature.
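To make the diagonal-Hessian structure concrete, here is the chain-rule computation for a two-layer KAN with scalar output. The notation ($\psi_j$ for second-layer edges, $\phi_{j,i}$ for first-layer edges, $s_j$ for hidden pre-activations) is chosen for this illustration and may differ from the paper's.

```latex
% Two-layer KAN with scalar output (illustrative notation):
%   f(x) = \sum_j \psi_j(s_j), \qquad s_j = \sum_i \phi_{j,i}(x_i)
\begin{aligned}
\frac{\partial f}{\partial x_i}
  &= \sum_j \psi_j'(s_j)\,\phi_{j,i}'(x_i), \\
\frac{\partial^2 f}{\partial x_i\,\partial x_k}
  &= \sum_j \Big[\, \psi_j''(s_j)\,\phi_{j,i}'(x_i)\,\phi_{j,k}'(x_k)
     \;+\; \delta_{ik}\,\psi_j'(s_j)\,\phi_{j,i}''(x_i) \Big].
\end{aligned}
```

Because each first-layer edge depends on a single input, its second derivative appears only on the diagonal ($\delta_{ik}$), weighted by a first derivative from the layer above. Every term in the Hessian is therefore one edge's curvature multiplied by first derivatives along a path, which is what allows per-edge curvature to bound the curvature of the composition.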
Weighted Extension:
The authors further propose a "richer" weighted penalty that incorporates the expected path weights ($\bar{w}_e$) derived from the chain rule decomposition. This variant scales the penalty for each edge by its expected impact on the global Hessian, though it reintroduces a dependency on the training data distribution.
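The weighted variant can be sketched as a weighted sum of per-edge penalties. This summary does not spell out how the path weights $\bar{w}_e$ are computed, so the sketch below takes them as given inputs; the function name and the example numbers are hypothetical.

```python
import numpy as np


def weighted_curvature_penalty(edge_penalties, path_weights):
    """Sum of per-edge curvature penalties, each scaled by its path weight.

    edge_penalties: per-edge values such as ||D2 (beta_e c_e)||^2 + K_silu alpha_e^2.
    path_weights:   expected path weights, assumed to be estimated elsewhere
                    from the training data (not computed here).
    """
    edge_penalties = np.asarray(edge_penalties, dtype=float)
    path_weights = np.asarray(path_weights, dtype=float)
    if edge_penalties.shape != path_weights.shape:
        raise ValueError("need exactly one path weight per edge")
    return float(np.sum(path_weights * edge_penalties))


# Hypothetical numbers: the second edge matters four times as much as the first.
penalties = [12.0, 3.0, 0.5]
uniform = weighted_curvature_penalty(penalties, [1.0, 1.0, 1.0])
weighted = weighted_curvature_penalty(penalties, [0.5, 2.0, 1.0])
print(uniform, weighted)
```

With all weights equal to one, this reduces to the uniform edge-wise penalty. Since the weights must be estimated from data, the weighted form gives up the data independence of the uniform version.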

Key Contributions

Structural Limitation of Existing Penalties: The paper demonstrates that the standard KAN penalty cannot enforce smoothness because it lacks derivative information, making it impossible to distinguish between smooth and oscillatory functions of equal magnitude.
Basis-Agnostic Curvature Penalty: The authors derive a closed-form, coefficient-based curvature penalty that can be applied to any fixed basis with square-integrable second derivatives (e.g., B-splines).
Theoretical Upper Bound: Through compositional analysis, the paper proves that the edge-wise penalty upper-bounds the curvature of the full network, providing a theoretical justification for using local penalties to control global smoothness.
Empirical Validation: The study shows that curvature-penalized KANs achieve substantially smoother activations while maintaining accuracy comparable to unpenalized or standard-penalized models across function approximation, the Feynman symbolic regression benchmark, and overparameterized regimes.

Results

Function Approximation: In experiments approximating functions like $f(x, y) = \sin(x + y^2)$ and $f(x, y) = \exp(\sin(\pi x) + y^2)$, curvature-penalized models produced activation functions that visually aligned with the true components (e.g., smooth sine and polynomial curves), whereas unpenalized models exhibited high-frequency oscillations.
Feynman Benchmark: On 14 equations from the Feynman symbolic regression benchmark, curvature-penalized KANs achieved the lowest total edge curvature in all 14 cases. In terms of accuracy (Test RMSE), they matched or outperformed the standard KAN penalty in 9 out of 14 equations, and were within a factor of two of the best accuracy in all cases.
Stability in Overparameterized Regimes: The curvature penalty significantly stabilized training for overparameterized KANs (high grid size $G$). Unlike the standard KAN penalty, which plateaued early, the curvature-penalized models continued to improve over 3000 epochs. Furthermore, the penalty enabled stable training with high-resolution grids ($G=200$) without the need for "grid extension" (a multi-stage training process starting with low $G$), achieving test RMSEs of $\sim 10^{-3}$ where unpenalized models failed catastrophically.
Optimizer Independence: The benefits of the curvature penalty were observed with both Adam and L-BFGS optimizers.
Weighted Penalty: A 10-seed comparison showed that the weighted curvature penalty (incorporating path weights) reduced the mean test RMSE by a factor of 2.2 compared to the uniform edge-wise penalty.

Significance and Claims
The paper claims that the curvature penalty provides a "single, principled smoothness lever" for KANs. Its significance lies in three areas:

Interpretability: By enforcing smooth activations, the penalty makes the internal representations of KANs readable and aligned with the scientific intuition that physical laws are typically smooth, thereby strengthening KANs as a tool for scientific machine learning.
Training Stability: It resolves the instability of training high-resolution KANs, allowing for single-stage, end-to-end optimization without the need for complex multi-stage grid extension protocols. This is crucial for integrating KANs into broader systems like neural architecture search or meta-learning.
Architectural Advantage: The analysis highlights that the diagonal structure of KAN Hessians (a result of univariate edges) is a unique structural advantage that allows for interpretable per-edge attribution of compositional curvature, a property not present in standard MLPs.

The authors conclude that smoothness is not merely an added feature but a controllable property inherent to the KAN architecture, and that managing this property via curvature penalties is essential for realizing the full potential of KANs in interpretable scientific discovery.
