Imagine you are building a massive, multi-story skyscraper (a Deep Neural Network) to solve complex problems like recognizing cats in photos or predicting the next word in a sentence.
In traditional construction, every floor uses the exact same type of "door" to let information pass through. In AI, these doors are called activation functions. For years, the industry standard has been a door called ReLU (or its fancy cousin, GELU). It's a simple, reliable door that opens only if the signal is positive. It works well, but it's a bit rigid.
This paper proposes a radical new idea: What if we replace those standard doors with a toolbox of mathematical "super-doors" based on polynomials, waves, and tropical geometry?
Here is the breakdown of their discovery, explained through everyday analogies.
1. The Problem: The "Exploding Elevator"
In a deep network, information travels from the ground floor to the top. If the "doors" on each floor are too aggressive, the signal gets amplified until it explodes (like a microphone screeching). If they are too weak, the signal dies out before reaching the top.
For a long time, people thought polynomial doors (doors based on formulas that square or cube the signal, like x² or x³) were dangerous because they tend to make signals explode. To fix this, engineers usually had to add "clamps" or safety valves (like ReLU) to keep things stable.
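The "exploding elevator" is easy to see numerically. Here is a minimal sketch (plain NumPy, not the paper's code): feed a unit-variance signal through a few floors whose "door" simply squares whatever comes in, and watch the volume blow up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # unit-variance signal entering the ground floor
print(f"floor 0: variance = {x.var():.2f}")

for floor in range(1, 4):
    x = x ** 2  # a naive polynomial "door" with no safety clamp
    print(f"floor {floor}: variance = {x.var():.2f}")
```

After just three floors the variance is in the thousands, which is exactly the screeching-microphone failure the paper sets out to prevent.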
The Paper's Insight: The authors realized the problem wasn't the type of door, but how they were installed. They found a way to install these complex polynomial doors so that the signal stays perfectly balanced, floor to floor, without needing safety clamps.
2. The Solution: The "Orthonormal" Blueprint
The authors introduce three new types of doors, all based on special mathematical families called orthogonal bases. Think of these as perfectly tuned musical instruments that don't interfere with each other.
Hermite Activations (The "Gaussian" Door):
- The Analogy: Imagine a bell curve (the shape of a normal distribution). Hermite polynomials are like a set of perfectly tuned springs that vibrate exactly in sync with this bell curve.
- Why it works: Because they are "orthogonal" (independent), the math for calculating how much the signal grows is incredibly simple. The authors found a specific "recipe" for the coefficients (the settings on the door) that guarantees the signal stays the same size as it passes through.
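The "independence" of these doors can be checked in a few lines. A sketch, assuming a standard-normal input (the function name and the Monte Carlo check are mine, not the paper's): the probabilists' Hermite polynomials satisfy E[Heₘ(x)·Heₙ(x)] = n! when m = n and 0 otherwise, which is why the variance bookkeeping becomes so simple.

```python
import numpy as np

def hermite_terms(x, degree):
    """Probabilists' Hermite polynomials He_0..He_degree via the recurrence
    He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x)."""
    terms = [np.ones_like(x), x]
    for n in range(1, degree):
        terms.append(x * terms[-1] - n * terms[-2])
    return terms[: degree + 1]

# Monte Carlo check of orthogonality under a standard-normal input:
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
he = hermite_terms(x, 3)
print(np.mean(he[1] * he[2]))  # cross term: close to 0 (the doors don't interfere)
print(np.mean(he[2] * he[2]))  # He_2 with itself: close to 2! = 2
```

Because cross terms vanish, the output variance of any weighted mix of these polynomials is just a sum of per-term contributions, one per coefficient.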
Fourier Activations (The "Wave" Door):
- The Analogy: Think of sound waves or ocean tides. These doors use sine and cosine waves.
- Why it works: If your data is spread out evenly (like a uniform distribution), these waves are the perfect fit. They oscillate in a way that preserves the energy of the signal, preventing it from dying out or exploding.
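The same energy-preservation idea can be sketched for the wave door, assuming the input is uniform on [−π, π] (the coefficient values below are illustrative, not the paper's): each sine and cosine term carries variance 1/2 and is uncorrelated with the others, so setting the sum of squared coefficients to 2 keeps the output at unit variance.

```python
import numpy as np

def fourier_activation(x, sin_coeffs, cos_coeffs):
    # A small Fourier-series "door": sum_k a_k*sin(k*x) + b_k*cos(k*x).
    out = np.zeros_like(x)
    for k, (a, b) in enumerate(zip(sin_coeffs, cos_coeffs), start=1):
        out += a * np.sin(k * x) + b * np.cos(k * x)
    return out

# For x ~ Uniform(-pi, pi), each sin(kx) and cos(kx) has mean 0, variance 1/2,
# and they are mutually uncorrelated, so Var(output) = (sum of squares) / 2.
a = [1.0, 0.6]                  # sine coefficients (made-up illustrative values)
b = [0.4, np.sqrt(0.48)]        # chosen so (1 + 0.36 + 0.16 + 0.48) / 2 = 1

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, 1_000_000)
y = fourier_activation(x, a, b)
print(y.var())  # close to 1: the signal's "volume" is preserved
```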
Tropical Activations (The "Lego" Door):
- The Analogy: This is the most exotic one. Instead of smooth curves, imagine a roof made of flat, straight Lego blocks. In "Tropical Math," addition becomes "taking the maximum," and multiplication becomes "adding."
- Why it works: This creates a shape that looks like a series of connected ramps. It's a very efficient, "piecewise linear" way to process data, similar to how ReLU works but with more flexibility. It's essentially a "max-plus" version of a polynomial.
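The "Lego roof" is short enough to build directly. A sketch of the max-plus idea (names and coefficient values are mine): translating the polynomial c₀ + c₁x + c₂x² term by term into tropical arithmetic gives max(c₀, c₁ + x, c₂ + 2x), a convex roof of straight ramps.

```python
import numpy as np

def tropical_activation(x, coeffs):
    # In the max-plus ("tropical") semiring, multiplication becomes ordinary
    # addition and addition becomes max, so the polynomial
    # c_0 + c_1*x + c_2*x^2 + ... turns into max_n (c_n + n*x).
    terms = np.stack([c + n * x for n, c in enumerate(coeffs)])
    return terms.max(axis=0)

x = np.linspace(-2.0, 2.0, 5)
# With coefficients [0, 0] this is max(0, x) -- exactly ReLU.
print(tropical_activation(x, [0.0, 0.0]))
# An extra coefficient adds another ramp (a more flexible piecewise-linear door).
print(tropical_activation(x, [0.0, 0.0, -1.0]))
```

Each new coefficient adds one more ramp to the roof, which is where the extra flexibility over plain ReLU comes from.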
3. The Magic Trick: "Variance-Preserving Initialization"
The authors didn't just invent new doors; they invented a new installation manual.
Usually, when you use complex math functions, you have to guess the starting settings. If you guess wrong, the building collapses. The authors derived a mathematical formula (based on the properties of these specific bases) that tells you exactly how to set the knobs on the door before you even start training.
- The Result: When you use their "magic recipe," the signal entering the door has the same "volume" (variance) as the signal leaving it. This allows you to build incredibly deep networks (like GPT-2 for language or ConvNeXt for images) without the signal getting lost or screaming.
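The "magic recipe" can be sketched for the Hermite case. This is my illustration of the idea, assuming a standard-normal input, using the orthogonality identity Var = Σₙ cₙ²·n!; the paper's actual formula may differ in its details.

```python
import math
import numpy as np

def variance_preserving_scale(coeffs):
    """Rescale Hermite coefficients c_1..c_K so that sum_n c_n^2 * n! = 1,
    which makes the output variance 1 for x ~ N(0, 1).
    (A sketch of the recipe, not the paper's exact procedure.)"""
    c = np.asarray(coeffs, dtype=float)
    norm = math.sqrt(sum(cn * cn * math.factorial(n)
                         for n, cn in enumerate(c, start=1)))
    return c / norm

c1, c2, c3 = variance_preserving_scale([1.0, 0.5, 0.2])  # arbitrary starting knobs

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)                    # unit-variance signal going in
y = c1 * x + c2 * (x**2 - 1) + c3 * (x**3 - 3 * x)    # He_1, He_2, He_3 terms
print(y.var())  # close to 1: same "volume" coming out
```

The point of the recipe is that this normalization is computed once, in closed form, before training ever starts; there is no guessing and no safety clamp.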
4. Real-World Proof: It Actually Works
The authors didn't just do math on a whiteboard; they built the skyscrapers.
- Image Classification: They swapped the standard doors in a state-of-the-art image model (ConvNeXt) with their new polynomial doors. On the massive ImageNet dataset (millions of cat/dog/car photos), these new doors performed better than the standard ones.
- Language Modeling: They did the same for a language model (GPT-2) predicting the next word. Again, the new doors improved performance.
The "Fine-Tuning" Bonus:
One of the coolest features is that these new doors can be "shaped" to mimic the old standard doors (like GELU) exactly. Imagine a building already fitted with GELU doors. You can swap in the new polynomial doors, molded so that at first they behave exactly like the old ones. This lets you upgrade an existing AI model without breaking anything, and then lets the new doors learn to be even better.
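One way to "mold" a polynomial door into a GELU shape is a least-squares fit in the Hermite basis, which NumPy ships out of the box (`numpy.polynomial.hermite_e`). A sketch, using the standard tanh approximation of GELU; the paper's exact initialization procedure may differ.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

def gelu(x):
    # The common tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# Fit degree-8 probabilists'-Hermite coefficients so the polynomial door
# starts out looking like the GELU door it replaces.
grid = np.linspace(-3.0, 3.0, 401)
coeffs = He.hermefit(grid, gelu(grid), deg=8)
fitted = He.hermeval(grid, coeffs)
print(np.abs(fitted - gelu(grid)).max())  # small: the two doors agree at the start
```

Training then starts from these fitted coefficients instead of random ones, so the upgraded network initially behaves exactly like the one it replaced.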
5. Why This Matters
- Simplicity: It proves that you don't need complex "safety clamps" (like ReLU) to train deep networks. You just need the right mathematical foundation.
- Interpretability: The authors show that a network with polynomial doors is essentially a giant, multi-variable polynomial equation. This makes the "black box" of AI slightly more transparent, turning it into a math problem we can actually analyze.
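The "giant polynomial" view can be made literal for a toy one-dimensional network: compose the layers symbolically and you get explicit coefficients you can read off and analyze. The weights below are made-up illustrative numbers, and the symbolic algebra uses NumPy's `Polynomial` class rather than anything from the paper.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Toy 1-D "network": two layers with a quadratic activation act(z) = z^2 - 1.
w1, b1, w2, b2 = 0.7, 0.1, -1.2, 0.4  # arbitrary illustrative weights/biases

x = P([0.0, 1.0])              # the input variable, as a symbolic polynomial
a1 = (w1 * x + b1) ** 2 - 1    # layer 1: affine map, then the quadratic door
out = (w2 * a1 + b2) ** 2 - 1  # layer 2: same again

print(out)           # the whole network, written out as one polynomial in x
print(out.degree())  # 4

# It agrees with running the network numerically:
def forward(t):
    h = (w1 * t + b1) ** 2 - 1
    return (w2 * h + b2) ** 2 - 1

ts = np.linspace(-2.0, 2.0, 9)
print(np.allclose(out(ts), forward(ts)))  # True
```

A real network does this in many variables at once, but the principle is the same: the whole model collapses into one explicit polynomial map instead of an opaque black box.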
- Efficiency: While some of these doors are slightly slower to compute than ReLU, they are surprisingly fast on modern hardware, and the performance gains often outweigh the cost.
Summary
This paper is like discovering that you don't need to use standard, rigid bricks to build a skyscraper. You can use flexible, mathematically perfect springs (Hermite), waves (Fourier), or interlocking ramps (Tropical). As long as you follow the new installation manual (variance-preserving initialization), these flexible materials build taller, stronger, and smarter structures than the old rigid bricks ever could.
They have effectively opened the door to a new era where neural networks are viewed not just as black boxes, but as elegant, learnable polynomial mappings.