Imagine you are trying to predict the future based on a spreadsheet of data. Maybe you're guessing house prices, predicting how much electricity a factory will use, or estimating how happy a customer will be with a product.
For a long time, the "gold standard" tool for this job has been Tree Ensembles (like Random Forests or XGBoost). Think of these as a team of very strict, rule-following detectives. They look at your data and slice it up like a pie: "If the price is over $500k, go left. If the bedroom count is 3, go right." They are incredibly good at finding patterns and winning competitions, but they are a bit rigid. Their predictions jump around like a staircase; a tiny change in input can cause a sudden, big jump in the output.
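The "strict detective" behaviour can be sketched as a hand-written toy rule set (the splits and prices below are made up for illustration, not learned from data or taken from the paper):

```python
# A toy rule set in the style of a single decision tree: the prediction
# is flat within each region and jumps at the split boundaries.
def tree_predict(sqft):
    if sqft < 1000:
        return 300_000
    elif sqft < 2000:
        return 500_000
    else:
        return 700_000

# One square foot of difference crosses a split and jumps $200k:
print(tree_predict(1999), tree_predict(2000))
```

Real ensembles average many such trees, which softens but never removes the jumps: the prediction surface stays piecewise constant.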
This paper asks a simple question: What if we tried a different kind of detective?
The authors tested two older, "smoother" mathematical tools: Chebyshev Polynomials and Radial Basis Functions (RBFs).
- The Analogy: If Tree Ensembles are a staircase, these smooth models are a ramp. They don't just jump from one level to another; they glide. If you nudge the input slightly, the prediction nudges slightly.
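To make the "ramp" concrete, here is a minimal NumPy sketch that fits both kinds of smooth model to toy 1-D data: a Chebyshev series via `numpy.polynomial.chebyshev.Chebyshev.fit`, and a small hand-rolled Gaussian-RBF ridge regression. The centre count, kernel width `gamma`, and ridge strength are illustrative choices, not the paper's settings:

```python
import numpy as np

# Toy 1-D regression data: a smooth trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)

# Chebyshev model: least-squares fit of a degree-8 Chebyshev series.
cheb = np.polynomial.chebyshev.Chebyshev.fit(x, y, deg=8)

# RBF model: Gaussian features at 20 evenly spaced centres, solved
# with a tiny ridge penalty for numerical stability.
centres = np.linspace(-1, 1, 20)
gamma = 10.0  # kernel width; a hyperparameter you would normally tune
Phi = np.exp(-gamma * (x[:, None] - centres[None, :]) ** 2)
weights = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(20), Phi.T @ y)

def rbf_predict(xq):
    xq = np.asarray(xq, dtype=float)
    return np.exp(-gamma * (xq[:, None] - centres[None, :]) ** 2) @ weights

# Both predictors vary continuously: nudge the input slightly
# and the prediction nudges slightly.
print(cheb(0.500), cheb(0.501))
print(rbf_predict([0.500])[0], rbf_predict([0.501])[0])
```

Contrast this with a tree ensemble, where the two nearby inputs could land on different sides of a split and return very different values.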
The Big Experiment
The researchers didn't just look at one dataset; they tested these "smooth detectives" against the "staircase detectives" on 55 different real-world problems, ranging from physics simulations to economic pricing. They also threw in a super-smart AI (a Transformer) and some basic baselines to see who came out on top.
Here is what they found, broken down into simple concepts:
1. The Accuracy Race (Who is the smartest?)
- The Winner: A pre-trained AI (TabPFN) won the most often, but it's like a supercomputer that needs a massive GPU (graphics card) to run. It's too heavy and expensive for many everyday businesses.
- The CPU Race: When we look at models that can run on a standard computer (no super-GPU needed), the results were a dead heat. The "smooth" models (Chebyshev and RBF) were just as accurate as the famous "staircase" models (Tree Ensembles). They are equally smart.
2. The "Overfitting" Problem (Who learns the lesson vs. memorizes the answers?)
This is the paper's biggest discovery.
- The Staircase Models (Trees): They are great at memorizing the specific training data. But because they are so rigid, a tiny change in input can flip them across a split boundary: they might predict a price of $500k for a 1,500-square-foot house but $600k for a 1,501-square-foot house. This is overfitting: they are too sensitive to the specific details of the training set.
- The Smooth Models: Because they glide instead of jump, they are more stable. They didn't memorize the noise; they learned the underlying trend.
- The Result: At matched accuracy, the smooth models made fewer mistakes on unseen data. They had a tighter "generalisation gap" (the difference between how well a model does on its training data and on new data). In 87% of the cases where the models were equally accurate, the smooth models were the more reliable ones.
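The gap can be seen in miniature with a toy experiment (a sketch, not the paper's benchmark). Here 1-nearest-neighbour stands in for a very deep tree, since both memorize the training set and produce flat, jumpy prediction surfaces, while a low-degree Chebyshev fit plays the smooth model:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from a smooth underlying trend."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = make_data(100)
x_test, y_test = make_data(100)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# Piecewise-constant "memorizer": each query copies the label of the
# closest training point, so training error is essentially zero.
def nn_predict(xq):
    idx = np.abs(xq[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

# Smooth model: low-degree Chebyshev fit that ignores the noise.
cheb = np.polynomial.chebyshev.Chebyshev.fit(x_train, y_train, deg=6)

# Generalisation gap = test error minus training error.
gap_memorizer = mse(nn_predict(x_test), y_test) - mse(nn_predict(x_train), y_train)
gap_smooth = mse(cheb(x_test), y_test) - mse(cheb(x_train), y_train)
print(f"memorizer gap: {gap_memorizer:.3f}, smooth gap: {gap_smooth:.3f}")
```

The memorizer scores perfectly on its own training points but its gap to unseen data is much wider than the smooth model's, which is the paper's finding in toy form.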
3. The Cost and Speed
- Training: The smooth models were generally faster and cheaper to train than the complex tree models.
- Predicting: Once trained, the smooth models were incredibly fast at making predictions, often faster than the tree models.
- The Catch: One of the smooth models (the RBF) took a bit longer to "tune" (set up), but once it was ready, it was a speed demon.
Why Does "Smoothness" Matter?
You might ask, "If they are equally accurate, why do I care if the line is smooth or jagged?"
The authors give two great reasons:
- Real-World Logic: In the real world, things rarely jump instantly. If you increase your speed by 1 mph, your fuel consumption doesn't suddenly double; it changes gradually. Smooth models respect this physics.
- Optimization: If you are using a computer to design a new airplane wing or a chemical formula, you need to nudge the variables to find the perfect spot. If your model is a staircase, the computer gets stuck on the steps and can't find the peak. If your model is a smooth ramp, the computer can glide right to the best solution.
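The "stuck on the steps" failure takes only a few lines to reproduce. This toy hill-climber nudges the input by a small step and moves in whichever direction improves the model's prediction; the staircase surrogate is just a rounded version of the smooth one, standing in for a tree ensemble's flat prediction surface (all names here are illustrative, not from the paper):

```python
import numpy as np

# Two surrogates of the same objective, which peaks at x = 0.3.
def smooth_model(x):
    return -(x - 0.3) ** 2

def staircase_model(x):
    # Piecewise-constant version: flat on steps of width 0.1,
    # mimicking a tree ensemble's prediction surface.
    return smooth_model(np.round(x, 1))

def hill_climb(model, x0, step=0.01, iters=200):
    x = x0
    for _ in range(iters):
        # Finite-difference "nudge": move only if a small step
        # improves the model's predicted value.
        if model(x + step) > model(x):
            x += step
        elif model(x - step) > model(x):
            x -= step
        else:
            break  # stuck: neither nudge changes the prediction
    return x

print(hill_climb(smooth_model, 0.0))     # glides close to the peak at 0.3
print(hill_climb(staircase_model, 0.0))  # stalls: both nudges land on the same flat step
```

On the smooth surrogate every nudge gives usable feedback, so the climber reaches the peak; on the staircase a small nudge usually stays on the same flat step, the feedback is zero, and the search stops immediately.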
The Bottom Line
The paper concludes that we shouldn't just default to "Tree Ensembles" for every problem.
- Use Trees if your data has hard, sudden rules (like tax brackets or "if/then" business logic).
- Use Smooth Models if your data represents physical processes, human behavior, or anything that changes gradually.
The Takeaway: The authors are telling data scientists: "Don't just reach for the hammer (Trees) because it's the most popular tool. Sometimes, a screwdriver (Smooth Models) does the exact same job, but with a smoother finish and better reliability." They recommend always keeping these smooth models in your toolbox, just in case.