Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

This paper critiques current tabular foundation model benchmarks for relying on point-estimate metrics like MSE. It advocates instead for proper scoring rules such as CRPS to evaluate probabilistic forecasts, and for fine-tuning or promptable strategies that align model inductive biases with distributional regression goals.

Jonas Landsgesell, Pascal Knoll

Published 2026-03-10

The Big Idea: Stop Guessing the "Average," Start Predicting the "Whole Picture"

Imagine you are trying to predict the weather for a picnic.

The Old Way (Point Estimates):
Most AI models today act like a weatherman who only gives you one number: "The average temperature will be 70°F."

  • The Problem: What if the real weather is actually a 50/50 split between a freezing 40°F morning and a scorching 100°F afternoon? The average of 70°F is mathematically "correct," but it's useless for planning. You'd pack a light jacket and get burned, or pack a heavy coat and freeze. The average hides the danger.

The New Way (Distributional Regression):
The paper argues that modern AI models (like TabPFN and TabICL) should stop just giving you a single number. Instead, they should give you a full forecast: "There's a 50% chance of 40°F and a 50% chance of 100°F."

  • This is called Distributional Regression. It's like showing you the whole weather map instead of just the thermometer.
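To make the contrast concrete, here is a minimal sketch in plain Python (illustrative numbers from the picnic analogy, not from the paper) of what a distributional forecast can answer that a point forecast cannot:

```python
# Point forecast: one number, no uncertainty attached.
point_forecast = 70.0

# Distributional forecast: samples from the predicted distribution
# (the 50/50 split between a 40°F morning and a 100°F afternoon).
dist_forecast = [40.0] * 500 + [100.0] * 500

# The distribution answers planning questions the single number hides:
p_scorching = sum(t >= 90 for t in dist_forecast) / len(dist_forecast)
p_freezing = sum(t <= 45 for t in dist_forecast) / len(dist_forecast)
mean_temp = sum(dist_forecast) / len(dist_forecast)

print(p_scorching, p_freezing, mean_temp)  # 0.5 0.5 70.0
```

Both forecasts have the same mean, but only the distributional one tells you to pack for both extremes.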

The Problem: How Do We Grade the AI?

The authors noticed a major flaw in how we test these new AI models.

Currently, when researchers build these models, they grade them like a math teacher grading a test. They ask: "How close was your single number to the real number?"

  • If the real answer was 100 and the AI guessed 95, they give it a high score.
  • If the AI guessed 105, they give it a high score.

The Flaw: This grading system forces the AI to become obsessed with finding the "middle ground" (the average). It teaches the AI to be a safe, boring guesser that ignores the crazy possibilities (like the freezing morning or the scorching afternoon).
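This pull toward the middle is not a quirk of lazy models: squared error is mathematically minimized by the mean. A quick sketch with the picnic numbers (illustrative only):

```python
# Outcomes: a 50/50 split between 40°F and 100°F.
outcomes = [40, 100]

def mse(guess):
    """Mean squared error of a single point prediction."""
    return sum((y - guess) ** 2 for y in outcomes) / len(outcomes)

# Scan integer candidate guesses; squared error bottoms out at the mean.
best_guess = min(range(0, 141), key=mse)
print(best_guess)  # 70 -- the "safe" average, far from both real outcomes
```

Any grading scheme built on squared error will push the model to 70°F, a temperature that never actually occurs.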

The Solution: "Proper Scoring Rules"

The paper suggests we need a better way to grade the AI, one that rewards it for telling the whole truth about the uncertainty. They call these Proper Scoring Rules.

Think of it like grading a dart player:

  • Old Grading (MSE/RMSE): You only care if the dart hits the bullseye. If the player throws wildly but lands near the center, they get a passing grade.
  • New Grading (CRPS): You care about the shape of the throws. Did the player understand the wind? If the player says, "I'm aiming for the bullseye, but there's a 20% chance the wind pushes it left," and they are right about that risk, they get a better grade.

The paper specifically champions a metric called CRPS (Continuous Ranked Probability Score).

  • Analogy: Imagine you are betting on a horse race.
    • Log Score (Cross-Entropy): You get paid based only on how much probability you placed on the exact winner. If you put almost nothing on Horse A and Horse A wins, you lose nearly everything, no matter how sensible the rest of your bets were. This punishes near-zero bets so brutally that the AI hedges everywhere, afraid to rule anything out.
    • CRPS: You get paid based on how well your whole list of possibilities matched reality. If you said "It's likely to be Horse A, maybe Horse B," and Horse A won, you get a good score. If you said "It could be any horse," and Horse A won, you get a lower score. CRPS rewards confidence that is backed by accuracy.
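For sample-based forecasts, CRPS has a standard empirical estimator, CRPS(F, y) = E|X − y| − ½·E|X − X′|, where X and X′ are independent draws from the forecast. A small pure-Python sketch (picnic numbers, not from the paper) showing the honest bimodal forecast beating the point forecast at the mean:

```python
def crps_ensemble(samples, y):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'| over forecast samples."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2

observed = 40.0  # the freezing morning actually happened
honest = crps_ensemble([40.0, 100.0], observed)  # bimodal forecast
point = crps_ensemble([70.0], observed)          # single guess at the mean
print(honest, point)  # 15.0 30.0 -- lower is better
```

Under MSE the two forecasts are indistinguishable (both center on 70°F); under CRPS the one that admits both modes scores twice as well.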

The Twist: The "Right" Answer Depends on the Goal

Here is the most fascinating part of the paper. The authors show that there is no single "best" way to predict. The "best" prediction depends on what you are trying to do.

The Analogy: The Car Accident
Imagine you are an insurance company.

  • Scenario A: You want to know the average cost of a car accident to set your monthly budget. You want the Mean (Average).
  • Scenario B: You are worried about a catastrophic crash that could bankrupt you. You care about the Tail (the worst-case scenario).

The paper shows that if you train an AI to minimize "Average Error," it becomes great at Scenario A but terrible at Scenario B. If you train it to minimize "Tail Risk," it becomes great at Scenario B but less accurate on the average.

Key Takeaway: You cannot just download a "perfect" AI model. You have to fine-tune (adjust) the model based on your specific goal.

  • If you are a bank worried about losing money, you need an AI trained to fear the worst-case scenario.
  • If you are a logistics company just trying to guess average delivery times, you need an AI trained to hit the middle.
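To illustrate this goal-dependence, here is a sketch on synthetic "claim sizes" (exponentially distributed, a made-up stand-in for the insurance analogy), using the pinball (quantile) loss as a tail-focused scoring rule:

```python
import random

random.seed(0)
# Synthetic insurance claims: mostly small, occasionally catastrophic.
claims = [random.expovariate(1 / 10_000) for _ in range(10_000)]

def pinball(pred, ys, tau):
    """Pinball (quantile) loss; minimized by the tau-quantile of ys."""
    return sum(max(tau * (y - pred), (tau - 1) * (y - pred)) for y in ys) / len(ys)

mean_pred = sum(claims) / len(claims)                # optimal under squared error
tail_pred = sorted(claims)[int(0.95 * len(claims))]  # ~optimal under pinball, tau=0.95

# The "best" model for the monthly budget and the "best" model for
# solvency disagree by roughly a factor of three on the same data.
print(round(mean_pred), round(tail_pred))
```

Neither prediction is wrong; each is optimal for its own scoring rule, which is exactly why the evaluation metric must match the downstream goal.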

What Did They Actually Do?

  1. They tested it: They took existing powerful AI models (TabPFN and TabICL) and fine-tuned them using these new Proper Scoring Rules (like CRPS) instead of the old average-error rules.
  2. The Result: The re-trained models were much better at predicting the shape of the data. They didn't just guess a number; they gave a much more honest picture of the risks and possibilities.
  3. The Comparison: They found that a newer model called TabICL was generally better at this "probabilistic" thinking than the older TabPFN, but both improved significantly when they used the new scoring rules.

Summary for the Everyday Person

  • Don't settle for the average. In a complex world, the "average" is often a lie that hides the risks.
  • Change the grade. We need to stop testing AI only on how close it is to the middle. We need to test it on how well it understands the whole range of possibilities.
  • One size does not fit all. An AI isn't "smart" in a vacuum. It is smart for a specific job. If you are worried about rare disasters, you must tell the AI to prioritize those rare events, or it will ignore them.

The paper is a call to action: Stop building AI that just guesses the middle. Start building AI that understands the full story, including the scary parts.