Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

This paper critiques current tabular foundation model benchmarks for relying on point-estimate metrics like MSE. It advocates instead for proper scoring rules such as CRPS to evaluate probabilistic forecasts, and for fine-tuning or promptable strategies that align model inductive biases with distributional regression goals.

Jonas Landsgesell, Pascal Knoll

Published 2026-03-10

The Big Idea: Stop Guessing the "Average," Start Predicting the "Whole Picture"

Imagine you are trying to predict the weather for a picnic.

The Old Way (Point Estimates):
Most AI models today act like a weatherman who only gives you one number: "The average temperature will be 70°F."

  • The Problem: What if the real weather is actually a 50/50 split between a freezing 40°F morning and a scorching 100°F afternoon? The average of 70°F is mathematically "correct," but it's useless for planning. You'd pack a light jacket and get burned, or pack a heavy coat and freeze. The average hides the danger.

The New Way (Distributional Regression):
The paper argues that modern AI models (like TabPFN and TabICL) should stop just giving you a single number. Instead, they should give you a full forecast: "There's a 50% chance of 40°F and a 50% chance of 100°F."

  • This is called Distributional Regression. It's like showing you the whole weather map instead of just the thermometer.
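To make the contrast concrete, here is a minimal sketch in plain Python (illustrative numbers from the picnic analogy, not from the paper) of what a distributional forecast can answer that a point forecast cannot:

```python
# Point forecast: one number, no uncertainty attached.
point_forecast = 70.0

# Distributional forecast: samples from the predicted distribution
# (the 50/50 split between a 40°F morning and a 100°F afternoon).
dist_forecast = [40.0] * 500 + [100.0] * 500

# The distribution answers planning questions the single number hides:
p_scorching = sum(t >= 90 for t in dist_forecast) / len(dist_forecast)
p_freezing = sum(t <= 45 for t in dist_forecast) / len(dist_forecast)
mean_temp = sum(dist_forecast) / len(dist_forecast)

print(p_scorching, p_freezing, mean_temp)  # 0.5 0.5 70.0
```

Both forecasts have the same mean, but only the distributional one tells you to pack for both extremes.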

The Problem: How Do We Grade the AI?

The authors noticed a major flaw in how we test these new AI models.

Currently, when researchers build these models, they grade them like a math teacher grading a test. They ask: "How close was your single number to the real number?"

  • If the real answer was 100 and the AI guessed 95, they give it a high score.
  • If the AI guessed 105, they give it a high score.

The Flaw: This grading system forces the AI to become obsessed with finding the "middle ground" (the average). It teaches the AI to be a safe, boring guesser that ignores the crazy possibilities (like the freezing morning or the scorching afternoon).
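This pull toward the middle is not a quirk of lazy models: squared error is mathematically minimized by the mean. A quick sketch with the picnic numbers (illustrative only):

```python
# Outcomes: a 50/50 split between 40°F and 100°F.
outcomes = [40, 100]

def mse(guess):
    """Mean squared error of a single point prediction."""
    return sum((y - guess) ** 2 for y in outcomes) / len(outcomes)

# Scan integer candidate guesses; squared error bottoms out at the mean.
best_guess = min(range(0, 141), key=mse)
print(best_guess)  # 70 -- the "safe" average, far from both real outcomes
```

Any grading scheme built on squared error will push the model to 70°F, a temperature that never actually occurs.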

The Solution: "Proper Scoring Rules"

The paper suggests we need a better way to grade the AI, one that rewards it for telling the whole truth about the uncertainty. They call these Proper Scoring Rules.

Think of it like grading a dart player:

  • Old Grading (MSE/RMSE): You only care if the dart hits the bullseye. If the player throws wildly but lands near the center, they get a passing grade.
  • New Grading (CRPS): You care about the shape of the throws. Did the player understand the wind? If the player says, "I'm aiming for the bullseye, but there's a 20% chance the wind pushes it left," and they are right about that risk, they get a better grade.

The paper specifically champions a metric called CRPS (Continuous Ranked Probability Score).

  • Analogy: Imagine you are betting on a horse race.
    • Log Score (Cross-Entropy): You get paid based only on how much probability you placed on the exact winner. If you put almost nothing on Horse A and Horse A wins, you lose nearly everything, no matter how sensible the rest of your bets were. This punishes near-zero bets so brutally that the AI hedges everywhere, afraid to rule anything out.
    • CRPS: You get paid based on how well your whole list of possibilities matched reality. If you said "It's likely to be Horse A, maybe Horse B," and Horse A won, you get a good score. If you said "It could be any horse," and Horse A won, you get a lower score. CRPS rewards confidence that is backed by accuracy.
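For sample-based forecasts, CRPS has a standard empirical estimator, CRPS(F, y) = E|X − y| − ½·E|X − X′|, where X and X′ are independent draws from the forecast. A small pure-Python sketch (picnic numbers, not from the paper) showing the honest bimodal forecast beating the point forecast at the mean:

```python
def crps_ensemble(samples, y):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'| over forecast samples."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2

observed = 40.0  # the freezing morning actually happened
honest = crps_ensemble([40.0, 100.0], observed)  # bimodal forecast
point = crps_ensemble([70.0], observed)          # single guess at the mean
print(honest, point)  # 15.0 30.0 -- lower is better
```

Under MSE the two forecasts are indistinguishable (both center on 70°F); under CRPS the one that admits both modes scores twice as well.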

The Twist: The "Right" Answer Depends on the Goal

Here is the most fascinating part of the paper. The authors show that there is no single "best" way to predict. The "best" prediction depends on what you are trying to do.

The Analogy: The Car Accident
Imagine you are an insurance company.

  • Scenario A: You want to know the average cost of a car accident to set your monthly budget. You want the Mean (Average).
  • Scenario B: You are worried about a catastrophic crash that could bankrupt you. You care about the Tail (the worst-case scenario).

The paper shows that if you train an AI to minimize "Average Error," it becomes great at Scenario A but terrible at Scenario B. If you train it to minimize "Tail Risk," it becomes great at Scenario B but less accurate on the average.

Key Takeaway: You cannot just download a "perfect" AI model. You have to fine-tune (adjust) the model based on your specific goal.

  • If you are a bank worried about losing money, you need an AI trained to fear the worst-case scenario.
  • If you are a logistics company just trying to guess average delivery times, you need an AI trained to hit the middle.
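To illustrate this goal-dependence, here is a sketch on synthetic "claim sizes" (exponentially distributed, a made-up stand-in for the insurance analogy), using the pinball (quantile) loss as a tail-focused scoring rule:

```python
import random

random.seed(0)
# Synthetic insurance claims: mostly small, occasionally catastrophic.
claims = [random.expovariate(1 / 10_000) for _ in range(10_000)]

def pinball(pred, ys, tau):
    """Pinball (quantile) loss; minimized by the tau-quantile of ys."""
    return sum(max(tau * (y - pred), (tau - 1) * (y - pred)) for y in ys) / len(ys)

mean_pred = sum(claims) / len(claims)                # optimal under squared error
tail_pred = sorted(claims)[int(0.95 * len(claims))]  # ~optimal under pinball, tau=0.95

# The "best" model for the monthly budget and the "best" model for
# solvency disagree by roughly a factor of three on the same data.
print(round(mean_pred), round(tail_pred))
```

Neither prediction is wrong; each is optimal for its own scoring rule, which is exactly why the evaluation metric must match the downstream goal.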

What Did They Actually Do?

  1. They tested it: They took existing powerful AI models (TabPFN and TabICL) and fine-tuned them using these new Proper Scoring Rules (like CRPS) instead of the old average-error rules.
  2. The Result: The re-trained models were much better at predicting the shape of the data. They didn't just guess a number; they gave a much more honest picture of the risks and possibilities.
  3. The Comparison: They found that a newer model called TabICL was generally better at this "probabilistic" thinking than the older TabPFN, but both improved significantly when they used the new scoring rules.

Summary for the Everyday Person

  • Don't settle for the average. In a complex world, the "average" is often a lie that hides the risks.
  • Change the grade. We need to stop testing AI only on how close it is to the middle. We need to test it on how well it understands the whole range of possibilities.
  • One size does not fit all. An AI isn't "smart" in a vacuum. It is smart for a specific job. If you are worried about rare disasters, you must tell the AI to prioritize those rare events, or it will ignore them.

The paper is a call to action: Stop building AI that just guesses the middle. Start building AI that understands the full story, including the scary parts.