Imagine you are a detective trying to solve a mystery: Which clues actually helped solve the case, and which ones were just red herrings?
In the world of data science, this is the problem of Feature Relevance. You have a set of measured features (clues) and a target outcome (the crime). You want to know: Does Clue A actually tell us anything new about the crime, once we already know everything about Clues B, C, and D?
For a long time, modern AI (the "Black Box" detectives) was great at solving the case but terrible at explaining how they did it. They could give you a prediction, but they couldn't give you a mathematically proven "guilty" or "innocent" verdict for individual clues. They relied on guesswork or rules of thumb that often lied, especially when clues were correlated (e.g., "Rain" and "Wet Grass" often happen together, so it's hard to tell which one actually caused the puddle).
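The "Rain" and "Wet Grass" trap can be seen in a few lines of NumPy. This is a made-up illustration (not data from the paper): on their own, both clues look strongly correlated with the outcome, but once you account for the true cause, the side effect has nothing left to explain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
rain = rng.normal(size=n)                    # the true cause
wet_grass = rain + 0.3 * rng.normal(size=n)  # a correlated side effect
puddle = rain + 0.5 * rng.normal(size=n)     # outcome driven only by rain

# Both clues look "guilty" if we only check marginal correlation:
r_rain = np.corrcoef(rain, puddle)[0, 1]
r_grass = np.corrcoef(wet_grass, puddle)[0, 1]
print(f"rain vs puddle: {r_rain:.2f}, wet grass vs puddle: {r_grass:.2f}")

# But once rain is accounted for, wet grass has nothing left to explain:
resid_grass = wet_grass - rain    # what wet grass adds beyond rain: pure noise
resid_puddle = puddle - rain      # what remains of the puddle beyond rain
r_partial = np.corrcoef(resid_grass, resid_puddle)[0, 1]
print(f"wet grass vs puddle, given rain: {r_partial:.2f}")
```

Marginal correlation cannot separate the cause from the hitchhiker; only a test that conditions on the other clues can.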
This paper introduces a new, super-powered detective tool that combines two things:
- The Conditional Randomization Test (CRT): A rigorous statistical method that acts like a "What If?" simulator.
- TabPFN: A pre-trained "Foundation Model" (a super-smart AI) that is already an expert at looking at tables of data and understanding patterns without needing to be retrained for every single new case.
Here is how the paper's solution works, explained through a simple analogy.
The "Magic Swap" Experiment
Imagine you are in a courtroom. The prosecution claims that Clue X (let's say, "The suspect's shoe size") is crucial to solving the crime. The defense says, "No way! Once you know the suspect's height and weight, the shoe size tells us nothing new."
To prove who is right, the judge (our statistical test) orders a Magic Swap:
- The Setup: We take the real case file. We keep the suspect's height, weight, and the crime details exactly as they are.
- The Swap: We magically erase the suspect's actual shoe size.
- The Simulation: We ask our super-smart AI (TabPFN) to guess what the shoe size should have been, based only on the height and weight. It generates a plausible "fake" shoe size that is consistent with the other clues.
- The Test: We swap the real shoe size with this fake one. Now, we ask the AI: "If the shoe size were this fake one, how well could you still predict the crime?"
- The Repeat: We do this swap many times (say, 1,000), creating 1,000 different "fake" shoe sizes and recording, each time, how well the prediction holds up.
The Verdict:
- If the AI's predictions are clearly better with the real shoe size than with the fake ones (the real value stands out from the 1,000 fakes), the real shoe size was carrying unique, vital information. The AI noticed the difference. Verdict: Guilty (Relevant).
- If the AI's predictions are about the same whether the shoe size is real or fake (the real value blends in with the fakes), the shoe size was just a red herring. The other clues (height and weight) already explained everything. Verdict: Innocent (Irrelevant).
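The courtroom procedure above is the Conditional Randomization Test. Here is a minimal, self-contained sketch: all names (`crt_pvalue`, `stat`, `sampler`) and numbers are illustrative, and the conditional sampler is a hand-built Gaussian stand-in that we can write down only because we generated the toy data ourselves. In the paper, TabPFN plays that sampling role, learned from the data rather than hard-coded.

```python
import numpy as np

def crt_pvalue(x_j, x_rest, y, statistic, sample_conditional, n_swaps=200, seed=0):
    """Conditional Randomization Test for a single feature x_j.

    statistic(x_j, x_rest, y) scores how useful the clue looks;
    sample_conditional(x_rest, rng) draws a fake x_j from (an estimate of)
    its distribution given the other clues. The paper plugs TabPFN into
    this sampling step; here, anything can be plugged in.
    """
    rng = np.random.default_rng(seed)
    t_real = statistic(x_j, x_rest, y)  # score with the real clue
    t_fake = np.array([
        statistic(sample_conditional(x_rest, rng), x_rest, y)
        for _ in range(n_swaps)
    ])
    # p-value: how often a fake clue looks at least as useful as the real one
    return (1 + np.sum(t_fake >= t_real)) / (1 + n_swaps)

# --- toy courtroom: does shoe size matter once we know height? ---
rng = np.random.default_rng(1)
n = 800
height = rng.normal(size=n)
shoe = 0.8 * height + 0.6 * rng.normal(size=n)  # shoe size depends on height

# Case A: the "crime" depends only on height -> shoe is a red herring.
crime_a = np.sin(height) + 0.5 * rng.normal(size=n)
# Case B: the "crime" also depends directly on shoe size.
crime_b = np.sin(height) + shoe + 0.5 * rng.normal(size=n)

def stat(x_j, x_rest, y):
    # simple test statistic: |correlation| between the clue and the outcome
    return abs(np.corrcoef(x_j, y)[0, 1])

def sampler(x_rest, rng):
    # stand-in for TabPFN: we built the data, so we know shoe | height exactly
    return 0.8 * x_rest + 0.6 * rng.normal(size=x_rest.shape[0])

p_innocent = crt_pvalue(shoe, height, crime_a, stat, sampler)
p_guilty = crt_pvalue(shoe, height, crime_b, stat, sampler)
print(f"red herring p-value: {p_innocent:.3f}, real clue p-value: {p_guilty:.3f}")
```

Swapping the hand-built `sampler` for a learned one is the paper's key move: a pre-trained foundation model supplies that conditional draw for arbitrary tabular data, with no custom model fit per feature.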
Why is this paper special?
Previous methods had two big problems:
- They were too rigid: They assumed the world was a straight line (Linear) or followed a bell curve (Gaussian). Real life is messy, curved, and full of surprises.
- They were too slow or shaky: To do the "Magic Swap," you usually had to build a new, custom AI model for every single clue you wanted to test. This was like hiring a new architect to redesign a house just to check if the front door matters. It took forever and often made mistakes.
The Paper's Innovation:
The authors used TabPFN. Think of TabPFN as a Master Chef who has already tasted millions of different recipes (datasets) during their training.
- You don't need to hire a new chef for every dish. You just call the Master Chef.
- The Chef instantly knows how ingredients (features) interact with each other.
- Because the Chef is so good at guessing "What would the shoe size be given the height?", the "Magic Swap" is incredibly accurate.
The Results: What did they find?
The authors ran this "Magic Swap" test on 11 different types of made-up mysteries (simulations), ranging from simple straight-line relationships to complex, twisting, non-linear puzzles.
- The "Innocent" Clues: When they tested clues that shouldn't matter, the test wrongly cried "Guilty" no more than about 5% of the time, the false-alarm rate it promised. It didn't cry wolf.
- The "Guilty" Clues: When they tested clues that did matter, the test caught them almost every time, even when the clues were hidden inside complex, non-linear patterns.
- The Correlation Trap: Even when two clues were highly correlated (like "Rain" and "Wet Grass"), the test could tell you which one carried genuine information and which one was just riding along.
The Bottom Line
This paper gives us a reliable, mathematically sound way to ask AI: "Are you sure this clue matters?"
It bridges the gap between Modern AI (which is flexible and powerful but opaque) and Classical Statistics (which is rigorous and trustworthy but rigid). By using a pre-trained "Foundation Model" as the engine for this test, we get the best of both worlds: we can trust the p-values (the statistical verdicts) without sacrificing the ability to handle messy, real-world data.
In short: It turns the black box of AI into a transparent glass box where we can finally see, with statistical certainty, which features are doing the heavy lifting and which ones are just along for the ride.