Imagine you have a super-smart but mysterious robot (a "black box" machine learning model) that makes predictions about things like house prices, loan approvals, or medical diagnoses. You want to know: "How does this robot actually think?" Specifically, you want to know how changing one input (like the size of a house) changes the output (the price).
To answer this, data scientists use two popular tools called Partial Dependence (PD) and Accumulated Local Effects (ALE). Think of these tools as "flashlights" that shine on the robot's brain to see how it reacts to specific features.
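The PD "flashlight" is simple enough to sketch in a few lines of Python. This is an illustrative toy, not the paper's code: `black_box` is a hypothetical stand-in for any fitted model, and the PD curve for a feature is just the average prediction when that feature is forced to each value on a grid.

```python
import numpy as np

# A stand-in "black box": a hypothetical house-price model where price
# grows linearly with size (feature 0) and wiggles with location (feature 1).
def black_box(X):
    return 100 * X[:, 0] + 5 * np.sin(X[:, 1])

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))  # 500 houses, 2 features

def partial_dependence(model, X, feature, grid):
    """PD curve: for each grid value, force the feature to that value
    for every row, predict, and average the predictions."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # everyone's house "becomes" size v
        pd_values.append(model(X_mod).mean())
    return np.array(pd_values)

grid = np.linspace(0, 1, 5)
pd_curve = partial_dependence(black_box, X, feature=0, grid=grid)
# Because the model is linear in size, the PD curve is a straight line:
# pd_curve[-1] - pd_curve[0] equals the slope of 100.
```

Because PD averages over the whole dataset at every grid point, each point of the curve is a sample mean, which is exactly where the estimation error discussed below comes in.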
However, the authors of this paper discovered a problem: Our flashlights aren't perfect. Sometimes the picture they show is blurry, sometimes it's biased, and sometimes it's just shaky. The big question they asked was: "Where does the blur come from, and how do we fix it?"
Here is the breakdown of their findings using simple analogies.
1. The Two Sources of "Blur" (Error)
When you try to measure how the robot thinks, your measurement can be wrong for two main reasons. The authors broke these down like a recipe for a bad photo:
- The Robot's Own Confusion (Model Bias/Variance): Maybe the robot itself learned the wrong rules. If the robot is overconfident or confused, the flashlight will show a distorted picture.
- The Flashlight's Shaky Hand (Estimation Bias/Variance): Even if the robot is perfect, you might be measuring it poorly.
- Shaky Hand (Variance): If you only look at a tiny sample of data, your measurement might jump around wildly. It's like trying to guess the average height of a crowd by measuring just three people.
- Wrong Angle (Bias): If you measure the robot using the same data it was trained on, it might look smarter than it really is (like a student memorizing the test answers).
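The "three people" intuition is easy to check numerically. In this hypothetical sketch, we repeatedly estimate a crowd's average height from samples of 3 versus 300 people and compare how much the estimates scatter:

```python
import numpy as np

# "Shaky hand" demo: how much does the estimate jump around
# as a function of sample size?
rng = np.random.default_rng(42)
crowd = rng.normal(170, 10, size=100_000)  # heights in cm

def estimate_spread(sample_size, n_repeats=2000):
    """Standard deviation of the sample-mean estimate across many repeats."""
    means = [rng.choice(crowd, size=sample_size).mean()
             for _ in range(n_repeats)]
    return np.std(means)

spread_small = estimate_spread(3)    # roughly 10 / sqrt(3)   cm of wobble
spread_large = estimate_spread(300)  # roughly 10 / sqrt(300) cm of wobble
# The three-person estimate is about ten times shakier.
```

The same square-root law governs PD and ALE curves: each point on the curve is an average, so its wobble shrinks only as the square root of the sample size grows.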
2. The Great Debate: Training Data vs. New Data
For years, data scientists have argued over a practical question: "Should we use the data the robot learned on (Training Data) or brand new data (Holdout Data) to test it?"
- Team Training Data: "Use the old data! We have more of it, so our measurement will be more stable."
- Team Holdout Data: "No! The robot might have 'memorized' the old data (overfitting). We need to test it on fresh data to see the truth."
The Paper's Verdict:
The authors did a massive simulation (like running thousands of experiments in a lab) and found something surprising: It doesn't matter much which one you pick, but sample size does.
- The "Memorization" Fear is Overblown: They found that even if the robot memorized the training data, the error introduced by using that data to explain the robot is tiny. It's like worrying that a chef tasted the soup while cooking it and therefore the final taste is ruined. The taste is fine.
- The "Sample Size" King: The biggest factor is simply how much data you have. Using the larger training set usually gives a clearer picture than using a smaller, "fresh" test set. The benefit of having more data outweighs the risk of the robot being slightly biased.
3. The "Cross-Validation" Superpower
The paper suggests a third option: Cross-Validation (CV).
Imagine you are testing a student. Instead of giving them one final exam (Holdout) or letting them study the practice test (Training), you give them five different mini-tests and average the results.
- Why it works: This smooths out the "shaky hand" errors. It reduces the noise significantly, especially for robots that are prone to overfitting (memorizing).
- The Result: CV often gives the clearest, most reliable picture of how the robot thinks.
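The CV idea can be sketched in a few lines. This is an illustrative toy, not the paper's exact procedure: we refit a simple model on each training fold, compute the PD curve on the held-out fold (the "mini-test"), and average the five curves.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 2))
y = 10 * X[:, 0] + rng.normal(0, 1, size=600)  # feature 0 drives the target

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def pd_curve(model, X, feature, grid):
    out = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        out.append(model(X_mod).mean())
    return np.array(out)

grid = np.linspace(0.1, 0.9, 9)
folds = np.array_split(rng.permutation(600), 5)  # five "mini-tests"
curves = []
for k in range(5):
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    model = fit_linear(X[train_idx], y[train_idx])
    curves.append(pd_curve(model, X[folds[k]], feature=0, grid=grid))

cv_pd = np.mean(curves, axis=0)  # averaging the five curves cancels noise
# The averaged curve recovers feature 0's true slope of about 10.
```

Each fold's curve is noisy on its own; the average is the "noise-canceling" step.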
4. The Special Case of ALE (The Sensitive Tool)
The paper highlights that ALE (one of the two flashlights) is much more sensitive to sample size than PD.
- Analogy: Think of PD as a wide-angle lens and ALE as a high-magnification microscope.
- If you use a microscope (ALE) on a tiny sample, the image is grainy and useless. You need a lot of data to make ALE work well. If you don't have enough data, ALE's "shaky hand" gets much worse than PD's.
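To see why ALE is data-hungry, it helps to look at how it is computed: the feature is cut into bins, and the effect in each bin is estimated only from the points that land inside it. With a small sample, many bins hold just a handful of points, so the local averages get grainy. A minimal, illustrative 1D implementation (not the paper's code):

```python
import numpy as np

def ale_curve(model, X, feature, n_bins):
    """Accumulated Local Effects: average the model's *local* change
    across each bin of the feature, then accumulate the bin means."""
    z = np.quantile(X[:, feature], np.linspace(0, 1, n_bins + 1))
    effects = []
    for k in range(n_bins):
        in_bin = (X[:, feature] >= z[k]) & (X[:, feature] <= z[k + 1])
        if not in_bin.any():           # empty bin: no local information
            effects.append(0.0)
            continue
        lo, hi = X[in_bin].copy(), X[in_bin].copy()
        lo[:, feature], hi[:, feature] = z[k], z[k + 1]
        effects.append((model(hi) - model(lo)).mean())
    ale = np.concatenate([[0.0], np.cumsum(effects)])
    return z, ale - ale.mean()         # center the curve, as is conventional

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(2000, 2))
model = lambda X: 3 * X[:, 0] + X[:, 1] ** 2   # hypothetical fitted model
z, ale = ale_curve(model, X, feature=0, n_bins=10)
# For a linear effect, the ALE curve rises with slope ~3:
# ale[-1] - ale[0] is about 3 * (z[-1] - z[0]).
```

Unlike PD, which averages over all 2000 points at every grid value, each ALE bin here uses only ~200 points, and with fewer bins or less data those per-bin averages are exactly where the "grainy microscope" problem shows up.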
Summary: What Should You Do?
If you are trying to explain a machine learning model to a boss or a client, here is the practical advice from the paper:
- Don't stress too much about "Overfitting Bias": You don't need to panic about using training data just because the model might have memorized it. The error is negligible.
- Go Big on Data: If you have to choose, use the largest dataset available (usually the training data) to get the smoothest, most stable explanation.
- Use Cross-Validation if you can: If you want the absolute best, most reliable explanation (especially for complex models), use Cross-Validation. It acts like noise-canceling headphones for your data analysis.
- Watch out for ALE: If you use the ALE method, make sure you have a huge amount of data, or your results will be too shaky to trust.
In a nutshell: The paper tells us that the "flashlights" we use to understand AI are actually quite robust. We don't need to be perfect purists about using "fresh" data; we just need to make sure we have enough data to get a clear picture.