Accounting for shared covariates in semi-parametric Bayesian additive regression trees

Imagine you are trying to understand why some students get great grades in math while others struggle. You have a massive list of clues: how many hours they study, their parents' education level, whether the school has discipline problems, if they have a computer at home, how often they are hungry, and hundreds of other factors.

This paper introduces a new, smarter way to analyze these clues. It's called CSP-BART. To understand why it's special, let's break it down using a simple analogy.

The Problem: The "Black Box" vs. The "Clear Glass"

In the world of data science, there are two main ways to look at these clues:

The Clear Glass (Linear Models): This is like a traditional math equation. It's great because you can see exactly how much each factor matters. For example, "If a student's parents have a university degree, the grade goes up by 10 points." It's easy to understand, but it's rigid. It assumes every relationship is a straight line. It can't easily handle complex situations where factors mix together in weird ways (like how hunger might affect grades only if the student also has a computer).
The Black Box (Tree Models/BART): This is like a super-smart robot that looks at all the clues and finds patterns you never thought of. It's incredibly accurate at predicting grades. However, it's a "black box." You can't easily ask it, "How much did the parents' education actually matter?" because the robot has hidden those answers inside a tangled web of thousands of decision trees.

The Old Solution (SSP-BART):
Previously, researchers tried to combine these two. They said, "Let's put the important clues we want to understand (like parents' education) in the 'Clear Glass' part, and dump all the messy, complex clues into the 'Black Box'."

The Flaw: This forced the two parts to stay separate. It was like telling the Black Box, "You are not allowed to look at the parents' education level, even though that level might interact with the student's hunger." This meant the model missed important connections and gave biased answers about how much the parents' education really mattered.

The New Solution: CSP-BART (The "Shared Kitchen")

The authors of this paper propose a new method called CSP-BART. Instead of keeping the "Clear Glass" and the "Black Box" in separate rooms, they let them share the same kitchen.

Here is how it works, using a Chef and a Sous-Chef analogy:

The Head Chef (The Linear Part): This chef is in charge of the main ingredients you care about most (e.g., Parents' Education, Homework Time). They write down a simple recipe: "Add 10 points for University parents."
The Sous-Chef (The BART/Tree Part): This chef is a genius at spotting complex flavors. They handle the messy stuff: interactions, non-linear curves, and weird combinations.

The Innovation:
In the old method, the Sous-Chef was forbidden from touching the Head Chef's main ingredients. In CSP-BART, the Sous-Chef is allowed to use the main ingredients, but with a very strict rule: They cannot just copy the Head Chef's recipe.

If the Sous-Chef uses "Parents' Education," they must use it in a new, complex way (like an interaction). If they try to just say "University parents = +10 points" again, the system catches them and says, "Stop! The Head Chef already claimed that. You must do something different."

How They Fixed the "Double-Counting" Problem

The paper introduces two clever moves to make sure the Head Chef and Sous-Chef don't argue over who gets credit for the same thing:

The "Double-Grow" Move: Imagine the Sous-Chef wants to split the class based on "Parents' Education." In the old days, they would just make a simple split. In CSP-BART, if they try to split on a main ingredient, they are forced to immediately make a second split with something else (like "Hunger"). This forces the model to look at the interaction (Education combined with Hunger) rather than just the main ingredient alone.
The "Double-Prune" Move: If the Sous-Chef accidentally makes a tree that only talks about "Parents' Education" and nothing else, the system immediately cuts that branch off. It forces the Sous-Chef to focus only on the complex, messy interactions, leaving the simple main effects to the Head Chef.

Why Does This Matter? (The TIMSS Example)

The authors tested this on real data from the TIMSS 2019 study (a huge international math test). They wanted to know: How much does homework time actually help?

Old Models: Said homework helps a lot, and the more you do, the better you get.
CSP-BART: Found a more nuanced truth. Homework helps up to a point, but if you spend more than 90 minutes, your grades actually stop improving or even drop.
- The Insight: This suggests that students doing 90+ minutes of homework might be struggling students who are stuck on difficult problems, not super-achievers. The old models missed this "curved" relationship because they couldn't let the "Black Box" interact with the "Homework" variable properly.

The Bottom Line

This paper gives us a tool that is both accurate and understandable.

It lets us see the "main effects" clearly (like a linear model).
It automatically finds complex, hidden patterns (like a machine learning model).
Most importantly, it stops the two parts from fighting over the same data, ensuring we get the true, unbiased answer about what really drives student success.

It's like finally having a team where the expert in simple math and the expert in complex patterns can work together in the same room without stepping on each other's toes.

What follows is a detailed technical summary of the paper "Accounting for Shared Covariates in Semi-Parametric Bayesian Additive Regression Trees" by Prado et al.

1. Problem Statement

Standard Generalized Linear Models (GLMs) and Generalized Additive Models (GAMs) struggle with high-dimensional data because they require the pre-specification of interaction terms and assume linear relationships (or require manual specification of non-linear basis functions). Conversely, standard Bayesian Additive Regression Trees (BART) are flexible and automatic in capturing interactions and non-linearities but act as "black boxes," making it difficult to interpret the specific marginal effects of covariates of primary interest.

Previous attempts to combine these approaches, specifically Separated Semi-Parametric BART (SSP-BART), attempted to solve the interpretability issue by splitting covariates into two mutually exclusive sets:

$X_1$ (Primary): Covariates of interest modeled via a linear predictor.
$X_2$ (Non-primary): Covariates modeled exclusively by the BART component to capture interactions and non-linearities.

The Limitation: SSP-BART assumes $X_1 \cap X_2 = \emptyset$. This forces a rigid separation that prevents the model from capturing complex interactions between primary covariates or between primary and non-primary covariates. Furthermore, if a researcher wishes to allow a primary covariate to interact with others, they must move it to $X_2$, losing its interpretability in the linear component. This leads to bias and non-identifiability issues when the sets are not truly disjoint or when interactions involving primary variables are critical.
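
To make the limitation concrete, the semi-parametric model can be sketched in commonly used BART notation. The symbols $T_t$, $M_t$, $m$, and $g$ are assumptions introduced here for illustration and do not appear in the summary above:

$$
y_i = \mathbf{x}_{1i}^\top \beta + \sum_{t=1}^{m} g(\mathbf{x}_{2i};\, T_t, M_t) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
$$

where $\mathbf{x}_{1i}$ and $\mathbf{x}_{2i}$ are the rows of $X_1$ and $X_2$ for observation $i$, $T_t$ is the structure of tree $t$, and $M_t$ is its set of terminal node parameters. When $X_1 \cap X_2 = \emptyset$, an interaction such as $x_{1} x_{2}$ with $x_1 \in X_1$ has nowhere to go: the linear predictor is additive in $X_1$, and the trees never see $x_1$.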

2. Methodology: CSP-BART

The authors propose Combined Semi-Parametric BART (CSP-BART), a novel framework that relaxes the mutual exclusivity assumption (allowing $X_1 \cap X_2 \neq \emptyset$) while ensuring the identifiability of the linear coefficients.

Key Technical Innovations:

Shared Covariates: The design matrix allows primary covariates ($X_1$) to also be included in the BART component ($X_2$). This enables the model to estimate main effects linearly while simultaneously capturing interactions involving these variables non-parametrically.
Double-Grow and Double-Prune Moves: To resolve non-identifiability issues (where both the linear term and the tree might try to estimate the same marginal effect), the authors introduce modified tree-generation moves:
- Double-Grow: When a stump is split using a covariate $x \in (X_1 \cap X_2)$, a second split is immediately proposed using a different variable. Crucially, the prior on the terminal node parameter of the branch not containing the second split is modified (shrunk to zero). This forces the BART component to model only the interaction or non-linearity, preventing it from re-estimating the main effect already captured by the linear predictor.
- Double-Prune: The counterpart to double-grow. If pruning would leave a tree whose only split is on a shared covariate, the tree is pruned a second time, back to a stump, so that the resulting tree does not isolate a marginal effect of a shared variable.
Hierarchical Priors: Unlike SSP-BART, which assumes an isotropic prior (uncorrelated, equal variance) for linear coefficients, CSP-BART places a hierarchical prior on the covariance matrix of the linear coefficients ($\Omega_\beta \sim \text{Inverse-Wishart}$). This allows the model to account for correlations among the primary covariates, improving inference and reducing bias.
Random Effects Extension: The framework is extended to include random effects in the linear component (similar to linear mixed models), allowing for grouping factors, while maintaining the identifiability constraints via the double-moves.
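
The double-grow and double-prune moves described above can be illustrated with a minimal Python sketch. This is a simplified reading, not the authors' implementation: the `Node` class, the function names, the choice to always re-split the left child, and the rule "a tree is invalid when its only split is on a shared covariate" are all assumptions made for illustration. The sketch ignores split values, the modified terminal node prior, and the Metropolis-Hastings acceptance step.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Node:
    """Binary tree node; split_var is None for a terminal node."""
    split_var: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.split_var is None


def count_splits(node: Node) -> int:
    """Number of internal (splitting) nodes in the tree."""
    if node.is_leaf():
        return 0
    return 1 + count_splits(node.left) + count_splits(node.right)


def isolates_main_effect(tree: Node, shared: Set[str]) -> bool:
    """True when the tree's only split is on a shared covariate."""
    return count_splits(tree) == 1 and tree.split_var in shared


def grow_from_stump(first_var: str, second_var: str, shared: Set[str]) -> Node:
    """Grow a stump; switch to a double-grow when first_var is shared."""
    tree = Node(first_var, Node(), Node())
    if first_var in shared:
        if second_var == first_var:
            raise ValueError("second split must use a different covariate")
        # Double-grow: split one child again (the left child, arbitrarily).
        tree.left = Node(second_var, Node(), Node())
    return tree


def prune_once(node: Node) -> Node:
    """Collapse one internal node whose children are both terminal."""
    if node.is_leaf():
        return node
    if node.left.is_leaf() and node.right.is_leaf():
        return Node()
    if not node.left.is_leaf():
        return Node(node.split_var, prune_once(node.left), node.right)
    return Node(node.split_var, node.left, prune_once(node.right))


def prune(tree: Node, shared: Set[str]) -> Node:
    """Prune once; prune again (double-prune) if a lone shared split remains."""
    pruned = prune_once(tree)
    if isolates_main_effect(pruned, shared):
        pruned = prune_once(pruned)
    return pruned


shared_vars = {"parent_edu"}  # hypothetical covariate names
grown = grow_from_stump("parent_edu", "hunger", shared_vars)
plain = grow_from_stump("hunger", "parent_edu", shared_vars)
pruned = prune(grown, shared_vars)
```

Here `grown` has two splits because its first split uses a shared covariate, `plain` keeps a single split because `hunger` is not shared, and `pruned` falls all the way back to a stump rather than stopping at a lone split on `parent_edu`.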

Algorithmic Implementation:

The model uses a Gibbs sampler with Metropolis-Hastings steps. The update cycle involves:

Updating linear parameters ($\beta$) and covariance ($\Omega_\beta$) given partial residuals.
Sequentially updating tree structures using the novel double-moves and standard moves (grow, prune, change, swap), subject to strict validity checks to ensure no branch estimates a marginal effect of a shared variable that is already in the linear predictor.
Updating the error variance ($\sigma^2$).
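
The first and last steps of this cycle are standard conjugate updates, and a minimal NumPy sketch may help fix ideas. It assumes a prior $\beta \sim \mathcal{N}(0, \Omega_\beta)$ that does not depend on $\sigma^2$ and an inverse-gamma prior on $\sigma^2$; both are assumptions for illustration and may differ from the paper's exact parameterization. The tree update and the Inverse-Wishart update of $\Omega_\beta$ are omitted: `tree_fit` is a zero placeholder and `omega_beta` is held fixed. All function and variable names are hypothetical.

```python
import numpy as np


def sample_beta(X1, partial_resid, sigma2, omega_beta, rng):
    """Draw beta from its Gaussian full conditional.

    Assumes partial_resid = X1 @ beta + noise with noise ~ N(0, sigma2),
    and the prior beta ~ N(0, omega_beta).
    """
    prior_precision = np.linalg.inv(omega_beta)
    post_cov = np.linalg.inv(X1.T @ X1 / sigma2 + prior_precision)
    post_mean = post_cov @ (X1.T @ partial_resid) / sigma2
    return rng.multivariate_normal(post_mean, post_cov)


def sample_sigma2(full_resid, shape0, rate0, rng):
    """Draw sigma2 from its inverse-gamma full conditional.

    Assumes the prior sigma2 ~ Inverse-Gamma(shape0, rate0).
    """
    shape = shape0 + full_resid.size / 2.0
    rate = rate0 + 0.5 * float(full_resid @ full_resid)
    # If G ~ Gamma(shape, scale=1/rate), then 1/G ~ Inverse-Gamma(shape, rate).
    return 1.0 / rng.gamma(shape, 1.0 / rate)


# Simulated data with a purely linear signal (illustration only).
rng = np.random.default_rng(0)
n = 500
beta_true = np.array([1.5, -2.0])
X1 = rng.normal(size=(n, 2))
y = X1 @ beta_true + rng.normal(size=n)

tree_fit = np.zeros(n)   # placeholder for the sum-of-trees prediction
omega_beta = np.eye(2)   # held fixed; the hierarchical update is omitted
sigma2 = 1.0
draws = []
for _ in range(300):
    # Step 1: update beta given the partial residuals y - tree_fit.
    beta = sample_beta(X1, y - tree_fit, sigma2, omega_beta, rng)
    # Step 3: update sigma2 given the full residuals.
    sigma2 = sample_sigma2(y - tree_fit - X1 @ beta, 1.0, 1.0, rng)
    draws.append(beta)

beta_hat = np.mean(draws[100:], axis=0)  # posterior mean after burn-in
print(beta_hat)
```

In a full sampler, the tree updates would sit between the two steps and `tree_fit` would change at every iteration, so the partial residuals passed to `sample_beta` would be recomputed each time.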

3. Key Contributions

Theoretical: Showed that sharing covariates between parametric and non-parametric components is possible without non-identifiability, provided specific structural constraints (double-moves) are applied to the tree prior.
Methodological: Developed CSP-BART, which outperforms SSP-BART, Varying Coefficient BART (VCBART), and standard BART in terms of bias reduction for main effects and predictive accuracy.
Practical: Provided an R implementation (CSP-BART) and demonstrated its utility on complex, real-world educational data with missing values and high dimensionality.

4. Results

Simulation Studies

The authors conducted extensive simulations comparing CSP-BART against SSP-BART, VCBART, and GAMs:

Friedman Function: CSP-BART recovered true main effects with low bias, comparable to SSP-BART when no interactions existed.
Interaction Scenarios: In scenarios where primary covariates interacted (either with each other or non-primary covariates), SSP-BART exhibited significant bias because it could not model these interactions. CSP-BART successfully recovered the true main effects with low bias, demonstrating that the double-moves effectively isolate the linear effects while capturing interactions in the tree component.
VCBART Comparison: VCBART showed higher bias and computational cost (requiring separate trees for each coefficient) compared to CSP-BART.

Application: TIMSS 2019 Data

The model was applied to the Trends in International Mathematics and Science Study (TIMSS) 2019 data (Ireland, 8th grade) to predict math scores.

Primary Covariates: Parents' education level, minutes spent on homework, and school discipline problems.
Findings:
- CSP-BART identified significant negative effects of school discipline problems and positive effects of higher parental education.
- Non-linearity Discovery: Unlike linear models or SSP-BART, CSP-BART revealed a non-linear relationship for "minutes spent on homework." While moderate homework improved scores, spending "more than 90 minutes" was associated with lower scores, suggesting these students might be struggling.
- Interactions: The model detected significant interactions between primary covariates (e.g., education level $\times$ homework time) which SSP-BART could not capture due to its mutual exclusivity constraint.
- Uncertainty: CSP-BART produced tighter credible intervals than SSP-BART, attributed to the hierarchical prior allowing for correlated coefficients.

Benchmark Application (Pima Indians Diabetes)

In a classification setting, CSP-BART achieved a lower misclassification rate (17.94%) than SSP-BART (20.51%) or a hybrid model without double-moves (19.23%), supporting the importance of the structural constraints for both regression and classification.

5. Significance

This paper represents a significant advancement in semi-parametric Bayesian modeling. By resolving the identifiability conflict between linear and tree-based components, CSP-BART offers a "best of both worlds" solution:

Interpretability: It provides clear, unbiased estimates of main effects for variables of interest.
Flexibility: It automatically captures complex, unspecified interactions and non-linearities without requiring the user to pre-specify them.
Robustness: It handles scenarios where primary variables interact with the rest of the feature space, a common real-world scenario that previous semi-parametric BART models failed to address correctly.

The method is particularly valuable for fields like education research, epidemiology, and social sciences, where understanding the specific impact of policy-relevant variables (e.g., education level, intervention time) is crucial, yet these variables inevitably interact with a complex web of other factors.