Imagine you are a bank manager trying to decide whether to approve a loan for a customer. You have a computer program (a Decision Tree) that looks at the customer's data and says "Yes" or "No."
You might think, "If the computer is smart, any equally good program trained on the same data would give the same answer for the same person." But this paper reveals a surprising truth: the computer might be guessing, and it doesn't even know it.
Here is the story of the paper, explained simply with some everyday analogies.
The Big Problem: "The Rashomon Effect"
In the movie Rashomon, the same crime is recounted from four different perspectives; the accounts contradict one another, yet each seems plausible. In machine learning, this is called Predictive Multiplicity.
It turns out that for many real-world problems (like credit scores), there isn't just one perfect computer model. There are dozens of different models that are all equally good at predicting the past, but they might give conflicting answers for the same person today.
- Model A says: "Approve the loan."
- Model B says: "Reject the loan."
Both models are "correct" based on the data they were trained on, but the difference comes from Observational Multiplicity. This is a fancy way of saying: The data we collected is just one random snapshot of reality. If we had collected the data on a different day, or with slightly different noise, we might have trained a totally different model.
The Solution: Breaking the Tree into Two Parts
The authors of this paper looked at Decision Trees (which look like flowcharts with branches and leaves) and realized that the "guessing" happens in two distinct ways. They invented a way to measure these two types of uncertainty, which they call Regret.
Think of a Decision Tree like a giant tree in a park.
1. Leaf Regret: The "Crowded Room" Problem
Imagine a specific branch of the tree (a "leaf") where 10 people are standing. The tree decides that everyone in this group gets a loan.
- The Issue: If you look closely at those 10 people, maybe 6 are great borrowers and 4 are risky. The tree just averages them out.
- Leaf Regret measures how much the answer would wiggle if you swapped out a few people in that specific group. It's the noise inside a single room.
- The Fix: If you put more people in that room (make the leaf bigger), the average becomes more stable. The "noise" goes down.
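The "crowded room" intuition can be sketched in a few lines of Python. The data here is made up (a 6-good / 4-risky room, then a 100-person room with the same mix), and flip-counting under redraws is only an illustrative proxy for the paper's leaf-regret measure, not its actual formula:

```python
import random

def leaf_vote_instability(labels, n_resamples=2000, seed=0):
    """How often does a leaf's majority vote flip if we redraw the same
    number of people from the leaf's own mix? (An illustrative proxy for
    leaf regret, not the paper's exact formula.)"""
    rng = random.Random(seed)
    n = len(labels)
    base_vote = 2 * sum(labels) >= n          # the observed leaf's decision
    flips = 0
    for _ in range(n_resamples):
        redraw = [rng.choice(labels) for _ in range(n)]  # swap people at random
        if (2 * sum(redraw) >= n) != base_vote:
            flips += 1
    return flips / n_resamples

# A 10-person room with a 6-good / 4-risky mix: the vote is noisy.
print(leaf_vote_instability([1] * 6 + [0] * 4))
# The same 60/40 mix in a 100-person room: the vote barely moves.
print(leaf_vote_instability([1] * 60 + [0] * 40))
```

Running this shows exactly the fix the authors describe: the small room flips its decision in a noticeable fraction of redraws, while the big room with the identical mix almost never does.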
2. Structural Regret: The "Shaky Branch" Problem
Now, imagine the tree itself. What if the branch that leads to that room is wobbly?
- The Issue: Because the data is noisy, the computer might draw the map of the tree differently every time. One day, it puts Person X in the "Safe" room. The next day, because of a tiny change in the data, it moves Person X to the "Risky" room.
- Structural Regret measures how much the map itself changes. It's the instability of the branches.
- The Surprise: The paper found that this is the big problem. In many cases, the tree moving people around (Structural Regret) causes way more confusion than the noise inside the rooms (Leaf Regret). In some datasets, the tree structure was 15 times more unstable than the noise inside the leaves!
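To make the "shaky branch" concrete, here is a self-contained sketch (NumPy only): we fit the simplest possible tree, a one-split "stump", on 50 bootstrap redraws of an invented loan dataset and watch both the map (which feature the root splits on) and each person's prediction move around. The data, the stump, and the flip rate are illustrative assumptions, not the paper's formal definition of structural regret:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loan data: two noisy features (think "income" and "debt ratio"),
# both weakly related to repayment. Nothing here comes from the paper's datasets.
n = 200
X = rng.normal(size=(n, 2))
y = ((X[:, 0] + X[:, 1] + rng.normal(scale=1.5, size=n)) > 0).astype(int)

def best_stump(X, y):
    """Fit a depth-1 tree: try every feature and threshold, give the two
    leaves opposite labels, keep the split with the fewest mistakes."""
    best, best_err = None, len(y) + 1
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            for left_label in (0, 1):
                err = np.sum(y[left] != left_label) + np.sum(y[~left] == left_label)
                if err < best_err:
                    best_err, best = err, (f, t, left_label)
    return best

def stump_predict(stump, X):
    f, t, left_label = stump
    return np.where(X[:, f] <= t, left_label, 1 - left_label)

# Redraw the dataset 50 times and refit: does the "map" itself move?
roots, preds = [], []
for _ in range(50):
    idx = rng.integers(0, n, n)          # a different random snapshot of reality
    stump = best_stump(X[idx], y[idx])
    roots.append(stump[0])               # which feature the tree split on
    preds.append(stump_predict(stump, X))

preds = np.array(preds)
consensus = (preds.mean(axis=0) >= 0.5).astype(int)
flip_rate = (preds != consensus).mean(axis=0)  # per-person instability

print("trees that split on feature 0:", roots.count(0), "of 50")
print(f"shakiest person disagrees with the consensus in {flip_rate.max():.0%} of trees")
```

The point of the sketch: two people with identical data can land in different "rooms" depending on which snapshot the tree was drawn from, and `flip_rate` singles out exactly who gets moved around.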
The "Honesty" Test: Knowing When to Say "I Don't Know"
So, what do we do with this? The authors suggest a safety mechanism called Selective Prediction.
Imagine you are a doctor. If a patient has a very clear symptom, you diagnose them. But if the symptoms are vague and the test results are shaky, you don't guess; you say, "I need a second opinion."
The paper shows that by measuring Structural Regret, the computer can identify the people it is "guessing" about.
- The Result: When the computer refuses to make a decision for the "shaky" cases (the ones with high structural regret), it becomes incredibly accurate for the cases it does answer.
- Real-world impact: In their tests, by simply saying "I'm not sure" for the risky cases, the model's ability to catch all the good loan applicants (Recall) went from 92% to 100% on the most stable groups.
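The doctor's "second opinion" rule can be sketched the same way. As a stand-in for the paper's structural-regret score, this sketch measures shakiness as disagreement among scikit-learn trees trained on redrawn data, then abstains on the shaky cases; the dataset, the 10% abstention threshold, and the tree settings are all hypothetical choices, not the paper's experimental setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic stand-in for a noisy loan dataset.
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
votes = []
for _ in range(50):                        # 50 trees, each from a redrawn dataset
    idx = rng.integers(0, len(X_tr), len(X_tr))
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr[idx], y_tr[idx])
    votes.append(tree.predict(X_te))
votes = np.array(votes)

consensus = (votes.mean(axis=0) >= 0.5).astype(int)
shakiness = (votes != consensus).mean(axis=0)   # per-person disagreement

answered = shakiness <= 0.1                # say "I'm not sure" to the shaky cases
acc_all = (consensus == y_te).mean()
acc_answered = (consensus[answered] == y_te[answered]).mean()
print(f"accuracy on everyone: {acc_all:.2f}")
print(f"accuracy when allowed to abstain: {acc_answered:.2f} "
      f"(answered {answered.mean():.0%} of cases)")
```

Moving the 0.1 threshold trades coverage for reliability: the stricter it is, the fewer people get an answer, but the answers that remain come from cases where nearly every alternative tree agreed.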
The Takeaway
This paper teaches us three important lessons for the future of AI:
- Don't trust the map blindly: The biggest source of error in decision trees isn't the data inside the groups; it's the fact that the groups themselves keep moving around.
- Stability matters more than just accuracy: A model that is 99% accurate but changes its mind every time you tweak the data is dangerous. We need models that are "stable."
- It's okay to abstain: The safest AI isn't the one that answers every question; it's the one that knows when to say, "This decision is too arbitrary for me to make," and flags it for a human to review.
In short, the authors gave us a tool to measure how much the computer is guessing, and they showed that for decision trees, the guessing usually comes from the tree's structure being too wobbly, not just from bad data.