Imagine you are trying to teach a robot how to make decisions, like a doctor diagnosing an illness or a mechanic fixing a car. You want the robot to be smart, but you also want to be able to look at its "brain" and understand why it made a specific choice. This is the world of Interpretable AI.
This paper is about building a very specific type of decision-making robot called a Model Tree. Here is the story of what the authors did, explained simply.
1. The Problem: The "Greedy" Chef
Imagine you are a chef trying to create the perfect menu for a restaurant.
- The Old Way (Greedy Algorithms): Most computer programs build decision trees like a chef who makes decisions one dish at a time without looking at the whole menu. They ask, "Is the customer hungry?" If yes, they pick a burger. Then they ask, "Do they like cheese?" If yes, they add cheese. They never look back to see if a different first question would have led to a better, simpler menu. This is fast, but it often results in a menu that is huge, messy, and confusing.
- The "Classic" Decision Tree: In these trees, the final answer (the leaf) is just a single label, like "Buy Burger." It's simple, but it's rigid. It can't say, "Buy a burger if you are hungry, but adjust the price based on your income."
2. The Solution: The "Smart" Model Tree
The authors wanted to build a tree that is both small (easy to understand) and smart (very accurate).
- The Innovation: Instead of just putting a static label at the end of a branch (like "Burger"), they put a mini-math formula (a linear model) there.
- The Analogy: Think of a classic tree as a signpost that just says "Go Left." A Model Tree is a signpost that says, "Go Left, and here is a formula to calculate exactly how much you should pay based on your speed and weight." This allows the tree to be much smaller because the math at the end does the heavy lifting, rather than needing thousands of branches to cover every tiny detail.
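The difference between a static leaf and a formula leaf can be sketched in a few lines of code. This is an illustrative toy, not the paper's algorithm, and all the coefficients and thresholds below are made up for the example:

```python
# A "classic" tree leaf returns one fixed value for everyone in its
# region; a model-tree leaf holds a mini linear formula that adapts
# to the inputs. The numbers are hypothetical.

def classic_tree_predict(speed, weight):
    # Leaf is a static label: every point in the region gets the same answer.
    if speed > 60:
        return 12.0
    return 5.0

def model_tree_predict(speed, weight):
    # Leaf is a small linear model: the answer varies within the region.
    if speed > 60:
        return 0.15 * speed + 0.02 * weight   # hypothetical coefficients
    return 0.05 * speed + 0.01 * weight

print(classic_tree_predict(70, 100))  # 12.0, regardless of the exact inputs
print(model_tree_predict(70, 100))    # 0.15*70 + 0.02*100 = 12.5
print(model_tree_predict(90, 100))    # 0.15*90 + 0.02*100 = 15.5
```

Because each leaf can fit a trend rather than a single constant, the tree needs far fewer branches to reach the same accuracy.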
3. The Challenge: Finding the Perfect Map
The problem is that finding the perfect tree structure is incredibly hard. It's like trying to find the shortest path through a maze with billions of possible turns.
- The Old Way: Computers usually guess the path step-by-step (greedy). They might get stuck in a dead end because they didn't see the bigger picture.
- The New Way (MILP): The authors used a powerful mathematical tool called Mixed-Integer Linear Programming (MILP).
- The Metaphor: Imagine you are a general trying to move an army across a continent. Instead of sending scouts to guess the best route one by one, you use a supercomputer to evaluate every possible route at once and pick the single best one that minimizes distance and maximizes safety.

- This method forces the computer to look at the entire tree structure at once, ensuring the final result is globally optimal, not just locally good.
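The classic toy case where greedy search fails is XOR-style data. The sketch below is purely illustrative (it uses brute-force enumeration rather than the paper's MILP formulation) but shows the same point: judging the whole tree at once beats judging one split at a time.

```python
# Toy illustration (not the paper's MILP): on XOR data, no single split
# reduces the error, so a greedy builder that demands immediate
# improvement stops at the root. Searching over whole depth-2 trees
# finds a perfect fit.

from itertools import product

# Four points, label = x XOR y.
data = [((x, y), x ^ y) for x, y in product([0, 1], repeat=2)]

def errors(points):
    """Misclassifications if a leaf predicts the majority label."""
    labels = [lab for _, lab in points]
    return min(labels.count(0), labels.count(1))

def split(points, feature):
    left = [p for p in points if p[0][feature] == 0]
    right = [p for p in points if p[0][feature] == 1]
    return left, right

# Greedy: try each feature once; keep a split only if it lowers the error.
root_err = errors(data)
best_one_step = min(sum(errors(side) for side in split(data, f)) for f in (0, 1))
greedy_err = best_one_step if best_one_step < root_err else root_err

# Global: enumerate every depth-2 tree (root feature + one per branch).
global_err = min(
    errors(ll) + errors(lr) + errors(rl) + errors(rr)
    for f0 in (0, 1)
    for f1 in (0, 1)
    for f2 in (0, 1)
    for (l, r) in [split(data, f0)]
    for (ll, lr) in [split(l, f1)]
    for (rl, rr) in [split(r, f2)]
)

print(greedy_err)  # 2 — no single first split looks worthwhile to a greedy builder
print(global_err)  # 0 — a depth-2 tree solves XOR exactly
```

Brute force only works on toy problems; the point of MILP is that a solver can search the space of whole trees far more cleverly than enumeration, with a proof of optimality at the end.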
4. The Experiment: The Grand Tournament
The authors put their new "Optimal Model Trees" into a massive tournament against the best decision-makers in the world:
- The Competitors: They fought against standard decision trees, Random Forests (a committee of many trees), and Support Vector Machines (complex math models).
- The Results:
- Accuracy: The new trees were as accurate as, and often more accurate than, the complex competitors.
- Size: This was the big win. The new trees were tiny. While other methods grew massive, tangled forests of rules, the authors' trees were small, neat, and easy to read.
- The Trade-off: The only downside? It takes a long time to cook the meal. Because the computer is calculating the perfect solution, it can take hours (or hit a time limit) to solve complex problems. It's like waiting for a gourmet chef to perfect a dish versus grabbing a fast-food burger.
5. The Twist: Multivariate Splits
The authors also tested a "super-powered" version where the splits aren't just "Is X greater than 5?" but "Is 2X + 3Y greater than 10?"
- The Result: This made the trees even more accurate but harder for humans to understand (like a recipe written in a secret code). The authors found that while this boosts performance, sticking to simple splits is usually better for keeping the AI "interpretable" (understandable to humans).
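A tiny sketch makes the power of multivariate splits concrete. The data below is invented for illustration: the classes are separated by the diagonal rule 2x + 3y > 10, which one oblique split captures exactly, while no single axis-aligned test ("Is X greater than some threshold?") can:

```python
# Toy sketch (illustrative data, not from the paper): one multivariate
# split matches a diagonal class boundary perfectly; the best single
# axis-aligned split cannot.

points = [(1, 1), (2, 1), (1, 2), (4, 1), (2, 3), (4, 4), (1, 3), (3, 2)]
labels = [1 if 2 * x + 3 * y > 10 else 0 for x, y in points]

def split_errors(predicate):
    """Errors when one test routes points to two constant-label leaves."""
    left = [lab for p, lab in zip(points, labels) if predicate(p)]
    right = [lab for p, lab in zip(points, labels) if not predicate(p)]
    err = lambda side: min(side.count(0), side.count(1))
    return err(left) + err(right)

# Oblique (multivariate) split: exactly the generating rule.
oblique_err = split_errors(lambda p: 2 * p[0] + 3 * p[1] > 10)

# Best axis-aligned split over either feature and every candidate threshold.
axis_err = min(
    split_errors(lambda p, f=f, t=t: p[f] > t)
    for f in (0, 1)
    for t in {p[f] for p in points}
)

print(oblique_err)  # 0: the diagonal test separates the classes perfectly
print(axis_err)     # 2: every single-variable threshold misclassifies points
```

The catch is exactly the one the authors note: "2X + 3Y > 10" is far harder for a human to reason about than "X > 5", so the extra accuracy comes at the cost of interpretability.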
The Bottom Line
This paper shows that if you are willing to wait longer for the computer to think, you can build tiny, optimal decision trees that are both highly accurate and easy for humans to understand.
In a nutshell:
- Old Trees: Fast to build, but often huge and messy.
- New Trees (Model Trees): Take longer to build, but they are small, precise, and explain their math.
- Best for: Situations where trust and clarity are more important than speed, like medical diagnosis or financial lending, where you need to know exactly why a decision was made.