Imagine you are trying to solve a very difficult puzzle, like predicting whether a customer will cancel their subscription or if a patient has a specific disease. You ask a group of experts (a "committee") for their opinions.
The Old Way (Standard Stacking):
Usually, you ask the experts, take their answers, and ask a "Manager" to combine them into one final decision. That's it. You stop there.
- The Problem: If you try to make this deeper—asking the Manager to ask another Manager, who asks another Manager—you run into trouble. The pile of information gets too huge (too many features), the process gets too slow, and the Managers start getting confused or repeating each other's mistakes (overfitting). It's like trying to pass a message down a line of 100 people; by the end, the message is garbled and the line is clogged.
The New Way (RocketStack):
The author, Çağatay Demirel, built a system called RocketStack. Think of it as a high-tech, self-cleaning rocket ship designed to go much deeper into the "data universe" without exploding.
Here is how it works, using simple analogies:
1. The "Level-Aware" Elevator
Most stacking systems are like a building with only two floors. RocketStack is a skyscraper with 10 floors.
- Floor 1: You take the original data (the raw ingredients) and mix it with the first round of expert opinions.
- Floors 2–10: You keep going up. But here's the magic: at every floor, the system checks who is doing a good job and who is just making noise.
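The floor-by-floor idea can be sketched in a few lines of toy code. This is a hedged illustration, not the paper's implementation: the random "experts," the 4-experts-per-level count, and the 3 levels are all placeholder choices made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 5 raw features, binary target (hypothetical).
X_raw = rng.normal(size=(200, 5))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(int)

def expert_predict(X, w):
    """A toy 'expert': a fixed random linear scorer standing in for a real model."""
    return 1 / (1 + np.exp(-X @ w))

n_levels = 3          # the paper stacks up to 10 levels
X_level = X_raw       # the first floor starts from the raw ingredients
for level in range(n_levels):
    # Each expert on this floor adds its opinion as a new feature column.
    preds = np.column_stack([
        expert_predict(X_level, rng.normal(size=X_level.shape[1]))
        for _ in range(4)  # 4 experts per floor (arbitrary)
    ])
    # Next floor sees everything from below plus the new opinions.
    X_level = np.hstack([X_level, preds])

print(X_level.shape)  # the suitcase of features keeps growing
```

Notice the suitcase problem already: starting from 5 features, three floors of 4 experts each leave you carrying 17 columns, which is exactly why the pruning and compression tricks below exist.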
2. The "Pruning" Gardener (Cutting the Dead Branches)
As you go up the floors, the number of experts and the amount of information can get out of control.
- The Problem: If you keep every expert, the system becomes bloated and slow.
- The RocketStack Solution: Imagine a gardener with a pair of shears. At every level, the gardener looks at the "score" of each expert. If an expert is performing poorly, they get cut (pruned) and removed from the team.
- The "Gaussian Noise" Trick: Sometimes, the gardener is too strict. They might cut a good expert just because they had one bad day. To fix this, RocketStack adds a tiny bit of "static" or "noise" to the scores before cutting. It's like telling the gardener, "Don't be too harsh; maybe that expert is just having a rough moment." This keeps a diverse team of experts alive longer, preventing the system from getting stuck on a mediocre solution.
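The gardener's noisy shears can be shown with a toy score table. This is a sketch of the general idea only: the scores, the threshold of 0.75, and the noise level of 0.05 are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical validation scores for 8 experts at one level.
scores = np.array([0.81, 0.79, 0.62, 0.77, 0.55, 0.80, 0.73, 0.68])

def prune(scores, threshold, noise_sd=0.0):
    """Keep the experts whose (optionally noise-perturbed) score clears the bar."""
    jittered = scores + rng.normal(0.0, noise_sd, size=scores.shape)
    return np.flatnonzero(jittered >= threshold)

strict = prune(scores, threshold=0.75)                   # hard cut
lenient = prune(scores, threshold=0.75, noise_sd=0.05)   # Gaussian "static" added

print("kept without noise:", strict)
print("kept with noise:   ", lenient)
```

With the hard cut, the borderline experts (scores 0.73 and 0.68) are always gone; with a little Gaussian static, an expert "having a rough moment" sometimes survives to the next floor, which is what keeps the team diverse.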
3. The "Compression" Vacuum (Squeezing the Suitcase)
As you go up the floors, the "suitcase" of features (information) gets heavier and heavier.
- The Problem: A suitcase that is too heavy is hard to carry (slow to compute).
- The RocketStack Solution: Instead of squeezing the suitcase at every step (which might crush the important stuff), RocketStack waits. It lets the suitcase fill up for a few floors, then hits a periodic compression button (at floors 3, 6, and 9).
- The Analogy: Imagine packing for a trip. If you pack a shirt, then immediately fold it, then pack a sock, then immediately fold it, you waste time. Instead, you pack a whole layer, then stop and vacuum-seal that layer to make it compact. Then you add the next layer. This keeps the suitcase manageable without losing the clothes you need.
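The vacuum-seal schedule can be sketched as a loop over floors. Everything here is a stand-in: the top-variance column picker is a toy compressor (the paper's actual compression step may differ), and the numbers (3 new columns per floor, keep 8 after compressing) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))  # start with 6 raw features

def compress(X, n_keep):
    """Toy compression: keep the highest-variance columns."""
    order = np.argsort(X.var(axis=0))[::-1]
    return X[:, order[:n_keep]]

widths = []
for level in range(1, 10):
    # Each floor packs 3 new meta-feature columns into the suitcase.
    X = np.hstack([X, rng.normal(size=(X.shape[0], 3))])
    # Periodic compression: only at floors 3, 6, and 9, vacuum-seal the layer.
    if level % 3 == 0:
        X = compress(X, n_keep=8)
    widths.append(X.shape[1])

print(widths)  # width grows for a few floors, then drops at 3, 6, 9
```

The printed widths form a sawtooth: [9, 12, 8, 11, 14, 8, 11, 14, 8]. The suitcase fills for two floors, gets squeezed on the third, and never grows without bound.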
4. The "No-Optimization" Surprise
Usually, in machine learning, you spend a lot of time tweaking the settings of your base experts (hyperparameter optimization) to make them perfect before you start.
- The RocketStack Finding: Surprisingly, RocketStack works better if you don't obsess over tuning the base experts perfectly at the start.
- The Analogy: Think of it like a sports team. If you hire a coach who tries to make every player perfect before the season starts, they might get rigid. But if you hire a team of "good enough" players and let the RocketStack system (the coach during the season) prune the weak ones and compress the strategy as the season goes on, the team actually performs better in the long run. The system learns to handle the "imperfections" and turns them into strengths.
The Result
By using these tricks—pruning the weak, compressing the data periodically, and adding a little bit of "noise" to keep things diverse—RocketStack can stack models 10 levels deep.
- It's faster: It doesn't get bogged down by too much data.
- It's smarter: It avoids the "garbage in, garbage out" problem of deep stacking.
- It wins: On 33 different real-world datasets, it beat the current best "deep" models (like Deep Forest and TabNet), even without spending extra time tuning the base models.
In short: RocketStack is a smart, self-cleaning, self-compressing team of experts that gets better the deeper you go, proving that you don't need to stop at two floors to build a skyscraper of intelligence.