Imagine you have a brilliant, super-smart detective (a machine learning model) who is great at solving crimes (making predictions). This detective has a massive library of clues, a huge notebook of rules, and a team of assistants. However, you need to send this detective to work in a tiny, remote cabin in the woods (a small IoT device like a smart thermostat or a farm sensor) that runs on a single AA battery and has very little storage space.
If you try to pack the detective's entire massive library and notebook into that tiny cabin, it won't fit. The cabin will collapse, or the battery will die in an hour.
This is the problem the paper "Boosted Trees on a Diet" solves. The authors created a way to shrink these "detectives" (machine learning models) down so they can fit into tiny devices without losing their smarts. They call their method ToaD (Trees on a Diet).
Here is how they did it, using simple analogies:
1. The Problem: The "Full Suitcase" vs. The "Backpack"
Usually, when you train a smart model, it learns by looking at thousands of different clues (features) and setting thousands of different rules (thresholds).
- The Old Way: Imagine the detective writes down every single rule on a separate piece of paper. If one rule is "If the temperature is above 20°C," that goes on one sheet; if another is "If the temperature is above 21°C," that goes on a second sheet. Even though the rules are almost identical, they are stored separately. This takes up a huge suitcase.
- The Goal: We need a backpack. We need to fit all the smarts into a tiny space.
2. The Solution: The "Shared Dictionary" (Global Lookups)
The authors realized that many rules are actually the same across different parts of the detective's brain.
- The Analogy: Instead of writing "20°C" on a piece of paper every time it appears, the detective creates a Master Dictionary at the front of the cabin.
- The dictionary lists: "Entry 1 = 20°C", "Entry 2 = 21°C".
- Now, instead of writing the full number "20°C" everywhere, the detective just writes the number "1".
- If the detective needs to use "20°C" again in a different rule, they just point to "Entry 1" in the dictionary.
- The Result: You save massive amounts of space because you aren't repeating the same numbers over and over. You are just using short codes.
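The dictionary idea can be sketched in a few lines of Python. This is an illustration of the general technique, not the paper's actual code; the function and field names here are made up for the example.

```python
# Minimal sketch of the "shared dictionary" idea: store each distinct
# threshold once in a global table and have every tree node reference it
# by a short index instead of repeating the full value.

def build_threshold_table(trees):
    """Collect every distinct threshold used by the trees and replace
    each node's threshold with an index into a shared table."""
    table = []       # the global "Master Dictionary"
    index_of = {}    # threshold value -> position in the table
    for tree in trees:
        for node in tree:
            t = node["threshold"]
            if t not in index_of:
                index_of[t] = len(table)
                table.append(t)
            node["threshold_idx"] = index_of[t]  # short code, not the value
            del node["threshold"]
    return table

# Two tiny trees that reuse the threshold 20.0 three times:
trees = [
    [{"feature": 0, "threshold": 20.0}, {"feature": 1, "threshold": 21.0}],
    [{"feature": 0, "threshold": 20.0}, {"feature": 1, "threshold": 20.0}],
]
table = build_threshold_table(trees)
print(table)                                           # [20.0, 21.0]
print([n["threshold_idx"] for t in trees for n in t])  # [0, 1, 0, 0]
```

Each 20.0 is now stored once; every node that needs it just carries the index 0.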
3. The Training: "The Strict Coach" (Penalties)
How do you get the detective to stop writing new rules and start using the dictionary? You need a strict coach during the training phase.
- The Analogy: Imagine the detective is learning to solve crimes. Every time they want to invent a new rule or use a new temperature number that isn't in the dictionary yet, the coach yells, "That costs extra points!"
- The Trick: The coach makes it "expensive" (in terms of the model's internal score) to use a new feature or a new number. The detective quickly realizes, "Hey, it's cheaper to just reuse the old numbers I already have in the dictionary."
- The Outcome: The detective naturally starts reusing the same clues and rules over and over. This forces the model to become "compact" by design, rather than just cutting things out at the end.
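The "strict coach" can be sketched as a penalty subtracted from the split score during training. The penalty values and the scoring function below are illustrative assumptions, not the paper's exact formulation:

```python
# Minimal sketch of the training penalty: when scoring candidate splits,
# charge a cost the first time a feature or threshold is used, so the
# booster prefers reusing entries already in the dictionary.

FEATURE_PENALTY = 0.5    # cost of introducing a brand-new feature
THRESHOLD_PENALTY = 0.2  # cost of introducing a brand-new threshold

def penalized_gain(raw_gain, feature, threshold, used_features, used_thresholds):
    """Return the split gain minus penalties for any new dictionary entries."""
    gain = raw_gain
    if feature not in used_features:
        gain -= FEATURE_PENALTY
    if threshold not in used_thresholds:
        gain -= THRESHOLD_PENALTY
    return gain

used_features, used_thresholds = {0}, {20.0}

# A slightly better raw split that needs new entries can lose to a
# slightly worse split that reuses existing ones:
reuse = penalized_gain(1.0, 0, 20.0, used_features, used_thresholds)  # 1.0
novel = penalized_gain(1.4, 2, 23.5, used_features, used_thresholds)  # 0.7
print(reuse > novel)  # True: the "cheaper" reused split wins
```

This is why the compactness emerges during training: reuse is literally worth more points than novelty.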
4. The Packing: "Bit-Packing" (Efficient Storage)
Finally, even the dictionary needs to be packed efficiently.
- The Analogy: In a normal computer, a "Yes/No" answer might take up a whole page of paper just to be safe. But in this tiny cabin, the authors realized, "We only need one tiny dot to say Yes or No."
- The Method: They use a technique called Bit-wise Encoding. Instead of using big, bulky storage for every number, they squeeze the information into the smallest possible bits.
- If a rule only needs to choose between 2 options, they use 1 bit.
- If a rule needs to choose between 4 options, they use 2 bits.
- They strip away all the "padding" and extra space that normal computers use.
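Bit-packing itself is a generic technique, and a toy version fits in a few lines. This sketch is not the paper's exact encoding, just the core idea of using 2 bits where a normal program would spend a whole byte:

```python
# Minimal sketch of bit-packing: squeeze a list of small indices into the
# fewest bits that can represent them, instead of one byte (or more) each.

def pack(values, bits):
    """Pack each value into `bits` bits inside a single integer buffer."""
    buf = 0
    for i, v in enumerate(values):
        assert v < (1 << bits), "value does not fit in the given bit width"
        buf |= v << (i * bits)
    return buf

def unpack(buf, bits, count):
    mask = (1 << bits) - 1
    return [(buf >> (i * bits)) & mask for i in range(count)]

# Four dictionary indices, each in 0..3, need only 2 bits apiece:
indices = [0, 3, 1, 2]
packed = pack(indices, bits=2)
print(packed.bit_length() <= 8)         # True: all four values fit in one byte
print(unpack(packed, bits=2, count=4))  # [0, 3, 1, 2]
```

Stored naively as one byte per index, the same list would take four bytes; packed, it takes one.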
Why Does This Matter?
Before this, if you wanted a smart AI on a tiny device (like a sensor in a remote forest that monitors for wildfires), you had to either:
- Send all the data to a giant server in the cloud (which uses a lot of battery and needs internet).
- Use a very simple "dumb" model that fits on the device but isn't accurate enough.
With ToaD:
- The device can be smart (it uses the same powerful "Boosted Tree" logic as big computers).
- It fits in a tiny space (4 to 16 times smaller than before!).
- It runs on battery power for months or years because it doesn't need to constantly talk to the cloud.
Summary
Think of ToaD as a master packer who helps you fit a whole library into a matchbox. They do this by:
- Forcing reuse: Making the model reuse the same clues and rules instead of inventing new ones.
- Creating a shared dictionary: Storing common numbers once and pointing to them everywhere.
- Squeezing the data: Packing the information so tightly that it takes up the absolute minimum amount of space.
This allows "Tiny Machines" to become "Smart Machines," enabling them to make decisions right where the data is collected, without needing a power plant or an internet connection.