XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost

XGenBoost introduces two XGBoost-based generative models—a DDIM variant for small datasets and a hierarchical autoregressive model for large-scale data—that leverage tree-ensemble inductive biases to outperform existing methods in mixed-type tabular synthesis with lower training costs.

Jim Achterberg, Marcel Haas, Bram van Dijk, Marco Spruit

Published 2026-03-10

Imagine you have a massive spreadsheet filled with real-world data—like medical records, bank loan applications, or census information. This data is a goldmine for training AI, but it's also a privacy nightmare. You can't just share the real spreadsheet because it contains people's sensitive personal information.

Enter XGenBoost. Think of it as a master forger that can create a "fake" spreadsheet that looks, feels, and behaves exactly like the real one, but contains zero real people.

The paper introduces a new way to build this forger using XGBoost (a tool usually used for making predictions, like "will this loan default?") instead of the usual heavy-duty AI tools (Deep Neural Networks) that require massive supercomputers.

Here is how XGenBoost works, broken down into two simple scenarios:

1. The "Small Dataset" Problem: The Art Restorer

The Scenario: You have a small, precious dataset (like a rare medical study with only 500 patients).
The Old Way: Trying to use a giant, complex AI (like a Diffusion Model) on this is like trying to restore a tiny, fragile watercolor painting with a sledgehammer. It's overkill, expensive, and might ruin the delicate details.
The XGenBoost Solution (XGenB-DF):

  • The Metaphor: Imagine a master art restorer who looks at the painting and tries to "un-blur" it. They don't just guess; they use a very specific, efficient tool (XGBoost) to figure out how to remove the noise and reveal the original image.
  • How it works: The model starts with a pile of random noise (static) and slowly cleans it up, step-by-step, until a perfect fake dataset emerges.
  • The Magic Trick: Real data has a mix of numbers (age, income) and categories (gender, city). Most AI tools force everything into numbers first (like turning "Red" into "1" and "Blue" into "2"), which invents a fake ordering—is Blue really "more" than Red? XGenBoost handles both types natively, like a chef who knows exactly how to handle both a delicate fish and a tough steak without turning them into a smoothie.
  • Result: It creates high-quality fakes from small data without needing a supercomputer.

2. The "Large Dataset" Problem: The Assembly Line

The Scenario: You have a massive dataset (like millions of bank transactions).
The Old Way: The "Art Restorer" method above is too slow for millions of rows; denoising each one step by step would take forever.
The XGenBoost Solution (XGenB-AR):

  • The Metaphor: Imagine a factory assembly line. Instead of trying to build the whole car at once, the factory builds it part by part. First, they build the engine. Then, they build the chassis based on the engine. Then the wheels based on the chassis.
  • How it works: This is called an Autoregressive model. It asks a series of simple questions: "Given the age and income, what is the likely city?" Then, "Given the age, income, and city, what is the likely job?"
  • The Secret Weapon: Instead of using a giant, slow neural network to answer these questions, it uses XGBoost classifiers. Think of XGBoost as a team of expert detectives. Each detective is great at spotting patterns in specific clues. They work together in a hierarchy to build the fake data row by row.
  • Result: It can churn out millions of fake records in minutes on a standard computer, whereas other methods might take hours or days on expensive GPUs.

Why is this a Big Deal?

  1. Democratization (The "Forklift" vs. The "Ferrari"):
    Most current AI tools for making fake data are like Ferraris. They are fast and powerful, but they only work if you have a massive garage (supercomputers/GPUs) and a lot of money. XGenBoost is like a reliable, efficient forklift. It gets the job done just as well, but it runs on a standard truck engine (a regular CPU). This means researchers in developing countries or small hospitals can use it without needing a tech giant's budget.

  2. Respecting the Data's Nature:
    Real-world data is messy. It has categories (like "Yes/No") and numbers (like "Salary").

    • Old AI: Often forces categories into numbers first (e.g., One-Hot Encoding, which explodes a single "City" column into dozens of 0/1 columns). It's like trying to measure the "taste" of a fruit by weighing it. Clumsy, and the category's natural structure gets lost.
    • XGenBoost: Uses the natural strength of tree-based models (XGBoost) which are already experts at splitting data based on categories. It doesn't force a square peg into a round hole.
  3. Privacy vs. Quality Balance:
    The paper shows that XGenBoost's synthetic data is realistic enough that discriminator models cannot reliably tell it apart from the real data (high fidelity). At the same time, it includes safeguards (like "dropout") to reduce the risk of accidentally memorizing and leaking a specific person's real record.

The Bottom Line

The authors are saying: "Stop trying to force square pegs (neural networks) into round holes (tabular data)."

By using tools that were already designed to handle mixed-type data efficiently (XGBoost), they built a system that is cheaper, faster, and just as accurate as the expensive, heavy-duty AI methods currently in use. It's a "data-first" approach that respects the nature of the information it's trying to synthesize.