XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost

XGenBoost introduces two XGBoost-based generative models—a DDIM variant for small datasets and a hierarchical autoregressive model for large-scale data—that leverage tree-ensemble inductive biases to outperform existing methods in mixed-type tabular synthesis with lower training costs.

Jim Achterberg, Marcel Haas, Bram van Dijk, Marco Spruit

Published 2026-03-10

Imagine you have a massive spreadsheet filled with real-world data—like medical records, bank loan applications, or census information. This data is a goldmine for training AI, but it's also a privacy nightmare. You can't just share the real spreadsheet because it contains people's sensitive personal information.

Enter XGenBoost. Think of it as a master forger that can create a "fake" spreadsheet that looks, feels, and behaves exactly like the real one, but contains zero real people.

The paper introduces a new way to build this forger using XGBoost (a tool usually used for making predictions, like "will this loan default?") instead of the usual heavy-duty AI tools (Deep Neural Networks) that require massive supercomputers.

Here is how XGenBoost works, broken down into two simple scenarios:

1. The "Small Dataset" Problem: The Art Restorer

The Scenario: You have a small, precious dataset (like a rare medical study with only 500 patients).
The Old Way: Trying to use a giant, complex AI (like a Diffusion Model) on this is like trying to restore a tiny, fragile watercolor painting with a sledgehammer. It's overkill, expensive, and might ruin the delicate details.
The XGenBoost Solution (XGenB-DF):

  • The Metaphor: Imagine a master art restorer who looks at the painting and tries to "un-blur" it. They don't just guess; they use a very specific, efficient tool (XGBoost) to figure out how to remove the noise and reveal the original image.
  • How it works: The model starts with a pile of random noise (static) and slowly cleans it up, step-by-step, until a perfect fake dataset emerges.
  • The Magic Trick: Real data has a mix of numbers (age, income) and categories (gender, city). Most AI tools force everything into numbers first (like turning "Red" into "1" and "Blue" into "2"), which invents a fake ordering—is Blue really "more" than Red? XGenBoost handles both types natively, like a chef who knows exactly how to handle both a delicate fish and a tough steak without turning them into a smoothie.
  • Result: It creates high-quality fakes from small data without needing a supercomputer.

2. The "Large Dataset" Problem: The Assembly Line

The Scenario: You have a massive dataset (like millions of bank transactions).
The Old Way: The "Art Restorer" method above is too slow for millions of rows; denoising each one step by step would take forever.
The XGenBoost Solution (XGenB-AR):

  • The Metaphor: Imagine a factory assembly line. Instead of trying to build the whole car at once, the factory builds it part by part. First, they build the engine. Then, they build the chassis based on the engine. Then the wheels based on the chassis.
  • How it works: This is called an Autoregressive model. It asks a series of simple questions: "Given the age and income, what is the likely city?" Then, "Given the age, income, and city, what is the likely job?"
  • The Secret Weapon: Instead of using a giant, slow neural network to answer these questions, it uses XGBoost classifiers. Think of XGBoost as a team of expert detectives. Each detective is great at spotting patterns in specific clues. They work together in a hierarchy to build the fake data row by row.
  • Result: It can churn out millions of fake records in minutes on a standard computer, whereas other methods might take hours or days on expensive GPUs.

Why is this a Big Deal?

  1. Democratization (The "Forklift" vs. The "Ferrari"):
    Most current AI tools for making fake data are like Ferraris. They are fast and powerful, but they only work if you have a massive garage (supercomputers/GPUs) and a lot of money. XGenBoost is like a reliable, efficient forklift. It gets the job done just as well, but it runs on a standard truck engine (a regular CPU). This means researchers in developing countries or small hospitals can use it without needing a tech giant's budget.

  2. Respecting the Data's Nature:
    Real-world data is messy. It has categories (like "Yes/No") and numbers (like "Salary").

    • Old AI: Often forces categories into numbers first (e.g., One-Hot Encoding, which explodes a single "City" column into dozens of 0/1 columns). It's like trying to measure the "taste" of a fruit by weighing it. Clumsy, and the category's natural structure gets lost.
    • XGenBoost: Uses the natural strength of tree-based models (XGBoost) which are already experts at splitting data based on categories. It doesn't force a square peg into a round hole.
  3. Privacy vs. Quality Balance:
    The paper shows that XGenBoost's synthetic data is realistic enough that discriminator models cannot reliably tell it apart from the real data (high fidelity). At the same time, it includes safeguards (like "dropout") to reduce the risk of accidentally memorizing and leaking a specific person's real record.

The Bottom Line

The authors are saying: "Stop trying to force square pegs (neural networks) into round holes (tabular data)."

By using tools that were already designed to handle mixed-type data efficiently (XGBoost), they built a system that is cheaper, faster, and just as accurate as the expensive, heavy-duty AI methods currently in use. It's a "data-first" approach that respects the nature of the information it's trying to synthesize.