ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning

ReTabSyn is a reinforcement learning-based pipeline for realistic tabular data synthesis that prioritizes learning conditional distributions to better preserve feature correlations and improve downstream model utility in low-data, imbalanced, and shifted settings.

Xiaofeng Lin, Seungbae Kim, Zhuoya Li, Zachary DeSoto, Charles Fleming, Guang Cheng

Published Thu, 12 Ma

Imagine you are trying to teach a robot chef how to cook a perfect meal. Usually, you'd give the robot a massive library of recipes (data) and say, "Learn everything about ingredients, cooking times, and flavors."

But what if you only have three recipes to teach the robot? And what if two of those recipes are for "Spicy Tacos" and only one is for "Vegetable Soup"?

If you ask the robot to memorize every detail of those three recipes (the exact brand of tortilla, the specific hot sauce, the weather on the day you cooked), it will get confused. It might start believing that "Spicy Tacos" are always made with "Vegetable Soup" ingredients, or that a "CEO" (a high-paying job) can only earn less than $50,000 a year. This is the problem with current AI data generators: they try to learn the whole picture when they don't have enough information, leading to weird, unrealistic results.

ReTabSyn is a new method that changes the game. Instead of asking the robot to memorize the whole library, it teaches the robot to focus on the most important rule: "If the ingredients are X, the dish should be Y."

Here is how it works, broken down into simple concepts:

1. The Problem: The "Over-Thinker" Robot

Current AI models try to learn the Joint Distribution. Think of this as trying to memorize every single detail of a photo: the color of the sky, the texture of the grass, the number of blades of grass, and the exact position of a cloud.

  • The Issue: When data is scarce (small sample size) or unbalanced (mostly one type of data), the robot gets overwhelmed. It starts making up facts. It might generate a record of a "CEO" earning less than a minimum wage, or a "45-year-old" who is a "kindergarten teacher" but has a salary of $10 million. These are unrealistic and useless for training other AI models.

2. The Solution: The "Smart Tutor" (ReTabSyn)

The authors argue: "Why memorize the whole photo? Just learn the cause-and-effect."

  • The Shift: Instead of learning everything, ReTabSyn focuses on the Conditional Distribution: Given these features (Job, Age), what is the likely outcome (Income)?
  • The Analogy: Imagine a tutor who doesn't care if you know the exact color of the student's shirt. The tutor only cares that if the student says "I am a CEO," the tutor knows the salary should be high. If the student says "I am a CEO" but the salary is low, the tutor immediately says, "No, that's wrong!"
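The distinction can be sketched with a toy, entirely hypothetical dataset (the rows and probabilities below are made up for illustration, not taken from the paper). Estimating the joint distribution means counting whole rows at once, while the conditional view asks "given the job, what is the income?":

```python
from collections import Counter, defaultdict

# Toy dataset of (job, income_bracket) rows -- hypothetical values.
rows = [
    ("CEO", "high"), ("CEO", "high"), ("CEO", "high"),
    ("teacher", "mid"), ("teacher", "mid"), ("teacher", "low"),
]

# Joint modeling: estimate P(job, income) over whole rows directly.
joint = Counter(rows)
p_joint = {k: v / len(rows) for k, v in joint.items()}

# Conditional modeling: estimate P(income | job) -- the "cause and effect" rule.
by_job = defaultdict(Counter)
for job, income in rows:
    by_job[job][income] += 1
p_cond = {
    job: {inc: c / sum(cnt.values()) for inc, c in cnt.items()}
    for job, cnt in by_job.items()
}

print(p_cond["CEO"])             # the rule "CEOs earn high" is exact: {'high': 1.0}
print(p_joint[("CEO", "high")])  # 0.5 -- diluted by how many CEOs happen to be in the data
```

The conditional estimate captures the rule "if CEO, then high income" even with three examples, whereas the joint estimate entangles that rule with how often CEOs appear at all.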

3. How It Learns: The "Good vs. Bad" Game (Reinforcement Learning)

ReTabSyn uses a technique called Direct Preference Optimization (DPO). Think of this as a game of "Spot the Fake."

  • The Setup: The AI generates a plausible row of data (e.g., "Job: CEO, Income: $200k").
  • The Trick: The system creates a "Bad" version by breaking the logic (e.g., dropping the income to $20k while keeping the "CEO" job, or swapping a low-income row's job to "CEO").
  • The Feedback: The system tells the AI: "This 'Bad' version is wrong. The 'Good' version is right."
  • The Result: The AI learns to punish unrealistic combinations and reward logical ones. It doesn't need a human to check every answer; the rules of logic (like "CEOs usually make more money") act as the referee.
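The "Good vs. Bad" game above can be sketched as a standard DPO objective on one preference pair. This is a minimal illustration, not the paper's implementation: the log-probabilities and the beta value are hypothetical numbers chosen to show the behavior.

```python
import math

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """DPO loss for one (good, bad) pair: push the model to prefer the
    'good' row over the 'bad' row, relative to a frozen reference model."""
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Hypothetical rows the generator could produce:
#   good: "Job: CEO, Income: $200k"  (logically consistent)
#   bad:  "Job: CEO, Income: $20k"   (same row with the logic broken)

# If the model already assigns higher log-prob to the good row, loss is small:
small = dpo_loss(logp_good=-2.0, logp_bad=-6.0, ref_logp_good=-4.0, ref_logp_bad=-4.0)
# If it prefers the bad row, loss is large, and gradients correct it:
large = dpo_loss(logp_good=-6.0, logp_bad=-2.0, ref_logp_good=-4.0, ref_logp_bad=-4.0)
print(small < large)  # True
```

No human labeler appears anywhere in this loop: the "bad" row is manufactured automatically by violating a known rule, which is exactly what lets the logic act as the referee.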

4. Why It's a Big Deal

  • It Works with Less Data: Because it focuses on the important rules (logic) rather than memorizing every detail, it works incredibly well even when you only have a tiny amount of data.
  • It Handles Imbalance: If you have 99% "Healthy Patients" and 1% "Sick Patients," other AIs ignore the sick ones. ReTabSyn learns the specific rules that make a patient "sick," so it can generate realistic examples of sick patients to help doctors train better.
  • It Supervises Itself: It doesn't need a human expert to label every single piece of data. It uses the structure of the data itself to teach the AI, which is faster and cheaper.
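The imbalance point can be made concrete with a toy conditional sampler. The hand-written rules below merely stand in for a learned conditional model, and the feature ranges are invented for illustration: the key idea is that fixing the rare condition first lets you draw as many minority examples as you need, regardless of the 99/1 split in the real data.

```python
import random

random.seed(0)

def sample_patient(condition):
    """Toy stand-in for a learned conditional generator P(features | condition)."""
    if condition == "sick":
        return {"condition": "sick", "temp_f": round(random.uniform(100.5, 104.0), 1)}
    return {"condition": "healthy", "temp_f": round(random.uniform(97.0, 99.0), 1)}

# Generate a balanced synthetic batch on demand, even if the real data is 99% healthy:
batch = ([sample_patient("sick") for _ in range(5)]
         + [sample_patient("healthy") for _ in range(5)])
print(sum(p["condition"] == "sick" for p in batch))  # 5
```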

The Bottom Line

ReTabSyn is like a smart student who, instead of trying to memorize an entire encyclopedia, learns the logic of the universe. When data is scarce, this logic-based approach prevents the AI from hallucinating nonsense (like a CEO earning $20k) and ensures that the fake data it creates is actually useful for training real-world AI models in fields like healthcare and finance.

It's not just about making fake data; it's about making smart fake data that respects the rules of reality.