TabStruct: Measuring Structural Fidelity of Tabular Data

This paper introduces TabStruct, a comprehensive evaluation benchmark, together with a novel "global utility" metric that jointly assesses the structural fidelity and conventional performance of tabular data generators across 29 real-world datasets, without requiring ground-truth causal structures.

Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik

Published 2026-03-06

Imagine you are a chef trying to teach a robot to cook a perfect steak. You give the robot a database of thousands of real steak recipes (the "real data"). The robot then tries to generate its own new recipes (the "synthetic data").

In the past, people checked if the robot was doing a good job by asking two simple questions:

  1. Does it taste like a steak? (Density Estimation: Does the new recipe look statistically similar to the old ones?)
  2. Can you use it to win a cooking contest? (ML Efficacy: If you use the robot's recipes to train a new chef, does that chef win awards?)
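The two classic checks can be sketched crudely in Python. This is an illustrative toy, not the benchmark's actual implementation: real evaluations use stronger distributional statistics and full ML pipelines, and the nearest-centroid classifier here is a deliberately simple stand-in.

```python
import numpy as np

def marginal_gap(real, synth):
    """'Does it taste like a steak?' -- crude density check: the largest
    per-column difference in mean or standard deviation between the real
    and synthetic tables (lower = more statistically similar)."""
    return max(np.abs(real.mean(0) - synth.mean(0)).max(),
               np.abs(real.std(0) - synth.std(0)).max())

def ml_efficacy(synth_X, synth_y, real_X, real_y):
    """'Can you win a contest with it?' -- train a tiny nearest-centroid
    classifier on the SYNTHETIC data, then report its accuracy on the
    REAL data. High accuracy means the synthetic data was good enough
    to train a useful model."""
    centroids = {c: synth_X[synth_y == c].mean(0) for c in np.unique(synth_y)}
    labels = np.array(list(centroids))
    preds = np.array([labels[np.argmin([np.linalg.norm(x - centroids[c])
                                        for c in labels])]
                      for x in real_X])
    return (preds == real_y).mean()
```

Both scores can be high even when the generator has broken the data's internal cause-and-effect structure, which is exactly the blind spot the paper targets.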

The Problem:
The authors of TabStruct realized there's a hidden flaw: a robot could cheat. It could memorize the flavor of the steak perfectly and help a chef win a contest, yet completely mess up the physics of cooking. For example, it might generate a recipe that says, "If you add more heat, the steak gets colder."

In the real world, data isn't just a list of numbers; it has a hidden "skeleton" or "causal structure." Just like gravity always pulls things down, variables in a dataset often have cause-and-effect relationships (e.g., "More rain causes more grass growth"). If a robot generates data that breaks these rules, it's useless for scientific discovery or understanding the real world, even if it looks good on a test score.

The Solution: TabStruct
The paper introduces a new way to test these data-generating robots called TabStruct. Think of it as a "Physics Test" for data.

Here is how they do it, using simple analogies:

1. The "Toy vs. Real World" Problem

Previously, researchers tested robots on "Toy Datasets"—simple, made-up worlds where they knew the exact rules (like a video game with known physics). But in the real world (like healthcare or finance), we don't have a "rulebook" to check against. We don't know the true causal structure of a disease or a stock market crash.

The Analogy: Imagine testing a self-driving car only in a simulator where you know exactly where every pothole is. It passes! But when you put it on a real road with unknown potholes, it crashes.

2. The New Metric: "Global Utility"

Since we can't check the "rulebook" for real-world data, the authors invented a clever trick called Global Utility.

The Analogy: Imagine you have a puzzle.

  • Old way: You check if the puzzle pieces look like the picture on the box (Density Estimation).
  • New way (Global Utility): You take every single piece of the puzzle, hide it, and ask the robot to guess what that piece is based on all the other pieces.
    • If the robot is good, it can guess the hidden piece perfectly because it understands how the pieces fit together (the causal structure).
    • If the robot is bad, it guesses randomly because it only memorized the picture, not the logic of how the pieces connect.

By testing every variable this way, they get a score that tells you: "Does this robot understand the deep logic of the data, or is it just faking it?"
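In code, the hide-one-piece game looks roughly like the sketch below. This is a minimal numpy version of the idea using linear least-squares predictors; TabStruct's actual metric will differ in the predictors and aggregation it uses, but the spirit is the same: fit a predictor for each hidden column on the synthetic table, then score it on the real table.

```python
import numpy as np

def global_utility_scores(synthetic, real):
    """For each column j: fit a linear predictor of column j from the
    remaining columns on the SYNTHETIC table, then measure how well it
    predicts column j on the REAL table (R^2 per column).

    High scores across all columns suggest the generator preserved the
    inter-feature structure; near-zero scores suggest it only matched
    the marginals. Illustrative sketch, not the paper's exact metric.
    """
    n_cols = synthetic.shape[1]
    scores = []
    for j in range(n_cols):
        rest = [c for c in range(n_cols) if c != j]
        # Fit on synthetic data: least squares with a bias term.
        Xs = np.c_[synthetic[:, rest], np.ones(len(synthetic))]
        w, *_ = np.linalg.lstsq(Xs, synthetic[:, j], rcond=None)
        # Evaluate on real data.
        Xr = np.c_[real[:, rest], np.ones(len(real))]
        yr = real[:, j]
        pred = Xr @ w
        ss_res = np.sum((yr - pred) ** 2)
        ss_tot = np.sum((yr - yr.mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return scores
```

A generator that shuffles each column independently would match every marginal perfectly yet score near zero here, because no column can be recovered from the others.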

3. The Big Discovery

The authors tested 13 different types of data-generating robots (from simple math tricks to complex AI) on 29 different real-world datasets.

The Surprise:

  • The "Cheaters": Some popular methods (like SMOTE) were great at making data that looked like the original and helped win prediction contests. But when you tested their "physics" (Global Utility), they failed miserably. They broke the causal rules.
  • The "Architects": Newer models based on Diffusion (a technique similar to how AI generates images by slowly adding and removing noise) turned out to be the best at preserving the true "skeleton" of the data. They didn't just mimic the surface; they understood the deep connections.

Why This Matters

If you are a doctor using AI to generate synthetic patient data to train a new diagnostic tool, you don't just want data that looks real. You want data that respects the laws of biology. If the AI believes "taking aspirin causes a fever," the tool trained on its data will learn the wrong lesson and hurt patients.

TabStruct gives us a way to check if the AI is respecting the laws of the universe (or the specific domain) before we trust it with real-world decisions. It shifts the focus from "Does it look good?" to "Does it make sense?"

In a nutshell:

  • Old Test: "Does the fake data look like the real data?"
  • TabStruct Test: "Does the fake data follow the same hidden rules and cause-and-effect relationships as the real data?"
  • Result: Many popular tools are great at faking the look but terrible at understanding the logic. Metrics like "Global Utility" are needed to find the ones that truly understand the data.
