Imagine you are trying to teach a brilliant but slightly confused robot how to bake a cake. You want the robot to learn the recipe so well that it can bake a new cake from scratch that tastes just like the original, but without using any of the actual ingredients from the first one. This is what synthetic data generation is: creating fake data that looks and acts like real data, which is super useful for things like testing new medicines or financial models without risking real people's privacy.
The robot in this story is called TabPFN. It's a very smart AI that has read millions of fake recipes (datasets) and learned how to bake. However, the researchers in this paper found a major glitch in how TabPFN works.
The Problem: The "Wrong Order" Glitch
TabPFN is like a robot that reads a recipe one word at a time, from left to right. It guesses the next ingredient based only on the ones it has already read.
- The Real World: In a real cake recipe, some ingredients depend on others. You can't put the frosting on before the cake is baked. The "cause" (baking) must happen before the "effect" (frosting).
- The Glitch: If you give the robot a recipe where the words are scrambled (e.g., "Frosting" comes before "Flour"), the robot gets confused. It tries to guess the flour based on the frosting. It starts inventing fake connections, like thinking "Frosting causes Flour to appear."
In the real world, this is like thinking that umbrellas cause rain just because you always see umbrellas when it rains. The robot creates spurious correlations (fake links) that mess up the data. If you use this bad data to test a new drug, you might think the drug works when it actually doesn't, or vice versa.
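To see why column order matters so much, here is a minimal sketch of the factorization a left-to-right sampler is forced to learn (an assumption about how TabPFN-style autoregressive samplers work in general; the function name and recipe labels are illustrative, not the paper's API):

```python
def autoregressive_factorization(columns):
    """Return the chain of conditionals a left-to-right sampler must model.

    Each column is predicted only from the columns that came before it,
    so the column order fixes the factorization of the joint distribution.
    """
    factors = []
    for i, col in enumerate(columns):
        given = columns[:i]
        factors.append(f"P({col} | {', '.join(given)})" if given else f"P({col})")
    return factors

# Causal order: the sampler models the true direction, bake -> frost.
print(autoregressive_factorization(["bake", "frost"]))

# Scrambled order: the sampler is forced to model P(bake | frost),
# inverting cause and effect, which is where spurious links creep in.
print(autoregressive_factorization(["frost", "bake"]))
```

The point is that nothing in the left-to-right scheme knows which direction is causal; it simply conditions on whatever happens to sit to the left.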
The Solution: Giving the Robot a Map
The researchers realized that to fix this, they needed to give the robot a map of the kitchen (a Causal Structure). This map shows exactly which ingredients depend on which others.
They tried two new ways to help the robot:
1. The "Perfect Map" Strategy (DAG-Aware)
Imagine you have a perfect, complete map of the kitchen showing every single dependency.
- How it works: Instead of just reading left-to-right, the robot looks at the map. It says, "Okay, I need to bake the cake before I can frost it." It only looks at the ingredients that actually cause the next step.
- The Result: The robot bakes a perfect fake cake. The data is high-quality, and the cause-and-effect relationships in the fake data match the real ones.
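The "perfect map" idea can be sketched as sampling in topological order, where each variable is drawn conditioned only on its causal parents. This is a toy version under the assumption that the paper's DAG-aware strategy conditions on parents rather than on column position; the hand-written samplers stand in for TabPFN's learned conditionals:

```python
import random
from graphlib import TopologicalSorter

def sample_from_dag(parents, samplers):
    """Sample one synthetic row by walking the DAG causes-first.

    parents:  node -> list of its causal parents
    samplers: node -> function(parent_values) -> sampled value
    """
    order = TopologicalSorter(parents).static_order()  # parents come first
    row = {}
    for node in order:
        parent_values = {p: row[p] for p in parents[node]}
        row[node] = samplers[node](parent_values)
    return row

# Toy kitchen DAG: heat -> baked -> frosted
parents = {"heat": [], "baked": ["heat"], "frosted": ["baked"]}
samplers = {
    "heat": lambda pa: random.uniform(150, 200),   # root cause: oven temp
    "baked": lambda pa: pa["heat"] > 160,          # effect of its cause only
    "frosted": lambda pa: pa["baked"],             # frost only a baked cake
}
row = sample_from_dag(parents, samplers)
```

Because the walk is causes-first, the sampler can never be asked to guess "flour from frosting": by construction, every value is generated after the values it depends on.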
2. The "Sketchy Map" Strategy (CPDAG-Based)
In the real world, we rarely have a perfect map. Sometimes we only know some connections (e.g., we know "Heat causes Cake," but we aren't sure if "Sugar" causes "Flour" or the other way around).
- How it works: The robot uses a "sketchy map" (called a CPDAG). For the parts of the map that are clear, it follows the rules. For the blurry parts where the direction is unknown, it falls back to its old habit of just reading left-to-right.
- The Result: It's not as perfect as the "Perfect Map," but it's still much better than having no map at all. It prevents the biggest mistakes, even if the map isn't 100% complete.
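The "sketchy map" fallback can be sketched the same way: keep the directed edges of the CPDAG as real parent relationships, and orient each undirected (blurry) edge by the original column order, mirroring the old left-to-right habit. This is an illustrative reading of the paper's CPDAG-based strategy, with hypothetical names:

```python
def orient_cpdag(columns, directed, undirected):
    """Turn a partial causal map (CPDAG) into a usable parent map.

    Directed edges (cause, effect) are kept as-is; each undirected edge
    is oriented from the earlier column to the later one, the same
    left-to-right fallback the plain sampler uses everywhere.
    """
    parents = {c: set() for c in columns}
    for cause, effect in directed:
        parents[effect].add(cause)
    position = {c: i for i, c in enumerate(columns)}
    for a, b in undirected:
        earlier, later = sorted((a, b), key=position.get)
        parents[later].add(earlier)  # column-order fallback for blurry edges
    return parents

cols = ["sugar", "flour", "heat", "cake"]
parents = orient_cpdag(
    cols,
    directed=[("heat", "cake")],       # known: heat causes cake
    undirected=[("sugar", "flour")],   # unknown direction: fall back to order
)
```

The design trade-off is visible in the two loops: wherever the map is clear, the sampler gets a true parent; only the genuinely ambiguous edges inherit the arbitrary column-order direction, which is why the result beats having no map at all.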
Why This Matters: The "Fake Patient" Test
The researchers tested this on a very important scenario: Medical Research.
Imagine you are testing a new drug. You have a small group of real patients. You want to generate thousands of "fake patients" to see how the drug works without hurting real people.
- Without the fix: If the robot gets the order wrong, it might create fake patients where the drug seems to cure a disease, but only because the robot confused the cause and effect. This could lead to dangerous medical decisions.
- With the fix: The robot respects the true cause-and-effect relationships. The fake patients behave realistically. If the drug works in the fake data, it's much more likely to work in the real world.
The Bottom Line
The paper shows that order matters. Just like you can't build the roof before the foundation, an AI shouldn't guess an effect before its cause.
By teaching the AI to respect the causal structure (the "why" and "how" of the data) rather than just the order of the columns, the researchers made the fake data much more reliable. It's like giving the robot a chef's intuition instead of just a list of words to memorize. This ensures that when we use AI to simulate the future, we aren't just making up stories; we are building a realistic model of the world.