Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals

This paper introduces "behavioral fidelity" as a critical new evaluation dimension for synthetic tabular data, demonstrating through a formal taxonomy and benchmarking that existing generators structurally fail to preserve essential temporal, velocity, and multi-account fraud patterns required for real-world detection systems.

Bhavana Sajja

Published 2026-04-16

Imagine you are a bank trying to catch a thief. You don't just look at what they bought; you look at how they bought it. Did they swipe their card 50 times in one minute? Did they buy a $5,000 TV and a $5 coffee in the same second? Did they use the same Wi-Fi router as 20 other people who just bought gift cards?

These "behavioral fingerprints" are how banks catch fraud.

Now, imagine you want to train your AI to catch these thieves, but you can't show it real bank data because of privacy laws (like GDPR). So, you ask a computer program to make up fake bank data that looks just like the real thing. This is called "Synthetic Data."

The Big Problem:
This paper argues that the current "fake data makers" are terrible at copying the behavior. They are great at copying the stats, but they fail to copy the story.

Here is the breakdown using simple analogies:

1. The "Statistical Mirror" vs. The "Behavioral Ghost"

Think of the current fake data generators as photocopiers.

  • What they do well (Statistical Fidelity): If you have a bag of 1,000 marbles (990 blue, 10 red), the photocopier makes a new bag with 990 blue and 10 red marbles. It gets the counts right. It also gets the average size of the marbles right.
  • What they fail at (Behavioral Fidelity): In the real world, the 10 red marbles (the fraudsters) are usually clumped together in a tight, frantic pile because they are acting fast. The fake data maker scatters those 10 red marbles randomly throughout the bag.
  • The Result: The AI trained on the fake data thinks fraudsters are calm and scattered. When it goes to the real world, it misses the frantic, clumped-up thieves because it was never taught to look for that specific "panic pattern."
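The "clumping" is easy to see in code. Here is a minimal sketch with toy timestamps (all numbers invented for illustration): a real burst of fraud packs its events seconds apart, while a generator that scatters events across the day produces much larger gaps.

```python
import random

random.seed(0)

# Toy timestamps in seconds. Real fraud events clump into a tight burst;
# a generator that ignores behavior tends to scatter them across the day.
real_fraud = [1000 + i * 2 for i in range(10)]               # 10 events, 2 s apart
synthetic_fraud = sorted(random.uniform(0, 86400) for _ in range(10))

def median_gap(timestamps):
    """Median inter-event time of a sorted list of timestamps."""
    gaps = sorted(b - a for a, b in zip(timestamps, timestamps[1:]))
    return gaps[len(gaps) // 2]

print(median_gap(real_fraud))       # 2.0 seconds -- a frantic burst
print(median_gap(synthetic_fraud))  # typically far larger -- scattered
```

Both datasets have the same count of "red marbles"; only the spacing differs, and that spacing is exactly what the detector needs to learn.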

2. The Three Layers of Testing

The authors created a new way to test these fake data makers, like a three-level video game:

  • Level 1: The Look (Statistical Fidelity): Does the fake data look like the real data on a spreadsheet? (e.g., Are the average transaction amounts the same?)
    • Verdict: The generators pass this easily.
  • Level 2: The Test Score (Downstream Utility): If we train a fraud detector on the fake data, does it get a good grade on a test?
    • Verdict: The generators pass this too! They get high scores.
  • Level 3: The Behavior (Behavioral Fidelity): Does the fake data actually act like a fraudster?
    • Verdict: CATASTROPHIC FAILURE. This is the paper's central finding: the generators fail badly here.

3. The Four "Behavioral Crimes" They Missed

The authors defined four specific ways fraudsters behave that the fake data completely destroys:

  • P1: The "Speed Burst" (Inter-Event Time): Real fraudsters act fast. They buy, buy, buy in seconds.
    • The Fake Data: The fake transactions are spaced out randomly, like a normal person shopping over a week. The "burst" is gone.
  • P2: The "Sprint and Vanish" (Burst Structure): Fraudsters often do a quick burst of activity and then disappear for days.
    • The Fake Data: The fake accounts are either always active or never active. They don't have that "sprint" rhythm.
  • P3: The "Secret Club" (Shared Infrastructure): Real fraud rings often share devices (like one laptop used by 50 different fake accounts).
    • The Fake Data: The generators give every single fake account its own unique, brand-new laptop. The "Secret Club" structure is completely erased.
    • Analogy: It's like a detective trying to find a gang of thieves, but the fake data shows every thief using a different, unconnected car. The detective can't see the connection.
  • P4: The "Speed Limit" (Velocity Rules): Banks have rules like "If you buy more than 3 items in 1 hour, we flag you."
    • The Fake Data: Because the fake data is so spread out, these rules almost never trigger. If you tune your alarm system based on this fake data, you will set the sensitivity way too low, and the real thieves will walk right past your alarm.
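These patterns are all things a bank can compute directly from a transaction log. Below is a toy sketch of P1, P3, and P4 on invented data (P2 is just P1's gap list viewed over a longer horizon). The field names and the "3 events in 60 seconds" rule are illustrative assumptions, not the paper's definitions.

```python
# Toy transaction log: (account, device, timestamp_seconds).
log = [
    ("acct1", "dev_A", 0), ("acct1", "dev_A", 10), ("acct1", "dev_A", 20),
    ("acct2", "dev_A", 30), ("acct3", "dev_A", 40),  # same device = fraud ring
    ("acct4", "dev_B", 5000),
]

# P1: inter-event times per account -- small gaps mean a "speed burst".
def gaps(timestamps):
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

acct1_gaps = gaps([t for a, _, t in log if a == "acct1"])
print(acct1_gaps)  # [10, 10] -- buy, buy, buy in seconds

# P3: shared infrastructure -- how many distinct accounts touch each device?
accounts_per_device = {d: len({a for a, dd, _ in log if dd == d})
                       for _, d, _ in log}
print(accounts_per_device["dev_A"])  # 3 accounts on one "laptop"

# P4: velocity rule -- "3 or more events on one device within 60 seconds".
dev_a_times = sorted(t for _, d, t in log if d == "dev_A")
triggered = any(dev_a_times[i + 2] - dev_a_times[i] <= 60
                for i in range(len(dev_a_times) - 2))
print(triggered)  # True -- the rule fires on this burst
```

Run the same three computations on row-independent synthetic data and, per the paper's argument, the gaps balloon, every device maps to one account, and the velocity rule goes silent.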

4. The "Row-Independence" Trap

Why do these generators fail? The paper explains a fundamental flaw in how they work.

Imagine a writer trying to write a novel where every character is a bank customer.

  • Current Generators: They write one page at a time, completely forgetting the previous page. They write "Customer A bought a shoe." Then they write "Customer B bought a hat." They don't know that Customer A and B are actually the same person acting fast, or that they are using the same computer.
  • The Problem: Because they generate each row (transaction) independently, they physically cannot create the complex connections (like shared devices) or the time-based rhythms (like bursts) that define fraud.

Even the most advanced "Autoregressive" generator (which tries to look at the previous item in the same row) couldn't fix this. It's like trying to understand a conversation by only reading one sentence at a time without remembering the previous sentence.

5. The Conclusion: Don't Trust the Fake Data Yet

The paper concludes that we cannot currently use synthetic data to train fraud detection systems if those systems rely on timing, speed, or shared devices.

  • The Good News: The authors released their "Behavioral Fidelity" test kit as open source. Now, anyone can test if a new fake data generator actually preserves these "behavioral ghosts."
  • The Bad News: Until we invent a new type of AI that can remember the "story" of a customer across multiple transactions and devices, using fake data for fraud detection is like training a security guard with a map of a city that has no streets—only random dots.

In short: The current fake data looks like the real thing, but it doesn't act like the real thing. And in the world of fraud, action is everything.
