Leakage Safe Graph Features for Interpretable Fraud Detection in Temporal Transaction Networks

This paper proposes a leakage-safe, time-respecting graph feature extraction protocol for temporal transaction networks that, when combined with transaction attributes, significantly enhances the interpretability and performance of illicit entity classification while preventing look-ahead bias.

Hamideh Khaleghpour, Brett McKinney

Published Tue, 10 Ma

Imagine you are a detective trying to catch a group of thieves in a bustling, high-tech city. The city is a massive network of people sending money to each other every second. Your job is to spot the bad guys before they disappear.

This paper is about building a smarter, fairer, and more honest way for your detective team to do their job.

Here is the story of the paper, broken down into simple concepts:

1. The Problem: The "Crystal Ball" Trap

Traditionally, when computers try to find fraud, they look at two things:

  • The Person: "Did this person send a weird amount of money at 3 AM?" (Transaction attributes).
  • The Neighborhood: "Is this person connected to a bunch of other suspicious people? Are they the center of a huge money hub?" (Graph structure).

The Trap: The problem is that many computer models cheat. They act like they have a crystal ball. When analyzing a transaction that happened on Monday, the model might accidentally peek at data from Friday to decide if Monday's transaction was bad.

In the real world, you can't use Friday's information to solve Monday's crime. If you do, your model looks amazing in the lab but fails miserably when deployed in the real world. This is called "Look-Ahead Bias" or "Leakage."

2. The Solution: The "Time-Traveler's Rule"

The authors of this paper created a strict rule: "You can only use information that existed before the moment you are analyzing."

They built a system that acts like a detective who is strictly forbidden from reading tomorrow's newspaper.

  • If they are investigating a transaction at 10:00 AM, they can only look at the network connections that happened up to 10:00 AM.
  • They ignore everything that happens at 10:01 AM or later.

This ensures that when they test their system on "future" data, the results are honest and actually work in the real world.
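The rule above boils down to a filter: before computing any feature for a transaction, throw away every edge that hadn't happened yet. Here is a minimal Python sketch of that idea (the `Edge` record and the unit of `timestamp` are illustrative assumptions, not details from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str        # sender account
    dst: str        # receiver account
    amount: float
    timestamp: float  # any monotonically increasing time unit

def edges_visible_at(edges, cutoff):
    """Return only edges that occurred strictly before `cutoff`.

    Any graph feature for a transaction at time `cutoff` must be
    computed on this filtered view: this is the no-look-ahead rule.
    """
    return [e for e in edges if e.timestamp < cutoff]

edges = [
    Edge("a", "b", 10.0, 1.0),
    Edge("b", "c", 5.0, 2.0),
    Edge("c", "a", 7.0, 3.0),  # happens "in the future" for cutoff 2.5
]
visible = edges_visible_at(edges, cutoff=2.5)
```

For the 10:00 AM example in the text, `cutoff` would be 10:00 AM, and the edge at 10:01 AM would simply never enter the feature computation.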

3. The Tools: Mapping the City

To catch the thieves, the team built a map of the city's money flow. They didn't just look at single transactions; they looked at the shape of the network. They used several "structural descriptors" (fancy ways of saying "map features"):

  • Degree Statistics: How many people is this person talking to? (Are they a loner or a social butterfly?)
  • PageRank & HITS: Is this person a "Hub" (an account that sends money to many important players) or an "Authority" (an account that many hubs send money to)? PageRank gives a similar overall "importance" score based on who sends money to whom.
  • K-Core: Is this person part of a tight-knit, exclusive club of suspicious actors?

They calculated these features only using the past, ensuring no cheating.
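To make "map features" concrete, here is a pure-Python sketch of two of the descriptors above: degree statistics and a power-iteration PageRank. (This is an illustrative implementation, not the authors' code; HITS and k-core are omitted for brevity, and in practice the edge list passed in would already be time-filtered as described earlier.)

```python
from collections import defaultdict

def degree_stats(edges):
    """In-degree and out-degree per node, from (src, dst) pairs."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    nodes = set(indeg) | set(outdeg)
    return {n: (indeg[n], outdeg[n]) for n in nodes}

def pagerank(edges, damping=0.85, iters=50):
    """PageRank by power iteration on a directed edge list."""
    out = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out[src].append(dst)
        nodes |= {src, dst}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for t in out[v]:
                    nxt[t] += share
            else:  # dangling node: spread its rank uniformly
                share = damping * rank[v] / n
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank

edges = [("a", "b"), ("a", "c"), ("b", "c")]
degrees = degree_stats(edges)   # c receives from both a and b
ranks = pagerank(edges)         # c should accumulate the most rank
```

Each node's resulting numbers become columns in the feature table fed to the classifier, alongside the raw transaction attributes.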

4. The Experiment: The "Future Test"

They tested their system using a famous dataset called Elliptic (which tracks Bitcoin transactions). They split the data like this:

  • Training: They taught the computer using data from the past (up to day 34).
  • Validation: They tweaked the settings using days 35–41.
  • The Real Test: They asked the computer to predict fraud for days 42 and beyond, which it had never seen before.
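The split above is purely chronological, which is what keeps the test honest. A minimal sketch (assuming each transaction row carries a `time_step` field, as in the Elliptic dataset):

```python
def temporal_split(rows, train_end=34, val_end=41):
    """Split rows into chronological folds by their time step.

    Cutoffs follow the paper's Elliptic split: train on steps
    1-34, validate on 35-41, test on 42 and beyond.
    """
    train = [r for r in rows if r["time_step"] <= train_end]
    val = [r for r in rows if train_end < r["time_step"] <= val_end]
    test = [r for r in rows if r["time_step"] > val_end]
    return train, val, test

rows = [{"time_step": t} for t in (1, 34, 35, 41, 42, 49)]
train, val, test = temporal_split(rows)
```

Contrast this with a random shuffle split, which would scatter "future" transactions into the training set and reintroduce exactly the leakage the paper is designed to prevent.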

The Results:

  • The "Honest" Score: The model achieved a score of about 0.85 (on a scale where 1.0 is perfect). This is a very strong score, proving that looking at the network structure without cheating actually helps catch fraud.
  • The "Cheater" Warning: If they had used the "crystal ball" method (looking at the whole graph), the score would have been artificially high and useless in reality.

5. The Twist: The "Main Character" vs. The "Sidekick"

Here is a surprising finding:

  • The Main Character: The specific details of the transaction (how much money, when, where) were still the most important factor in catching fraud.
  • The Sidekick: The network features (the map of connections) didn't catch more fraud on their own, but they provided crucial context.

The Analogy: Imagine a suspect is caught.

  • Transaction Data says: "He bought a plane ticket to a tax haven." (This is the smoking gun).
  • Graph Data says: "He bought that ticket while sitting in a room with 50 other people who are all buying tickets to the same place." (This explains why it's suspicious and helps the detective understand the bigger picture).

Even if the network data didn't change the final "guilty/not guilty" verdict much, it gave the human investigator a better story to tell and a clearer reason to investigate.

6. The Final Polish: Calibrating the "Risk Meter"

Finally, the paper talks about Probability Calibration.
Sometimes, a computer says, "There is a 90% chance this is fraud." But in reality, it might only be a 50% chance. This is dangerous because investigators might waste time on false alarms.

The authors "calibrated" the model (like tuning a radio) so that when it says "90%," it really means "90%." This makes the risk scores reliable enough for real-world decision-making.
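One simple way to check whether a "90%" really means 90% is a reliability (calibration) table: bin predictions by their stated probability and compare each bin's average prediction with the fraud rate actually observed there. This sketch is an illustrative diagnostic, not the specific calibration method used in the paper:

```python
def reliability_bins(probs, labels, n_bins=5):
    """Compare predicted probability with observed fraud rate per bin.

    Returns (mean_predicted, observed_rate, count) for each non-empty
    bin. A well-calibrated model has the first two numbers close.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0
        bins[idx].append((p, y))
    report = []
    for contents in bins:
        if contents:
            mean_p = sum(p for p, _ in contents) / len(contents)
            observed = sum(y for _, y in contents) / len(contents)
            report.append((mean_p, observed, len(contents)))
    return report

# Toy scores: low-confidence predictions on legit cases,
# high-confidence predictions on fraud cases.
probs = [0.05, 0.15, 0.85, 0.95]
labels = [0, 0, 1, 1]
report = reliability_bins(probs, labels)
```

If a bin shows mean prediction 0.9 but an observed rate of 0.5, that is exactly the dangerous "overconfident risk meter" the paper warns about, and a calibration step is applied to fix it.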

Summary

This paper teaches us that in the fight against financial fraud:

  1. Don't peek at the future. Build models that respect time.
  2. Look at the connections. Even if individual transactions are the main clue, the network map helps explain the "why" and "how."
  3. Be honest about the odds. Make sure your risk scores actually match reality so investigators know when to act.

It's a blueprint for building a fraud detection system that is not just smart, but also trustworthy and ready for the real world.