ERP-RiskBench: Leakage-Safe Ensemble Learning for Financial Risk

This paper introduces ERP-RiskBench, a leakage-safe ensemble learning framework for detecting financial risks in ERP systems. It combines hybrid risk definitions, rigorous time- and group-aware validation to prevent data leakage, and a stacked gradient-boosting ensemble, delivering fraud-detection results that are reproducible, interpretable, and operationally grounded.

Sanjay Mishra

Published 2026-03-10

Imagine a massive, bustling city called ERP City. In this city, every purchase, payment, and shipment is recorded in a giant, digital ledger. This is an Enterprise Resource Planning (ERP) system. It's the brain of a company.

But, like any big city, ERP City has a problem: thieves and rule-breakers. Sometimes, people try to steal money, fake invoices, or sneak purchases through the cracks. The company needs a Security Guard to spot these bad actors before they cause damage.

This paper is about building a super-smart, leak-proof Security Guard using Artificial Intelligence (Machine Learning).

Here is the story of how they built it, explained simply:

1. The Problem: The "Leaky Bucket" of Past Research

In the past, researchers tried to build these AI guards, but they made a big mistake. They were like chefs who tasted the soup before serving it to the customers.

  • The Mistake: They let the AI "peek" at the test answers while it was studying. In technical terms, this is called Data Leakage.
  • The Result: The AI looked amazing in the lab (99% accuracy!) but failed miserably in the real world because it had cheated during training.
  • The Goal: This paper says, "Let's stop cheating. Let's build a system that is honest, strict, and actually works."

2. The Solution: The "Leak-Safe" Training Camp

The authors built a new training ground called ERP-RiskBench. Think of it as a rigorous boot camp for AI models.

  • The Rule of "No Peeking": They used a method called Nested Cross-Validation. Imagine a teacher giving a student a practice test. The student studies, takes the test, and then the teacher grades it. But here, the teacher makes sure the student never sees the real final exam questions until the very end.
  • Time Travel Rules: In the real world, you can't predict the future. So, the AI is only allowed to learn from the "past" to predict the "future." It can't look at next month's data to guess this month's fraud.
  • The "Synthetic" City: Since real fraud data is secret (like a bank vault), they built a fake city (Synthetic Data) that looks exactly like the real one. They planted fake thieves in it to train the AI without risking real money.
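The "Time Travel Rule" above is easy to sketch in code. Here is a minimal, hypothetical illustration (the field names and cutoff date are made up, not from the paper): instead of shuffling records randomly, we train only on transactions posted *before* a cutoff and hold out everything after it as "the future."

```python
from datetime import date

# Hypothetical labeled transactions, each with a posting date.
transactions = [
    {"id": 1, "posted": date(2024, 1, 15), "amount": 120.0, "fraud": 0},
    {"id": 2, "posted": date(2024, 2, 3),  "amount": 980.0, "fraud": 0},
    {"id": 3, "posted": date(2024, 3, 22), "amount": 455.0, "fraud": 1},
    {"id": 4, "posted": date(2024, 4, 9),  "amount": 310.0, "fraud": 0},
    {"id": 5, "posted": date(2024, 5, 18), "amount": 770.0, "fraud": 1},
]

def temporal_split(rows, cutoff):
    """Train only on records strictly before the cutoff date;
    everything on or after it is held out as 'the future'."""
    rows = sorted(rows, key=lambda r: r["posted"])
    train = [r for r in rows if r["posted"] < cutoff]
    test = [r for r in rows if r["posted"] >= cutoff]
    return train, test

train, test = temporal_split(transactions, date(2024, 4, 1))
print(len(train), len(test))  # 3 2
```

A random shuffle would mix April rows into the training set, letting the model "see the future"; the temporal split makes that structurally impossible.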

3. The Team: The "All-Star" Squad

Instead of hiring just one security guard, they built a Stacking Ensemble.

  • The Analogy: Imagine a detective team. You have a specialist in fingerprints (XGBoost), a specialist in alibis (LightGBM), and a specialist in patterns (Random Forest).
  • The Coach: They hired a "Coach" (a Meta-Learner) who listens to all the specialists. If the fingerprint guy says "Guilty," but the alibi guy says "Innocent," the Coach weighs the evidence and makes the final call.
  • The Result: This team worked better than any single detective could on their own.

4. The Metrics: Why "Accuracy" is a Trap

The paper argues that "Accuracy" is a bad scorecard for this job.

  • The Analogy: Imagine a security guard who sleeps all day and only wakes up to say "No one is stealing." If 99% of people are honest, this guard is 99% accurate! But he's useless because he missed the 1% of thieves.
  • The Better Score: They used MCC and AUPRC. Think of these as "Did you catch the bad guys without arresting too many innocent people?" It's a much harder, fairer test.
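The "sleeping guard" can be reproduced in a few lines. With 99 honest transactions and 1 fraud, a model that always says "no fraud" scores 99% accuracy, yet MCC correctly gives it a zero and AUPRC collapses to the 1% base rate:

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             matthews_corrcoef)

# 99 honest transactions, 1 fraudulent one.
y_true = [0] * 99 + [1]
# The "sleeping guard": always predicts 'no fraud'.
y_lazy = [0] * 100

print(accuracy_score(y_true, y_lazy))          # 0.99 — looks great
print(matthews_corrcoef(y_true, y_lazy))       # 0.0  — the guard adds nothing
print(average_precision_score(y_true, y_lazy)) # 0.01 — the chance baseline
```

This is exactly why the paper treats accuracy as a trap on imbalanced data: the lazy baseline already "wins" on accuracy while being useless by any honest measure.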

5. The Findings: What Actually Works?

After running thousands of simulations, here is what they discovered:

  • The Split Matters Most: The most important thing wasn't the fancy AI algorithm; it was how they split the data. If you split data randomly (like shuffling a deck of cards), the AI cheats. If you split it by time (past vs. future), the AI learns the truth.
  • The "Three-Way Match" is the Smoking Gun: The AI learned that the most common way to catch a thief is looking at Three-Way Matching.
    • Analogy: Did the Purchase Order (what we ordered) match the Delivery Receipt (what we got) and the Invoice (what we were billed)? If these three don't line up perfectly, ALARM!
  • Deep Learning vs. Simple Trees: Fancy, complex "Deep Learning" models (like giant neural networks) didn't beat the simpler, faster "Tree" models. Sometimes, a sturdy oak tree is better than a fragile glass tower.
  • Explainability: The AI didn't just say "Fraud!" It said, "Fraud! Because the invoice amount was $500 higher than the delivery receipt." This is crucial because human auditors need to know why they are investigating.
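The three-way match check is simple enough to write out directly. This is an illustrative sketch, not the paper's feature code: the function name, tolerance, and field choices are hypothetical, and the tolerance exists only to absorb rounding differences.

```python
def three_way_flags(po_amount, receipt_amount, invoice_amount, tol=0.01):
    """Flag mismatches between the purchase order, the goods receipt,
    and the invoice. `tol` is a relative tolerance for rounding noise."""
    def mismatch(a, b):
        return abs(a - b) > tol * max(abs(a), abs(b), 1.0)
    return {
        "po_vs_receipt": mismatch(po_amount, receipt_amount),
        "receipt_vs_invoice": mismatch(receipt_amount, invoice_amount),
        "invoice_overbilled_by": round(invoice_amount - receipt_amount, 2),
    }

# Ordered 1,000; received 1,000; billed 1,500 → ALARM.
flags = three_way_flags(1000.0, 1000.0, 1500.0)
print(flags)
# {'po_vs_receipt': False, 'receipt_vs_invoice': True,
#  'invoice_overbilled_by': 500.0}
```

Features like `invoice_overbilled_by` are also what make the explanations concrete: the model can point at "billed 500 more than received" rather than an opaque score.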

6. The Real-World Impact

The paper ends with a blueprint for how to actually use this in a company:

  1. Extract data from the ERP system.
  2. Score transactions with the AI.
  3. Calibrate the AI so it knows the difference between a "maybe" and a "definitely."
  4. Send the top suspects to human auditors with a clear explanation (like a police report).
  5. Learn from the auditors' decisions to get smarter next time.
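Steps 3 and 4 of the blueprint — calibrate, then rank suspects for auditors — can be sketched with scikit-learn's calibration wrapper. This is a minimal illustration under stated assumptions (toy data, a random-forest base model, and a top-5 review budget chosen for the example), not the paper's deployment code.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for scored ERP transactions.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Step 3: calibration maps raw scores to honest probabilities, so a 0.9
# really means "roughly 9 in 10 transactions like this are fraud" —
# the difference between a "maybe" and a "definitely."
base = RandomForestClassifier(n_estimators=50, random_state=1)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X_tr, y_tr)

# Step 4: rank held-out transactions and send the top 5 to auditors.
probs = calibrated.predict_proba(X_te)[:, 1]
top5 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:5]
print(top5)
```

In a real deployment each flagged index would carry its explanation (the "police report"), and the auditors' verdicts on these cases would feed step 5, the retraining loop.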

The Big Takeaway

This paper is a wake-up call. It says: "Stop trying to make AI look good on paper by cheating with data splits. Start building systems that are honest, explainable, and actually catch the bad guys in the real world."

It's not about having the most complex math; it's about having the most disciplined process. Just like a good audit, the process matters more than the tool.