Transfer Learning for Loan Recovery Prediction under Distribution Shifts with Heterogeneous Feature Spaces

This paper introduces FT-MDN-Transformer, a mixture-density tabular Transformer architecture that uses transfer learning to improve loan recovery rate forecasting in data-scarce target domains with heterogeneous feature spaces. It outperforms baselines under covariate and conditional distribution shifts while providing probabilistic, portfolio-level risk insights.

Christopher Gerling, Hanqiu Peng, Ying Chen, Stefan Lessmann

Published 2026-04-06

Imagine you are a bank manager trying to predict how much money you will get back if a borrower defaults (stops paying). This is called the Recovery Rate.

Usually, you'd look at your own bank's history to make this prediction. But here's the problem: You don't have enough data. Defaults are rare events. It's like trying to learn how to fly a plane by watching only two crash videos. You need more examples to learn the rules.

So, you decide to borrow knowledge from a bigger, richer bank (the "Source") that has thousands of default stories. This is called Transfer Learning.

However, there's a catch:

  1. The Data is Different: The big bank tracks 100 different details about loans (like collateral type, industry, etc.), while your bank only tracks 30. Some details your bank has, the big bank doesn't.
  2. The Rules Might Have Changed: The big bank deals mostly with secured loans (backed by houses), while you deal with unsecured bonds. The "rules" of how much money is recovered might be different.

This paper introduces a new AI tool called FT-MDN-Transformer to solve these exact problems. Here is how it works, explained simply:

1. The "Universal Translator" (Handling Different Features)

Imagine the big bank writes its stories in English, and your bank writes in French. Most AI models can't read both; they need the exact same words.

This new model acts like a Universal Translator.

  • It treats every piece of information (like "loan amount" or "industry") as a separate "token" (a word).
  • If the big bank mentions "Collateral Type A" but you don't have that category, the model just puts a "mask" over it and ignores it, rather than crashing.
  • If you have a new category the big bank never saw, the model learns it on the fly while keeping the knowledge it already has.
  • The Analogy: It's like a chef who learned to cook with a specific set of spices in a big kitchen. When they move to a small kitchen with a different spice rack, they don't throw away their skills. They use what they have, ignore the missing spices, and learn the new ones without forgetting the old ones.
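The "translator" idea above can be sketched in a few lines. This is a toy illustration, not the paper's actual code: the feature names, the dictionary-of-embeddings design, and the random stand-in vectors are all our assumptions about how per-feature tokenization with masking could work.

```python
import numpy as np

D_MODEL = 8
rng = np.random.default_rng(0)

# Per-feature embeddings keyed by name (random stand-ins for learned ones),
# so source and target banks can share whichever columns they have in common.
embeddings = {name: rng.standard_normal(D_MODEL)
              for name in ["loan_amount", "industry", "collateral_type_a", "seniority"]}

def tokenize(row: dict) -> tuple[np.ndarray, list[str]]:
    """Turn one loan record into a (n_tokens, D_MODEL) array of feature tokens.

    Features the model knows but the row lacks are simply skipped (masked);
    features the row has but the model never saw get a fresh embedding that
    can then be learned during fine-tuning, without touching the old ones.
    """
    tokens, used = [], []
    for name, value in row.items():
        if name not in embeddings:               # brand-new target-bank feature
            embeddings[name] = rng.standard_normal(D_MODEL)
        tokens.append(value * embeddings[name])  # value-scaled feature token
        used.append(name)
    return np.stack(tokens), used

# A target-bank record that lacks "collateral_type_a" but adds "bond_rating":
tokens, used = tokenize({"loan_amount": 1.2, "industry": 0.5, "bond_rating": -0.3})
print(tokens.shape)  # (3, 8)
```

The key property is that nothing here depends on both banks having the same column set: missing columns never enter the token sequence, and new ones extend it.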

2. The "Weather Forecaster" (Predicting Distributions, Not Just Numbers)

Most AI models try to give you a single number: "You will recover 60% of the money." This is like a weather app saying, "It will be 72°F."

But in reality, recovery rates are chaotic. Sometimes you get 0%, sometimes 100%, and rarely 50%. The distribution is "bimodal" (two peaks).

  • This new model doesn't just guess a number. It acts like a Weather Forecaster.
  • Instead of saying "It will be 72°F," it says: "There is a 40% chance it will be freezing (0% recovery), a 40% chance it will be hot (100% recovery), and a 20% chance it will be mild."
  • Why this matters: For a bank, knowing the risk of a total loss (the freezing scenario) is more important than knowing the average temperature. This model gives you the full picture of the risk, not just a single, potentially misleading average.
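Numerically, the forecaster analogy looks like this. The sketch below uses a three-component Gaussian mixture clipped to [0, 1]; the component count, weights, means, and spreads are invented for illustration and are not the paper's parameterization.

```python
import numpy as np

# Invented mixture parameters: "freezing" (~0% recovery), "hot" (~100%), "mild".
weights = np.array([0.40, 0.40, 0.20])
means   = np.array([0.05, 0.95, 0.50])
stds    = np.array([0.05, 0.05, 0.15])

def mixture_pdf(r: float) -> float:
    """Density of the recovery rate r under the 3-component mixture."""
    comp = np.exp(-0.5 * ((r - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(weights @ comp)

def prob_total_loss(threshold: float = 0.10, n: int = 20_000, seed: int = 0) -> float:
    """Monte-Carlo estimate of P(recovery < threshold) -- the tail risk
    that a single point forecast like 'you'll recover 52%' hides."""
    rng = np.random.default_rng(seed)
    k = rng.choice(3, size=n, p=weights)
    samples = np.clip(rng.normal(means[k], stds[k]), 0.0, 1.0)
    return float((samples < threshold).mean())

print(round(prob_total_loss(), 2))
```

Under these invented numbers, roughly a third of scenarios end near total loss, even though the mixture's mean sits around 50% — exactly the gap between an average and a full risk profile.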

3. The "Student and Mentor" (How the Learning Works)

The model uses a two-step training process:

  1. Pre-training (The Mentor): The model studies the massive dataset from the big bank first. It learns general patterns about how loans work.
  2. Fine-tuning (The Student): The model then moves to your small bank. It takes what it learned from the big bank and "fine-tunes" it using your limited data.
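The two-step recipe can be demonstrated with a deliberately tiny stand-in: a linear model trained by gradient descent instead of a Transformer, with synthetic "source bank" and "target bank" data. Everything here (data sizes, learning rate, the 0.1 shift) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_fit(X, y, w=None, lr=0.05, epochs=200):
    """Plain gradient descent on squared error; a non-None `w` is the
    'mentor's knowledge' carried over as a warm start."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Step 1 -- pre-training on the big source bank (plentiful data).
true_w = np.array([0.8, -0.5, 0.3])
X_src = rng.standard_normal((5000, 3))
y_src = X_src @ true_w + 0.1 * rng.standard_normal(5000)
w_pre = sgd_fit(X_src, y_src)

# Step 2 -- fine-tuning on the small target bank (scarce, slightly shifted rules).
X_tgt = rng.standard_normal((50, 3))
y_tgt = X_tgt @ (true_w + 0.1) + 0.1 * rng.standard_normal(50)
w_scratch = sgd_fit(X_tgt, y_tgt, epochs=20)              # learn from scratch
w_ft = sgd_fit(X_tgt, y_tgt, w=w_pre.copy(), epochs=20)   # warm start from step 1

def mse(w):
    X = rng.standard_normal((2000, 3))
    return float(np.mean((X @ w - X @ (true_w + 0.1)) ** 2))

print(mse(w_ft) < mse(w_scratch))  # warm start wins when target data is scarce
```

With only 20 fine-tuning epochs on 50 records, the warm-started model is already close to the shifted target rules, while the from-scratch model is still far away — the "lifesaver" effect described in the results below.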

The Results:

  • When data is scarce: This method is a lifesaver. It learns much faster and more accurately than trying to learn from scratch with your tiny dataset.
  • When the "Rules" change slightly: It handles it well. If the big bank's data is slightly different (e.g., different interest rates), the model adapts.
  • When the "Rules" change completely: If the big bank's recovery patterns are totally different from yours (e.g., they deal with houses, you deal with bonds), the model struggles. You can't teach a fish to fly just because it knows how to swim. The paper calls this a "Label Shift," and it's the hardest challenge.

The Big Takeaway

This paper proves that you can use AI to learn from other banks' data even if your data looks different, as long as you use the right tools.

  • Old way: "We can't use that data because our columns don't match."
  • New way (FT-MDN-Transformer): "We can use that data! We'll ignore the columns we don't have, learn the new ones, and give you a full risk profile instead of just a guess."

It's a powerful step forward for banks that are small or specialized, allowing them to leverage the collective wisdom of the entire financial world to manage risk better.
