A market resilient data-driven approach to option pricing

Imagine you are trying to predict the price of a warranty for a car. If you own a Ferrari, the warranty costs more than for a Toyota, not just because the car is expensive, but because the risk of something breaking is different. In the financial world, these "warranties" are called options. They are contracts that let you buy or sell a stock at a specific price later on.

Figuring out the "fair price" for these options is a huge puzzle. For decades, mathematicians have used complex formulas (like the famous Black-Scholes model) to solve it. But these formulas assume the market behaves in a perfect, predictable way. Real life is messy, chaotic, and full of surprises (like a pandemic or a sudden crash).

This paper proposes a new way to solve the puzzle: Let the data teach us, but give it a little theoretical nudge.

Here is the breakdown of their idea using simple analogies:

1. The Old Way: The "Same Shape" Rule (Homogeneity Hint)

Imagine you have a recipe for a perfect chocolate cake. You know that if you double the ingredients, you get a cake that is exactly twice as big, but it tastes the same. This is called homogeneity.

The authors looked at a rule in finance that says: If two stocks have similar "flavor profiles" (statistical patterns of how their prices move), you can use the pricing data from one to predict the price of the other.

The Problem: This works great if you are comparing two similar cars (like a Honda and a Toyota). But what if you try to use the Honda data to price a Ferrari? The "flavor profiles" are too different. The model gets confused and fails. This is called a Domain Shift.

2. The New Idea: The "Universal Translator" (Common Representation Space)

The authors realized that while a Honda and a Ferrari drive differently, they both follow the laws of physics. They just need a translator to speak the same language.

They invented a "translator" called the Volatility Scalar.

The Analogy: Imagine you are trying to compare the speed of a snail and a cheetah. If you just look at "meters per second," the numbers are wildly different. But if you measure them in "body lengths per second," they might actually be moving at a similar relative speed.
The Solution: The authors created a mathematical "body length" for stocks. They take the very different price movements of two underlyings (like the NIFTY 50 index and the NIFTY Bank index) and rescale them so they look alike.

Once the data is scaled, a machine learning model can learn the "universal rules" of option pricing from one stock and apply them to another, even if they are totally different.

3. The "Smart Switch" (The Ensemble Model)

The authors didn't just stop at the translator. They built a Smart Switch (an Ensemble Model) that decides which method to use based on the weather.

Normal Days (Typical Data): If the market is calm and behaving like usual, the model trusts the old "Same Shape" rule (Homogeneity Hint). It's simple and works well.
Stormy Days (Atypical Data): If the market is crashing or acting weird (like during the COVID-19 lockdown), the "Same Shape" rule breaks. The model detects this "storm" (using a metric they call the Domain Shift Quotient) and flips the switch to the "Universal Translator" (Domain Shift approach).

Think of it like driving a car:

On a smooth highway, you use Cruise Control (the old method).
When you hit a bumpy, off-road trail, you switch to 4-Wheel Drive (the new method).
The Ensemble Model is the driver who knows exactly when to switch gears so you never get stuck.

4. The Results: Why This Matters

The authors tested this on real Indian stock market data, including the chaotic period of the 2020 pandemic.

The Old Models: Got confused during the pandemic and made big errors.
The New Model: Adapted quickly. By using the "Universal Translator," it could learn from one part of the market and accurately predict prices in another, even when the market was behaving strangely.
The "Super Model": Their final "Smart Switch" model was the best of all. It was accurate on calm days and resilient during the storm.

The Big Takeaway

This paper is a bridge between Math Theory and Data Science.

Data Science says: "Throw all the data at a computer and let it guess."
Math Theory says: "Follow the strict rules of the model."

The authors say: "Let's use the strict rules to organize the data, so the computer can guess better."

They showed that by accounting for how strongly markets move (using the "Volatility Scalar"), we can build AI models that don't just memorize the past, but can actually survive the future, even when the future looks nothing like the past.

Here is a detailed technical summary of the paper "A Market Resilient Data-Driven Approach to Option Pricing" by Anindya Goswami and Nimit Rana.

1. Problem Statement

Option pricing is a central problem in mathematical finance. While traditional methods rely on stochastic models (e.g., Black-Scholes-Merton) to derive theoretical prices, data-driven approaches have emerged that rely solely on observed market data without assuming a specific asset dynamics model.

However, existing data-driven models face a critical limitation: Domain Shift. Most models trained on one underlying (e.g., NIFTY 50) fail to generalize to another (e.g., NIFTY Bank) or to the same underlying during periods of extreme market volatility (e.g., the COVID-19 crash). This occurs because standard data-driven approaches implicitly assume that the distribution of log-returns is identical across assets and time periods, an assumption that rarely holds in real-world markets with significant regime changes.

The paper aims to develop a market-resilient data-driven framework that can:

Bridge the gap between different risk-neutral return distributions of different assets.
Maintain accuracy during "atypical" market conditions (high volatility/domain shifts).
Provide a theoretical foundation for domain adaptation in option pricing.

2. Methodology

The authors propose a three-pronged approach combining theoretical derivations with machine learning (XGBoost).

A. Theoretical Framework: Homogeneity vs. Domain Adaptation

Homogeneity Hint Approach (AHH):
- Based on Theorem 2.2, which states that if two assets have identical conditional laws of log-returns under their respective risk-neutral measures, their normalized option prices (Option Price / Spot Price) are equal for the same moneyness and time-to-maturity.
- Limitation: This requires the return distributions to be identical, which is too restrictive for different assets or volatile periods.
Domain Shift (DS) Approach (ADS):
- To handle different assets with different return distributions, the authors introduce a Volatility Scalar ($\rho$).
- Definition: $\rho$ represents the root-mean-square of the volatility over the option's life. It acts as a scaling factor to normalize the asset price process.
- Mechanism: By scaling the asset price $S$ to $A = S^{1/\rho}$, the authors derive Theorem 2.6, which proves that the risk-neutral distributions of the log-returns of these scaled assets can be made identical even if the original assets differ.
- Approximation: Since exact equality is hard to achieve in practice, the authors use an Implied Volatility (IV) approximation formula (based on Brenner & Subrahmanyam) to construct a Common Representation Space. They define a target variable $U$ (derived from IV and $\rho$) that is invariant across assets with different volatilities, provided they share the same moneyness and time-to-maturity.
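
The two ideas above can be checked numerically in the one setting where everything is known in closed form. The sketch below is not the authors' code: it assumes Black-Scholes-Merton dynamics with a zero interest rate, and the names `bsm_call`, `rms_volatility`, `u_calm`, and `u_storm` are hypothetical. In particular, the rescaled quantity `C / (S * rho * sqrt(T))` is only a stand-in motivated by the Brenner-Subrahmanyam at-the-money approximation $C \approx S\sigma\sqrt{T}/\sqrt{2\pi}$; the paper's exact definition of $U$ is not reproduced here.

```python
import math
from typing import Sequence


def norm_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call(spot: float, strike: float, ttm: float, rate: float, sigma: float) -> float:
    """Black-Scholes-Merton price of a European call on a non-dividend asset."""
    vol_sqrt_t = sigma * math.sqrt(ttm)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma**2) * ttm) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * ttm) * norm_cdf(d2)


def rms_volatility(vol_path: Sequence[float]) -> float:
    """Root-mean-square of a (hypothetical) volatility path over the option's life."""
    return math.sqrt(sum(v * v for v in vol_path) / len(vol_path))


TTM, RATE = 30 / 365, 0.0  # illustrative values: 30 calendar days, zero rate

# (1) Homogeneity: same volatility and moneyness K/S = 1.02, very different spot levels.
small = bsm_call(100.0, 102.0, TTM, RATE, 0.20) / 100.0
large = bsm_call(35_000.0, 35_700.0, TTM, RATE, 0.20) / 35_000.0
print(f"C/S at spot 100:   {small:.6f}")
print(f"C/S at spot 35000: {large:.6f}")

# (2) Limitation: same moneyness (at the money), but the volatility differs.
rho_calm = rms_volatility([0.15, 0.15, 0.15])
rho_storm = rms_volatility([0.45, 0.60, 0.72])
norm_calm = bsm_call(100.0, 100.0, TTM, RATE, rho_calm) / 100.0
norm_storm = bsm_call(100.0, 100.0, TTM, RATE, rho_storm) / 100.0
print(f"C/S calm:  {norm_calm:.6f}")
print(f"C/S storm: {norm_storm:.6f}")

# (3) Rescaling by rho * sqrt(T) removes most of the volatility dependence
#     at the money (Brenner-Subrahmanyam: the ratio is close to 1/sqrt(2*pi)).
u_calm = norm_calm / (rho_calm * math.sqrt(TTM))
u_storm = norm_storm / (rho_storm * math.sqrt(TTM))
print(f"rescaled calm:  {u_calm:.4f}")
print(f"rescaled storm: {u_storm:.4f}")

# (4) Power scaling A = S**(1/rho): a log-return of A is the log-return of S divided by rho.
s0, s1 = 100.0, 103.0
scaled_return = math.log(s1 ** (1 / rho_storm) / s0 ** (1 / rho_storm))
print(f"log-return of A: {scaled_return:.6f}")
```

Step (1) illustrates the homogeneity hint, step (2) shows why it breaks when volatility changes, and steps (3) and (4) show why dividing by a volatility scale brings the two regimes back onto a common footing, at least in this idealized setting.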

B. The Ensemble Model (AE)

Recognizing that AHH works well in "typical" markets (where distributions are stable) and ADS works well in "atypical" markets (where distributions shift), the authors propose an Ensemble Model (AE).

Domain Shift Quotient (DSQ): A metric defined as $|\sigma_i - \sigma_0| / \sigma_0$, where $\sigma_i$ is the current volatility and $\sigma_0$ is the historical training volatility.
Weighting: The ensemble dynamically weights the predictions of AHH and ADS based on the DSQ.
- Low DSQ (Typical market) $\rightarrow$ Higher weight to AHH.
- High DSQ (Atypical market) $\rightarrow$ Higher weight to ADS.
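
A minimal sketch of this weighting is shown below. The DSQ formula follows the definition above, but the summary does not state how the DSQ is mapped to a weight, so the linear ramp in `ads_weight` and its `saturation` parameter are assumptions made purely for illustration, as are all function names.

```python
def domain_shift_quotient(sigma_current: float, sigma_train: float) -> float:
    """DSQ = |sigma_i - sigma_0| / sigma_0, as defined in the text."""
    if sigma_train <= 0:
        raise ValueError("training volatility must be positive")
    return abs(sigma_current - sigma_train) / sigma_train


def ads_weight(dsq: float, saturation: float = 1.0) -> float:
    """Hypothetical weight for the ADS prediction.

    Assumption: a linear ramp from 0 (no shift) to 1 (DSQ >= saturation).
    The paper's actual weighting rule may differ.
    """
    return min(dsq / saturation, 1.0)


def ensemble_price(pred_hh: float, pred_ds: float, dsq: float, saturation: float = 1.0) -> float:
    """Blend the two predictions: low DSQ favours AHH, high DSQ favours ADS."""
    w = ads_weight(dsq, saturation)
    return (1.0 - w) * pred_hh + w * pred_ds


# Illustrative numbers only: training volatility 15%, two test regimes.
for label, sigma_now in (("typical", 0.16), ("atypical", 0.60)):
    dsq = domain_shift_quotient(sigma_now, 0.15)
    price = ensemble_price(pred_hh=10.0, pred_ds=12.0, dsq=dsq)
    print(f"{label}: DSQ={dsq:.2f}, ADS weight={ads_weight(dsq):.2f}, blended price={price:.2f}")
```

A soft blend like this avoids a hard switch between the two models at an arbitrary threshold; a step function or a logistic curve would be equally plausible choices under the same description.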

C. Implementation Details

Algorithm: XGBoost (Extreme Gradient Boosting) for supervised regression.
Features:
- 19 order statistics of centered daily log-returns (capturing the distribution shape).
- Moneyness ($K/S$), Time-to-Maturity (TTM), Risk-free rate.
- Normalized previous option price.
Target Variables:
- For AHH: Normalized Price ($C/S$).
- For ADS: The invariant variable $U$ derived from the IV approximation.
Data: Daily European call option data from the National Stock Exchange (NSE) of India for NIFTY 50 and NIFTY Bank indices (2015–2020).
Test Scenarios:
- Typical: Sept–Dec 2019 (stable market).
- Atypical: Jan–Apr 2020 (COVID-19 crash).
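
To make the feature list concrete, here is one way a single training row could be assembled. This is a sketch under stated assumptions, not the authors' pipeline: the 19 order statistics are approximated by the 5%, 10%, ..., 95% quantiles of a 60-day window, the previous option price is normalized by the previous day's spot, and the price path is simulated. The window length, the quantile choice, and every function name are hypothetical.

```python
import math
import random
import statistics
from typing import List, Sequence


def centered_log_returns(closes: Sequence[float]) -> List[float]:
    """Daily log-returns with their sample mean subtracted."""
    returns = [math.log(b / a) for a, b in zip(closes, closes[1:])]
    mean = statistics.fmean(returns)
    return [x - mean for x in returns]


def feature_row(
    closes: Sequence[float],
    strike: float,
    ttm_years: float,
    risk_free_rate: float,
    prev_option_price: float,
    prev_spot: float,
) -> List[float]:
    """19 distribution-shape features followed by 4 contract features."""
    spot = closes[-1]
    # n=20 yields 19 cut points, used here as a stand-in for the order statistics.
    order_stats = statistics.quantiles(centered_log_returns(closes), n=20)
    return order_stats + [
        strike / spot,                   # moneyness K/S
        ttm_years,                       # time to maturity
        risk_free_rate,                  # risk-free rate
        prev_option_price / prev_spot,   # normalized previous option price (assumed scaling)
    ]


def target_hh(option_price: float, spot: float) -> float:
    """AHH target: the normalized price C/S."""
    return option_price / spot


# Simulated closes for illustration only (61 prices give 60 daily returns).
rng = random.Random(0)
closes = [100.0]
for _ in range(60):
    closes.append(closes[-1] * math.exp(rng.gauss(0.0, 0.01)))

row = feature_row(
    closes,
    strike=1.02 * closes[-1],
    ttm_years=30 / 365,
    risk_free_rate=0.05,
    prev_option_price=1.8,
    prev_spot=closes[-2],
)
print(len(row), "features; moneyness =", round(row[19], 4))
```

Rows like this, paired with the target $C/S$ (or with $U$ for ADS), could then be passed to a gradient-boosted regressor such as the XGBoost model named above.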

3. Key Contributions

Theoretical Proof of Domain Adaptation: The paper provides a rigorous theoretical derivation (Theorem 2.6) showing how to construct a common representation space for option pricing across assets with different risk-neutral distributions using a volatility scalar.
Novel Ensemble Strategy: Introduction of the Domain Shift Quotient (DSQ) to dynamically switch between a homogeneity-based model and a domain-adaptation model, optimizing performance across varying market regimes.
Market Resilience: Demonstration that data-driven models can be made robust to extreme market shocks (like the 2020 crash) by explicitly modeling domain shifts rather than ignoring them.
Multi-Source Training: Evidence that training on combined data from multiple indices (NIFTY 50 + NIFTY Bank) improves generalization, allowing models to predict prices for assets with limited historical data.

4. Results

The study evaluated three models (AHH, ADS, AE) against a Black-Scholes-Merton (BSM) benchmark and across four test sets (Typical/Atypical for NIFTY 50 and NIFTY Bank).

Benchmark Comparison: All data-driven models significantly outperformed the BSM benchmark, reducing RMSE by more than 50% in most cases.
Typical Market Performance:
- AHH outperformed ADS when the test data distribution matched the training data (same asset, stable period). This validates the theoretical homogeneity assumption in stable conditions.
Atypical Market Performance (COVID-19):
- ADS significantly outperformed AHH. AHH failed to generalize to the high-volatility regime, while ADS maintained accuracy due to the domain adaptation mechanism.
- Ensemble (AE) achieved the lowest RMSE across all test sets, successfully balancing the strengths of both approaches.
Multi-Source Training:
- Models trained on combined data (N50 + BNF) performed consistently well, often outperforming single-source cross-symbol predictions.
- The ensemble model trained on multi-source data achieved the best overall results, with RMSE within 7% of the best possible single model for any given test set.
Synthetic Data Validation: Experiments using simulated Geometric Brownian Motion with varying volatilities confirmed that ADS is less sensitive to volatility shifts than AHH, and the Ensemble model provides the most stable performance curve.
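
The qualitative pattern in the synthetic experiment can be reproduced with a deliberately simple toy, shown below. It is not a reproduction of the paper's experiment: there is no learning and no simulation, only closed-form at-the-money Black-Scholes-Merton prices at zero rate. The "HH-style" predictor reuses the normalized price seen at the training volatility, while the "DS-style" predictor reuses the price divided by $\sigma\sqrt{T}$ and rescales it with the test volatility, which is assumed to be known. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def atm_call_over_spot(sigma: float, ttm: float) -> float:
    """BSM at-the-money C/S with zero rate: 2*N(sigma*sqrt(T)/2) - 1."""
    return math.erf(sigma * math.sqrt(ttm) / (2.0 * math.sqrt(2.0)))


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


SIGMA_TRAIN = 0.15
MATURITIES: List[float] = [days / 365 for days in (7, 30, 60, 90)]

# "Training": both toy predictors only ever see prices generated at SIGMA_TRAIN.
hh_table = [atm_call_over_spot(SIGMA_TRAIN, t) for t in MATURITIES]
ds_table = [c / (SIGMA_TRAIN * math.sqrt(t)) for c, t in zip(hh_table, MATURITIES)]


def evaluate(sigma_test: float) -> Tuple[float, float]:
    """Return (RMSE of HH-style predictor, RMSE of DS-style predictor)."""
    truth = [atm_call_over_spot(sigma_test, t) for t in MATURITIES]
    hh_pred = hh_table  # reuse C/S unchanged, ignoring the volatility shift
    ds_pred = [u * sigma_test * math.sqrt(t) for u, t in zip(ds_table, MATURITIES)]
    return rmse(hh_pred, truth), rmse(ds_pred, truth)


results = {sigma: evaluate(sigma) for sigma in (0.15, 0.30, 0.60)}
for sigma, (err_hh, err_ds) in results.items():
    print(f"test vol {sigma:.2f}: RMSE HH-style={err_hh:.5f}, DS-style={err_ds:.5f}")
```

In this toy the HH-style error grows with the gap between training and test volatility, while the DS-style error stays small, which matches the direction of the result reported above. Real data adds noise, non-ATM strikes, and an estimated rather than known volatility, so the margin there will be smaller.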

5. Significance

Bridging Theory and Practice: The paper successfully translates abstract domain adaptation concepts from machine learning into the language of stochastic finance, providing a theoretical justification for why certain data-driven models fail and how to fix them.
Interpretability: Unlike "black box" deep learning models, this approach retains interpretability by using financial variables (volatility scalar, moneyness) and avoiding macroeconomic features.
Practical Application: The proposed ensemble model offers a robust solution for traders and risk managers who need accurate pricing models that do not break down during market crashes or when applied to new, data-scarce assets.
Future Direction: It establishes a foundation for "universal" option pricing models that are not specific to a single asset class, which could change how derivatives are priced in diverse and volatile global markets.