Imagine you are the captain of a ship trying to navigate through a foggy, unpredictable ocean to reach a treasure island. The problem is that the ocean doesn't behave the way a normal map suggests: the currents depend on where you've been in the past, not just where you are right now. This is what mathematicians call a "fully non-Markovian" system. It's like trying to predict the weather from a memory that stretches back to the beginning of time, which makes calculating the best route incredibly hard.
Furthermore, you don't have a perfect map. You know the general rules of the ocean, but you aren't sure about the exact strength of the wind or the current (these are the "unknown model parameters").
This paper presents a brilliant new way to teach a computer (specifically, a Deep Learning AI) how to steer this ship optimally, even when the rules are fuzzy and the ocean has a long memory. Here is the breakdown of their solution using simple analogies:
1. The Problem: The "Memory" Trap
In standard navigation (Markovian), you only need to know your current position to decide your next move. But in this "rough" ocean (like financial markets with "rough volatility" or systems driven by fractional Brownian motion), your next move depends on your entire history.
- The Analogy: Imagine trying to predict the next step of a dancer. In a normal dance, you just look at their current pose. In this "rough" dance, you have to remember every single step they took since the music started to guess their next move. This makes calculating the perfect path computationally impossible with traditional methods.
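To make the "long memory" concrete, here is a toy sketch (not the paper's exact model) of a discrete Volterra-type process, the kind of construction behind fractional Brownian motion and rough volatility. Each new value is a weighted sum over the *entire* history of random shocks, so knowing the current value alone is never enough:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_rough_path(n_steps, hurst=0.1, dt=1.0 / 252):
    """Toy non-Markovian path: X_t = sum_{s<t} (t-s)^(H-1/2) dW_s.

    The power-law kernel means every past shock dW_s still influences
    the present -- there is no finite "state" that summarizes history.
    """
    dW = rng.normal(0.0, np.sqrt(dt), n_steps)  # the random shocks
    X = np.zeros(n_steps + 1)
    for t in range(1, n_steps + 1):
        s = np.arange(t)
        kernel = ((t - s) * dt) ** (hurst - 0.5)  # long-memory weights
        X[t] = np.sum(kernel * dW[:t])            # needs the WHOLE past
    return X

path = simulate_rough_path(100)
```

A Hurst parameter below 0.5 (here 0.1) gives the jagged, "rough" behavior observed in market volatility; the kernel names and parameters here are illustrative choices, not the paper's.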
2. The Solution: "Off-Model" Training (The Universal Sandbox)
Usually, to teach an AI to navigate, you would simulate thousands of voyages under a specific set of rules (e.g., "Wind is always 10 knots"). If the wind changes to 12 knots, you have to throw away all your simulations and start over. This is slow and expensive.
The authors propose a "Universal Sandbox" approach:
- The Metaphor: Instead of training the AI on a specific ocean, you build a massive, generic "training pool" that covers every possible ocean condition you might encounter. You generate a huge dataset of random waves and currents under a "Reference Law" (a safe, standard simulation).
- The Magic Trick: You don't re-simulate the ocean every time your model changes. Instead, you use Importance Sampling. Think of this as a "re-weighting" system.
- Imagine you have a photo album of the ocean taken under "Average Conditions."
- If you suddenly need to navigate a "Stormy Ocean," you don't take new photos. You simply put a filter over the old photos that says, "Treat these waves as if they were 20% bigger."
- This allows the AI to learn from the same dataset, just by adjusting the math (the weights) to fit the new reality.
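The "filter over old photos" idea is classical importance sampling via a likelihood ratio. As a hedged, minimal sketch (a one-dimensional stand-in for the paper's setup): draw samples once under a reference law, here N(0, 1), then estimate expectations under a shifted model N(theta, 1) purely by re-weighting, with no new simulation:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 200_000)  # the fixed "training pool" (reference law)

def expectation_under(theta, f):
    """E under N(theta, 1) of f(X), using only reference samples.

    w = dP_theta / dP_0 is the Gaussian likelihood ratio: the "filter"
    that tells us how much each old sample counts under the new model.
    """
    w = np.exp(theta * x - 0.5 * theta**2)
    return np.mean(w * f(x))

# The mean under N(0.3, 1) should come out near 0.3 -- no re-simulation.
est = expectation_under(0.3, lambda z: z)
```

In the paper's continuous-time setting the same role is played by a Girsanov-type density between the reference law and the model law; the Gaussian formula above is just the simplest instance of that re-weighting.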
3. The Adaptive Update: "Warm Starts"
The paper introduces an Adaptive Learning mechanism.
- The Old Way: If you realize your map was wrong (e.g., the current is faster than you thought), you fire the AI, delete its brain, and retrain it from scratch with new data. This takes forever.
- The New Way (Adaptive): The AI keeps its brain. When the parameters change, you simply update the weights (the filters mentioned above) and give the AI a "warm start." It remembers what it learned about the general structure of the ocean and just tweaks its strategy for the new specific conditions.
- The Benefit: This is like a chess player who, instead of relearning the rules of chess every time a new opponent sits down, simply adjusts their strategy based on the opponent's style while keeping their core knowledge intact.
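The warm-start idea can be sketched with a deliberately tiny stand-in for the neural network: a linear model trained by gradient descent. All names and numbers below are illustrative, not from the paper. When the model parameters drift, continuing from the old weights converges far faster than re-initializing from scratch:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))  # fixed "training pool" of inputs

def train(w, target_coefs, steps, lr=0.1):
    """Least-squares gradient descent toward y = X @ target_coefs.

    Returns the final weights and the final mean-squared error.
    """
    y = X @ target_coefs
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w, np.mean((X @ w - y) ** 2)

w0 = np.zeros(3)
# Train fully under the original model parameters.
w_trained, _ = train(w0, np.array([1.0, -2.0, 0.5]), steps=200)

# The parameters shift slightly. Warm start (keep the brain) vs cold start.
_, warm_loss = train(w_trained, np.array([1.1, -1.9, 0.6]), steps=20)
_, cold_loss = train(w0, np.array([1.1, -1.9, 0.6]), steps=20)
```

After the same small budget of 20 update steps, the warm-started weights sit much closer to the new optimum than the cold-started ones, which is exactly the economy the adaptive update buys.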
4. Why This Matters (The Real-World Impact)
This isn't just theoretical math; it solves real problems in finance and engineering:
- Financial Hedging: In the stock market, prices often behave like "rough" paths (they jump and wiggle in ways that don't fit simple models). This method helps banks calculate the perfect hedge to protect against losses without needing to re-run massive simulations every time market volatility changes.
- Model Risk: In the real world, we never know the "true" model of the economy. This method allows systems to adapt quickly as we learn more, separating the error caused by "bad math" (Monte Carlo error) from the error caused by "wrong assumptions" (Model Risk).
Summary
Think of this paper as inventing a universal navigation system for a ship in a foggy ocean with a long memory.
- Build a massive, generic training library (Off-Model Training).
- Use math filters to instantly adapt that library to any specific weather condition (Importance Sampling).
- Update the AI's strategy on the fly without deleting its previous learning (Adaptive Learning).
This makes complex, memory-dependent decision-making fast, scalable, and robust enough for the real world, where nothing is ever perfectly predictable.