Convergence of Neural Network Policies for Risk--Reward Optimization

This paper presents a neural network framework for multi-period risk-reward stochastic control with discontinuous, constrained policies. The authors prove that the empirical optimum of the parametrized objective converges in probability to the true optimal value as network capacity and sample size increase, and numerical experiments confirm the method's accuracy and out-of-sample robustness.

Chang Chen, Duy-Minh Dang

Published Mon, 09 Ma

Imagine you are the captain of a ship navigating through a stormy sea. Your goal is twofold: you want to reach your destination with as much treasure (reward) as possible, but you also want to avoid sinking or hitting a reef (risk).

This paper is about teaching a computer (specifically, a Neural Network) to be the best possible captain for this journey, even when the rules of the sea are tricky and the map isn't perfectly smooth.

Here is the breakdown of their work using simple analogies:

1. The Problem: The "Two-Step" Dance

In many real-world problems (like managing a retirement fund), you don't make just one decision at each point in time. You have to make a two-step move:

  1. Step 1 (The Adjustment): You decide how much money to take out of your pocket (withdrawal) or put in (deposit). This has strict limits—you can't take out more than you have, and you can't take out less than a minimum emergency amount.
  2. Step 2 (The Allocation): Once you've adjusted your cash, you decide how to split the remaining money between different investments (like stocks and bonds). This is like a pie chart where the slices must add up to 100%.

The tricky part? The best strategy often involves sharp turns. For example, if your wealth drops below a certain line, the smart move might be to immediately switch from "spending normally" to "spending the bare minimum." This is a "bang-bang" control: you are either at the max or the min, with very little in between.
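In code, a bang-bang rule is nothing fancier than a threshold function. Here is a toy sketch; the threshold and dollar amounts are invented for illustration, not taken from the paper:

```python
def bang_bang_withdrawal(wealth, threshold=600_000.0,
                         q_min=30_000.0, q_max=80_000.0):
    """Toy bang-bang rule: withdraw the maximum when wealth is above a
    threshold, the bare minimum otherwise. Near the threshold, a tiny
    change in wealth flips the decision entirely -- that is the "sharp
    turn" that makes these strategies hard for grid methods."""
    return q_max if wealth >= threshold else q_min

# Wealth of 600,001 vs 599,999: nearly identical inputs, very different outputs.
high = bang_bang_withdrawal(600_001.0)  # -> 80,000 (spend the max)
low = bang_bang_withdrawal(599_999.0)   # -> 30,000 (spend the minimum)
```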

2. The Old Way vs. The New Way

  • The Old Way (Grids): Traditionally, to solve this, mathematicians would draw a giant grid on a map of all possible wealth levels and time periods. They would calculate the best move for every single square on the grid.
    • The Flaw: If the map gets too complex (too many variables), the grid becomes so huge it crashes the computer. Also, grids struggle with those "sharp turns" because they are too rigid.
  • The New Way (Neural Networks): The authors use a Neural Network (a type of AI) to learn the strategy. Instead of a rigid grid, the AI is a flexible, smooth function that can learn the rules.
    • The Innovation: They built special "gates" into the AI's output. Think of it like a smart faucet. No matter how hard the AI tries to turn the handle, the faucet is physically designed so the water flow cannot go below zero or above the pipe's capacity. This forces the AI to always obey the rules (constraints) without needing to be told "don't do that" every time.
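To make the faucet idea concrete, here is a minimal sketch of such output "gates," assuming a sigmoid squashes the withdrawal into its allowed band and a softmax turns raw scores into allocation weights. The paper's exact gate construction may differ; all names and numbers here are illustrative:

```python
import numpy as np

def gated_outputs(raw_withdrawal, raw_weights, q_min, q_max):
    """Map unconstrained network outputs to feasible decisions.

    - A sigmoid 'gate' squashes the withdrawal into [q_min, q_max],
      like a faucet that physically cannot exceed the pipe's capacity.
    - A softmax 'gate' turns raw scores into allocation weights that
      are nonnegative and sum to exactly 1 (the 100% pie chart).
    """
    q = q_min + (q_max - q_min) / (1.0 + np.exp(-raw_withdrawal))  # sigmoid gate
    z = np.exp(raw_weights - np.max(raw_weights))                  # stable softmax
    w = z / z.sum()
    return q, w

# Whatever the network outputs, the constraints hold by construction.
q, w = gated_outputs(2.5, np.array([0.3, -1.0, 1.2]),
                     q_min=30_000.0, q_max=80_000.0)
```

Because the constraints are baked into the architecture, training never has to penalize infeasible moves; the network simply cannot produce them.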

3. The Big Challenge: "Discontinuous" Moves

The biggest hurdle in this research was proving that this AI method actually works mathematically.

  • The Issue: Most math proofs assume that if you change your wealth by a tiny bit, your strategy changes by a tiny bit (smoothness). But in our "bang-bang" scenario, a tiny change in wealth might trigger a massive change in strategy (e.g., from "spend $50" to "spend $0").
  • The Solution: The authors proved that even if the strategy has these sharp "cliffs," the AI can still learn it perfectly, as long as the ship rarely lands exactly on the edge of the cliff.
    • The Analogy: Imagine a tightrope walker. If the wind blows the walker exactly onto the edge of the rope, they might fall. But if the wind is random, the walker almost never lands on that exact edge, so a smooth approximation (the AI) can still predict the path accurately. The authors proved that in these financial problems, the "wind" (market randomness) ensures you almost never land exactly on the "cliff" where the strategy breaks.
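You can see this numerically: a steep sigmoid disagrees with a "cliff" strategy only in a thin band around the jump, so random inputs almost never expose the mismatch. A toy demo, with all values invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x):
    """The discontinuous 'cliff' strategy: jumps at x = 0.5."""
    return np.where(x >= 0.5, 1.0, 0.0)

def smooth(x, k=200.0):
    """A steep sigmoid the network can represent exactly."""
    return 1.0 / (1.0 + np.exp(-k * (x - 0.5)))

# Random "wind": wealth levels drawn from a continuous distribution.
x = rng.uniform(0.0, 1.0, size=100_000)
disagree = np.mean(np.abs(step(x) - smooth(x)) > 0.1)
# Only the rare samples landing in a thin band around 0.5 disagree noticeably.
```

With `k=200`, the disagreement band is only a few percent of the interval wide, and making the sigmoid steeper shrinks it further.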

4. The Proof: Does it Converge?

The paper's main achievement is a mathematical guarantee called Convergence in Probability.

  • What it means: If you give the AI more computing power (a bigger brain) and more practice data (more simulated storms), the AI's performance will get closer and closer to the perfect theoretical strategy.
  • The Result: They showed that the error doesn't just get small; it gets small reliably. If you run the training 100 times, 99 of those times the AI will find a strategy that is almost as good as the best possible one.
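The sampling half of that story is easy to demonstrate: Monte Carlo estimates cluster more and more tightly around the truth as the sample size grows. A toy sketch, where the objective below is invented for illustration and is not the paper's:

```python
import numpy as np

rng = np.random.default_rng(42)

def estimate_objective(n):
    """Monte Carlo estimate of a toy objective E[g(X)] from n samples:
    here, a downside-risk measure of simulated annual returns."""
    x = rng.normal(loc=0.05, scale=0.2, size=n)
    return np.mean(np.minimum(x, 0.0) ** 2)

# Repeat the experiment many times at two sample sizes and compare spread.
small = np.array([estimate_objective(100) for _ in range(200)])
large = np.array([estimate_objective(10_000) for _ in range(200)])
# The n = 10,000 estimates scatter far less than the n = 100 estimates:
# that shrinking, reliable scatter is what "convergence in probability" buys you.
```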

5. The Real-World Test

To prove this wasn't just theory, they tested it on a Retirement Decumulation Problem (how a retiree should spend their savings over 30 years).

  • The Setup: A retiree has $1 million. They need to withdraw money yearly to live on, but they also need to invest the rest to beat inflation. They want to maximize their spending while ensuring they don't run out of money (risk).
  • The Outcome:
    • The AI learned a strategy that looked almost identical to the "perfect" strategy calculated by the slow, old-fashioned grid method.
    • It correctly learned the "bang-bang" behavior: spending the maximum when rich, and the minimum when poor.
    • It worked even when tested on new, unseen data (out-of-sample), proving it didn't just "memorize" the practice storms but actually learned how to sail.
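Putting the pieces together, a toy decumulation simulation shows how a bang-bang withdrawal rule interacts with random market returns over 30 years. All parameter values and the naive rule itself are invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_retirement(w0=1_000_000.0, years=30,
                        q_min=30_000.0, q_max=80_000.0,
                        mu=0.05, sigma=0.15):
    """Toy decumulation path: withdraw via a naive bang-bang rule,
    invest the remainder in a single risky asset with lognormal returns."""
    wealth = w0
    path = [wealth]
    for _ in range(years):
        q = q_max if wealth >= 600_000.0 else q_min  # bang-bang withdrawal
        wealth = max(wealth - q, 0.0)                # can't withdraw more than held
        wealth *= np.exp(rng.normal(mu - 0.5 * sigma**2, sigma))  # market return
        path.append(wealth)
    return path

path = simulate_retirement()
# path records wealth at each year; the neural network's job is to choose
# the withdrawal and allocation at every step far better than this naive rule.
```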

Summary

This paper is like building a self-driving car for complex financial decisions.

  1. It handles the rules of the road (constraints) automatically.
  2. It can handle sudden, sharp turns in the road (discontinuous strategies) that usually break other navigation systems.
  3. Most importantly, they proved mathematically that the more you train it, the better it gets, eventually reaching the level of a human expert who has seen every possible road condition.

This opens the door for using AI to solve complex, high-stakes financial problems that were previously too difficult or risky to solve with traditional math.