Comparative e-backtests for general risk measures

This paper develops a non-parametric sequential framework that uses e-values and e-processes to perform anytime-valid comparative backtests for general elicitable risk measures. The approach remains robust under dependence and model misspecification, and a modified three-zone approach makes model evaluation more informative.

Zhanyi Jiao, Qiuqi Wang, Yimiao Zhao

Published 2026-03-06

Imagine you are the head of a bank's risk department. Every day, you have to predict how much money you might lose in a storm (a market crash). You build a complex computer model to make these predictions. But how do you know if your model is actually good? And more importantly, is it better than the "standard" model the government regulators use?

This paper introduces a new, smarter way to answer those questions using a concept called "E-values" (think of them as Evidence Accumulators).

Here is the breakdown of the paper's ideas using simple analogies:

1. The Problem: The Old Way vs. The New Way

The Old Way (Standard Backtesting):
Imagine you are a teacher grading a student's homework. The old method asks: "Did the student get the right answer?"
If the student's prediction was close to the actual loss, they pass. If not, they fail.

  • The Flaw: In the real world, we don't know the "perfect" answer in advance. Also, regulators don't just want to know if your model is "okay"; they want to know if it's better than the government's standard model. The old way can't easily compare two models against each other while data is still coming in.

The New Way (Comparative E-Backtesting):
This paper suggests a new game. Instead of asking "Is the answer right?", we ask: "Is Model A beating Model B?"
We use E-values as a scorecard.

  • The Metaphor: Imagine a betting game at the horse races.
    • You have two horses: Horse A (your bank's model) and Horse B (the regulator's model).
    • Every day, a race happens (the market moves).
    • You place a bet on which horse is faster.
    • If your horse wins, your "Evidence Pot" (the E-value) grows. If it loses, the pot shrinks.
    • The magic of this paper is that you can check the pot at any time. You don't have to wait until the end of the year. If the pot gets huge, you know your horse is winning; if it stays small, your horse is losing. (A minimal sketch of this betting scheme follows right after this list.)
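To make the metaphor concrete, here is a minimal sketch of one daily "bet" compounding into an e-value. Everything specific below is an illustrative assumption rather than the paper's construction: a fixed betting fraction `LAMBDA`, the pinball (quantile) loss as the scoring function for a Value-at-Risk forecast, and simulated Gaussian losses.

```python
import numpy as np

ALPHA = 0.95    # VaR confidence level
LAMBDA = 0.05   # fixed betting fraction (hypothetical choice)

def pinball_loss(var_forecast, loss):
    """Consistent scoring function for the ALPHA-quantile (VaR):
    a smaller average score means a better forecast."""
    return (ALPHA - (loss <= var_forecast)) * (loss - var_forecast)

def comparative_e_process(var_A, var_B, losses, lam=LAMBDA):
    """One bet per day on 'model A scores better than model B'.

    Each factor 1 + lam * d has expectation <= 1 under the null
    'A is not better than B', provided the score differences are
    bounded so every factor stays positive (assumed here)."""
    d = pinball_loss(var_B, losses) - pinball_loss(var_A, losses)
    return np.cumprod(1.0 + lam * d)

# Toy run: model A uses the true 95% quantile of N(0, 1) (~1.645),
# model B is too optimistic, so the evidence pot should grow.
rng = np.random.default_rng(0)
losses = rng.normal(size=2000)
E = comparative_e_process(np.full(2000, 1.645), np.full(2000, 1.0), losses)
print(f"final e-value: {E[-1]:.1f}")   # large => evidence that A wins
```

Crossing a threshold like 1/0.05 = 20 counts as strong evidence, and the next section explains why that threshold can be checked at any time.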

2. The Secret Sauce: "E-Processes" and "Anytime Validity"

In traditional statistics, you have to decide how many races you will run before you start. If you peek at the results halfway through and decide to stop early, you might cheat the math (this is called "p-hacking").

This paper uses E-processes, which are like a Self-Adjusting Compass.

  • The Analogy: Imagine you are walking through a foggy forest. You have a compass that tells you if you are walking North (towards the truth) or South.
  • The Superpower: No matter how long you walk, or when you stop to check the compass, it never lies. You can stop after 10 steps or 10,000 steps, and the compass still gives you a valid answer. This is called "Anytime Validity." It means regulators can check your model's performance every single day without worrying about statistical tricks. (The simulation sketch after this list makes this concrete.)
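A short simulation illustrates why the compass never lies. The guarantee behind it is Ville's inequality: under the null hypothesis, the probability that an e-process ever crosses 1/α is at most α, no matter when you look. The setup below is hypothetical (fair bets with zero-mean payoffs), not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 0.05, 0.1
n_paths, horizon = 5000, 500

# Fair bets: payoffs uniform on [-1, 1] have zero mean, so each factor
# 1 + lam * d has expectation 1 and E_t is a nonnegative martingale
# starting at 1 -- i.e., the null hypothesis is true on every path.
d = rng.uniform(-1.0, 1.0, size=(n_paths, horizon))
E = np.cumprod(1.0 + lam * d, axis=1)

# How often does the running pot EVER cross 1/alpha, at any stopping time?
crossed = (E.max(axis=1) >= 1.0 / alpha).mean()
print(f"P(sup E_t >= {1/alpha:.0f}) ~= {crossed:.4f}  (Ville bound: {alpha})")
```

However long the horizon, the fraction of paths that ever cross the threshold stays below α, which is exactly what lets you stop and check on any day.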

3. The "Three-Zone" Traffic Light System

The paper proposes a new way to interpret the results, moving away from a simple "Pass/Fail" to a Traffic Light System:

  • 🟢 Green Light: Your model is clearly beating the regulator's model. (The Evidence Pot for "You are better" is huge).
  • 🔴 Red Light: Your model is clearly worse than the regulator's model. (The Evidence Pot for "You are worse" is huge).
  • 🟡 Yellow Light: It's a tie, or the race is too close to call yet.
    • The Innovation: The authors add a clever twist here. Even in the Yellow zone, they look at Speed and Magnitude.
    • Speed: Which model started winning faster?
    • Magnitude: Which model is winning by a larger margin?
    • This helps break ties and gives a clearer answer even when the models are very similar. (A sketch of such a decision rule follows below.)
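Here is a minimal sketch of how two evidence pots could drive the traffic light. The thresholds and the function are illustrative assumptions; the paper's modified three-zone approach additionally weighs how fast and by how much each e-process has grown.

```python
def traffic_light(e_better: float, e_worse: float, alpha: float = 0.05) -> str:
    """Map e-values for 'A is better' / 'A is worse' to a zone.

    e_better: evidence pot for 'your model beats the standard'
    e_worse : evidence pot for 'your model trails the standard'
    """
    threshold = 1.0 / alpha      # e.g. 20 for alpha = 0.05
    if e_better >= threshold:
        return "green"           # strong evidence your model is better
    if e_worse >= threshold:
        return "red"             # strong evidence your model is worse
    return "yellow"              # too close to call so far

print(traffic_light(e_better=35.0, e_worse=0.4))   # -> green
```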

4. Handling the Chaos: Structural Changes

Financial markets are messy. Sometimes they are calm, and sometimes they are chaotic (like during a pandemic or a financial crisis).

  • The Problem: A model might be great in calm weather but terrible in a storm.
  • The Solution: The paper suggests that when a major event happens (a "structural change"), you can reset the race.
  • The Analogy: Imagine a marathon. If the weather suddenly changes from sunny to a blizzard, you don't disqualify the runners; you just start a new leg of the race to see who handles the snow better. The paper's method allows regulators to "restart" the E-value counter during crises to see which model adapts best to the new reality. (See the restart sketch below.)
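As a rough illustration, restarting just means resetting the cumulative product to 1 on the day a regime shift is declared. The change flags here are a hypothetical input; how the paper actually detects or declares structural changes is beyond this sketch.

```python
import numpy as np

def e_process_with_restarts(factors, change_flags):
    """Cumulative product of daily betting factors, reset at regime shifts.

    factors      : daily e-factors, e.g. 1 + lam * payoff
    change_flags : True on days a structural change is declared
    """
    E, path = 1.0, []
    for f, restart in zip(factors, change_flags):
        if restart:
            E = 1.0          # start a new leg of the race
        E *= f
        path.append(E)
    return np.asarray(path)

# Example: three calm days, a declared crisis, then two more days.
factors = [1.1, 1.2, 0.9, 1.3, 1.4]
flags   = [False, False, False, True, False]
print(e_process_with_restarts(factors, flags))  # pot resets before day 4
```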

5. Why This Matters

  • For Banks: It gives them a fair, real-time way to prove their internal models are working, potentially saving money on capital reserves if they can show their models beat the regulatory standard.
  • For Regulators: It provides a robust tool to catch bad models immediately, even if the data is messy or the market is crashing. It stops banks from "gaming" the system by waiting until the end of the year to fix their models.
  • For Everyone: It moves risk management from "guessing if we are right" to "measuring who is winning right now."

Summary

This paper replaces the old, rigid "Pass/Fail" test with a dynamic, real-time scoreboard that works even when the data is messy or the market is crashing. It uses a "betting" metaphor to accumulate evidence, allowing regulators and banks to see clearly which risk model is the champion, at any moment in time.