This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are the captain of a ship steering through a thick, unpredictable fog. You have a weather forecast telling you that a storm might hit in three days. But here's the catch: the forecast is just a probability, not a guarantee. Do you change course now? Do you drop anchor? Do you keep sailing?
This is the daily reality for public health officials during an epidemic. They have to make life-or-death decisions (like closing schools or expanding hospitals) based on forecasts that are often messy, delayed, and uncertain.
For a long time, scientists who built these disease forecasts were like weathermen who only cared about their own scorecard. They asked: "Did my math match the actual weather perfectly?" They used complex statistics to see if their predictions were "calibrated" or "sharp."
This paper argues that this is the wrong question.
The authors, a team of statisticians and epidemiologists, say: "It doesn't matter if your math is perfect if it doesn't help the captain steer the ship."
Here is the paper's new approach, explained simply:
1. The "Scorecard" Problem
Imagine two weather forecasters.
- Forecaster A is a genius at predicting the average temperature for the whole month. Their math is perfect.
- Forecaster B is terrible at averages but is amazing at predicting when a sudden, dangerous frost will kill the crops.
If you are a farmer, Forecaster B is infinitely more valuable, even if their "average" score is lower.
The paper argues that current ways of judging disease forecasts are like only looking at Forecaster A's average score. They miss the fact that a decision-maker (like a hospital director) might only care about the "frost" (a sudden spike in patients). If a model is good at predicting the average but misses the spike, it's useless for the decision-maker, even if the math looks "good."
2. The New Framework: "The Decision-Maker's Menu"
The authors propose a new way to evaluate forecasts, which they call a Decision-Value Framework. Instead of asking "Is this model statistically accurate?", they ask: "How much money, lives, or time does this model save a specific decision-maker?"
They introduce a few key concepts using simple metaphors:
The Cost-Loss Ratio (The Price of Being Wrong):
Imagine you have to decide whether to buy an umbrella.
- Cost of Action: The umbrella costs $10.
- Loss if you don't act: If it rains and you don't have an umbrella, you get soaked and your suit is ruined (worth $100).
- The Decision: Your cost-loss ratio is $10 ÷ $100 = 10%, so whenever the forecast puts the chance of rain above 10%, buying the umbrella is worth it. The "value" of the forecast depends on how much you hate getting soaked versus how much you hate spending $10.
- In the paper: Different decision-makers have different cost-loss ratios. A hospital with plenty of spare beds might only act if the risk is 90%; a hospital with no spare beds might act if the risk is 10%. The paper's framework tests models against these specific preferences, not just a generic "average" (see the worked sketch below).
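For readers who want the arithmetic spelled out, here is a minimal sketch of the classic cost-loss decision rule the umbrella story describes. The function name and numbers are illustrative choices, not taken from the paper: you act whenever the forecast probability of the bad event exceeds your cost-loss ratio (the cost of acting divided by the loss you would suffer by not acting).

```python
def should_act(forecast_prob: float, cost: float, loss: float) -> bool:
    """Classic cost-loss rule: act if the forecast probability of the bad
    event exceeds the cost-loss ratio (cost / loss). Illustrative sketch."""
    return forecast_prob > cost / loss

# Umbrella example: a $10 umbrella vs. a $100 ruined suit -> threshold = 0.10
print(should_act(forecast_prob=0.15, cost=10, loss=100))   # True: buy the umbrella

# A hospital where acting costs nearly as much as the loss it prevents
print(should_act(forecast_prob=0.15, cost=90, loss=100))   # False: wait and see
```

The same two lines of arithmetic separate the two hospitals above: the one with plenty of beds behaves like a decision-maker with a cost-loss ratio near 0.9, the one with no beds like a decision-maker near 0.1.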
Murphy Diagrams (The "What-If" Map):
Think of this as a map that shows you exactly where a model shines and where it fails.
- Instead of giving you one single number (like "85% accurate"), it draws a graph.
- One end of the graph asks: "How well does this model predict a small outbreak?"
- The other end asks: "How well does it predict a massive, deadly wave?"
- This helps a decision-maker see: "Oh, Model X is great for small waves, but Model Y is the only one that warns us about the massive ones." (A toy version of this kind of sweep is sketched below.)
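To make that concrete, here is a toy sweep in the spirit of a Murphy diagram. It is not the paper's construction (which uses proper scoring machinery evaluated across all thresholds); the case counts, the two "models", and the hit-rate score below are invented purely to show how the ranking of models can flip as the threshold of concern changes.

```python
import numpy as np

# Toy data: observed weekly cases, plus two hypothetical models' point forecasts.
# Model A tracks ordinary weeks well; Model B is noisier but catches the big spikes.
observed = np.array([800, 1200, 900, 15000, 1100, 950, 22000, 1000])
model_a  = np.array([900, 1100, 1000, 4000, 1200, 1000, 6000, 1100])
model_b  = np.array([1500, 600, 1700, 13000, 400, 1800, 20000, 500])

def hit_rate(forecast, observed, threshold):
    """Of the weeks where cases actually exceeded the threshold,
    what fraction did the forecast also put above the threshold?"""
    exceeded = observed > threshold
    if not exceeded.any():
        return float("nan")
    return float(np.mean(forecast[exceeded] > threshold))

for threshold in (1000, 10000):  # "small outbreak" vs. "massive wave"
    print(f"> {threshold:>6} cases:  Model A {hit_rate(model_a, observed, threshold):.2f}"
          f"   Model B {hit_rate(model_b, observed, threshold):.2f}")
```

Run on these made-up numbers, Model A catches every small exceedance but misses both massive waves, while Model B does the reverse, which is exactly the trade-off a Murphy-style plot makes visible at a glance.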
Predictability (The Fog Meter):
Sometimes, the fog is just too thick. The disease is changing so fast (new variants, people changing their behavior) that no one can predict it well.
- The paper suggests measuring this "fog." If the fog is thick (low predictability), even the best model might fail.
- This is a safety check. It tells decision-makers: "Hey, the system is chaotic right now. Don't trust any model too much; be extra cautious." (One crude way of gauging the fog is sketched below.)
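The paper's own "fog" measure is not reproduced here. As a loose illustration of the idea, one simple proxy is to ask how much better the models are doing than a naive "next week looks like this week" baseline: when even good models barely beat it, the system is probably in a hard-to-predict phase. The function name and error numbers below are hypothetical.

```python
from statistics import mean

def skill_vs_baseline(model_errors, baseline_errors):
    """Crude 'fog meter': how much smaller the model's forecast errors are than
    a naive persistence baseline's. Values near (or below) zero suggest the
    epidemic is currently hard to predict, so no model should be over-trusted.
    (Illustrative proxy only, not the paper's predictability measure.)"""
    return 1.0 - mean(model_errors) / mean(baseline_errors)

# Made-up weekly absolute errors (in cases) during a calm stretch vs. a chaotic one.
calm    = skill_vs_baseline([120, 150, 90],   [300, 280, 310])    # ~0.60: fog is thin
chaotic = skill_vs_baseline([900, 1100, 950], [1000, 950, 1050])  # ~0.02: fog is thick
print(f"calm: {calm:.2f}, chaotic: {chaotic:.2f}")
```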
3. The Real-World Test: COVID-19
The authors tested their new system using real data from the COVID-19 pandemic in the US.
- They looked at forecasts for weekly cases.
- They found that the "Ensemble" model (a team of many models voting together) was usually the best overall.
- However, when they looked at specific decision-makers with specific fears (e.g., "I need to know if cases will exceed 10,000 next week"), sometimes a different, simpler model was actually more useful.
- This shows that there is no "one-size-fits-all" best model. The "best" model depends entirely on who is asking the question and what they are willing to risk.
The Big Takeaway
The paper is a call to action for scientists and politicians to talk to each other before the forecast is made.
- Old Way: Scientists build a model, give it a math score, and hand it to a politician. The politician tries to guess what to do with it.
- New Way: Politicians say, "I need to know if we will run out of ICU beds next week, and I am willing to spend $1 million to avoid that risk." Scientists then build and test models specifically to answer that question.
In short: A forecast isn't a crystal ball; it's a tool. Just as you wouldn't use a hammer to drive in a screw, you shouldn't reach for a "statistically perfect" forecast when the decision at hand calls for a "risk-averse" one. This paper gives us instructions for picking the right tool for the job.