Horizon-dependent forecast ranking under structural change: a rolling-origin benchmark for global COVID-19 incidence

This study demonstrates that in global COVID-19 incidence forecasting under structural change, simple baseline models like drift and seasonal naive often outperform complex statistical methods, with optimal model performance being strongly dependent on the forecast horizon.

Original authors: Sesay, M. M., Wembo, M. S.

Published 2026-03-12

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to predict the weather for your town. If you just look at the sky right now, you can guess what it will be like in an hour. But if you try to guess what the weather will be like two weeks from now, you need a much more complex model that accounts for seasons, ocean currents, and climate patterns.

This paper is about doing the exact same thing, but with COVID-19 cases instead of rain clouds. The researchers wanted to know: Which mathematical "weather forecast" works best for predicting the number of new virus cases?

Here is the story of their findings, explained simply.

The Big Problem: The World Was Changing Too Fast

Predicting the spread of a virus is hard because the rules keep changing. Sometimes people wear masks, sometimes they don't. Sometimes testing increases, sometimes it drops. The data is "non-stationary," which is a fancy way of saying the ground is constantly moving beneath our feet.

If you use a model trained on last month's data to predict next month, it might fail completely because the situation has shifted. The researchers realized that asking "Which model is the best?" is the wrong question. The better question is: "Which model is the best for this specific time frame?"

The Race: The Simple vs. The Complex

The researchers set up a race between two types of forecasters:

The "Simple Baselines" (The Old School): These are very basic models.
- Naive: "Tomorrow will be exactly like today."
- Seasonal Naive: "Next Tuesday will be exactly like last Tuesday."
- Drift: "The trend we see today will keep going in the same direction."
- Analogy: Imagine a driver who just keeps the car going in the same direction and speed they are currently doing.
The "Transformed Statistical Models" (The High-Tech): These are complex mathematical engines (like ARIMA, ETS, and Prophet) that try to find hidden patterns, trends, and cycles in the data.
- Analogy: Imagine a driver with a supercomputer, GPS, and satellite data trying to predict every pothole and turn.

The Results: It Depends on How Far You Look

The researchers tested these models to see how well they predicted the future at different "horizons" (how far ahead they were looking): 1 day, 3 days, 1 week, and 2 weeks.

Here is what they found:

The 1-Day and 2-Week Forecast (The Short and Long Haul):
The Drift model (the simple "keep going in the same direction" driver) won! It was surprisingly hard to beat: the complex supercomputers did not manage to do better.
- Why? When the virus is spreading fast, the "trend" is the strongest signal. A simple model that just follows the trend works better than a complex one that gets confused by trying to find patterns that don't exist yet.
The 3-Day Forecast (The Middle Ground):
The Seasonal Naive model won! This is the model that says, "Look at what happened exactly one week ago."
- Why? This suggests that even in a chaotic pandemic, there was a weekly rhythm (maybe people reported more cases on certain days of the week). The simple model caught this rhythm better than the complex ones.
The 7-Day Forecast (The One-Week Mark):
The Drift model won again.
The Complex Models (ARIMA vs. ETS):
The complex models were okay, but they had a rivalry.
- ARIMA was good for short-term predictions (1–3 days).
- ETS (Exponential Smoothing) was better for longer predictions (7–14 days).
- Analogy: ARIMA is like a sprinter who is fast for a short burst. ETS is like a marathon runner who gets stronger the longer the race goes.
The "Prophet" Model:
This model (made by Facebook) did terribly at predicting the exact number of cases. However, it was very "cautious." It drew huge, wide safety nets around its predictions.
- Analogy: Imagine a weather forecaster who says, "It might rain, or it might not, or it might be a hurricane." They are technically "right" because they covered all possibilities, but their prediction is useless because the "rain" could be a drizzle or a tsunami. They were too scared to be specific.

The "Rolling Origin" Test

How did they test this? Instead of training a model once and testing it once (like taking a single driving test), they used a "Rolling Origin" method.

Analogy: Imagine you are learning to drive. Instead of taking one test on Day 1, you take a test every single day for a month. On Day 2, you use what you learned on Day 1 to predict Day 2. On Day 3, you use Days 1 and 2 to predict Day 3.
This mimics real life, where we constantly update our predictions as new data comes in.

The "Structural Change" Twist

The researchers also noticed that the data changed in "phases."

Phase 1: The virus was just starting; not many countries were reporting.
Phase 2: The virus exploded; more countries started reporting.
Phase 3: The virus was everywhere; reporting was stable.

They found that the "best" model changed slightly depending on which phase the world was in. But the main lesson remained: Simple models are incredibly hard to beat.

The Big Takeaway

The most important lesson from this paper is that there is no "One True Model."

Context is King: You cannot just pick the "best" model and use it forever. In this benchmark, the simple trend model was the best choice for tomorrow, next week, and two weeks out, but the "same day last week" model did better three days ahead.
Don't Dismiss the Simple Stuff: In a chaotic, changing world (like a pandemic), simple rules (like "keep going in the same direction") often work better than complex algorithms that try to overthink the data.
Check Your Data: Sometimes the data looks weird not because of the virus, but because more countries started reporting numbers. A good forecaster knows the difference between a real change in the virus and a change in how we count it.

In short: When the world is spinning out of control, sometimes the best way to predict the future is to just look at where you are going right now and assume you'll keep going that way. The fancy computers can wait.

1. Problem Statement

Forecasting infectious disease incidence is inherently difficult due to the nonstationarity of surveillance data. During epidemics like COVID-19, data series are affected by:

Structural changes: Sudden shifts in transmission dynamics, immunity, and behavioral responses.
Evolving reporting conditions: Changes in surveillance systems, testing capacity, and the number of reporting countries (coverage expansion).
Horizon dependency: Model performance often varies significantly depending on the forecast horizon (e.g., 1 day vs. 14 days).

Traditional evaluation methods often rely on a single train-test split, which can be fragile if the split coincides with an atypical epidemic phase. This study addresses the need for a robust evaluation framework that accounts for structural change and explicitly tests how forecast rankings vary across different time horizons.

2. Methodology

Data and Target Construction

Dataset: Global daily COVID-19 incidence data from the Johns Hopkins University (JHU) CSSE, spanning January 22, 2020, to July 27, 2020 ($T=188$ days).
Target Variable: The primary target is the reported daily new cases ($y_t$). A variance-stabilizing transformation ($z_t = \log(1 + y_t)$) was used for model estimation and retrospective segmentation.
Robustness Check: An alternative target was constructed using the first difference of cumulative confirmed counts to verify results against reporting artifacts.
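The target construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy cumulative series are assumptions, and the sketch does not address how negative differences from downward revisions of cumulative counts should be handled, since the summary does not say.

```python
import numpy as np

def daily_incidence_from_cumulative(cumulative):
    """First difference of cumulative counts; the first day keeps its own count."""
    cumulative = np.asarray(cumulative, dtype=float)
    return np.diff(cumulative, prepend=0.0)

def stabilize(y):
    """Variance-stabilizing transform z_t = log(1 + y_t)."""
    return np.log1p(np.asarray(y, dtype=float))

def unstabilize(z):
    """Inverse transform back to the incidence scale: y = exp(z) - 1."""
    return np.expm1(np.asarray(z, dtype=float))

# Made-up cumulative confirmed counts, for illustration only.
cumulative = [1, 3, 3, 8, 20]
y = daily_incidence_from_cumulative(cumulative)
z = stabilize(y)          # scale used for model fitting and segmentation
y_back = unstabilize(z)   # back to the scale on which errors are measured
```

Because accuracy is reported on the original incidence scale, any forecast produced on the log scale has to pass through the inverse transform before errors are computed.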

Evaluation Protocol: Rolling-Origin Backtesting

The study employed a rolling-origin (walk-forward) protocol to mimic real-time forecasting:

Horizons: Forecasts were generated for $h \in \{1, 3, 7, 14\}$ days.
Training Window: An expanding window with a minimum length of $W_{min} = 56$ days was used by default. Models were re-estimated at each origin $t$ using data up to $t-1$.
Metrics: Accuracy was measured using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), symmetric MAPE (sMAPE), and Mean Absolute Scaled Error (MASE) on the original incidence scale.
Statistical Testing: Pairwise differences in accuracy were assessed using the Diebold-Mariano (DM) test with absolute error loss.
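A rolling-origin loop of the kind described above might look like the sketch below. The function name `rolling_origin_errors`, the `forecast_fn(train, h)` interface, and the synthetic series are assumptions made for illustration. The indexing convention (training on observations up to $t-1$ and scoring the value $h$ steps past the last training point) is one reading of the protocol, not a confirmed detail of the paper.

```python
import numpy as np

def rolling_origin_errors(y, forecast_fn, horizons=(1, 3, 7, 14),
                          min_train=56, window=None):
    """Walk forward through y and collect absolute errors per horizon.

    forecast_fn(train, h) returns a point forecast h steps past the end of train.
    window=None gives an expanding window; an integer gives a sliding window.
    """
    y = np.asarray(y, dtype=float)
    errors = {h: [] for h in horizons}
    for t in range(min_train, len(y)):
        start = 0 if window is None else max(0, t - window)
        train = y[start:t]                      # data up to t-1 only
        for h in horizons:
            target = t + h - 1                  # h steps past the last training point
            if target < len(y):                 # skip targets beyond the sample
                errors[h].append(abs(y[target] - forecast_fn(train, h)))
    return {h: np.array(e) for h, e in errors.items()}

def naive(train, h):
    """Persistence forecast, used here only to exercise the loop."""
    return train[-1]

# Synthetic upward-trending series of the same length as the study sample.
rng = np.random.default_rng(0)
y = np.cumsum(rng.poisson(5, size=188)).astype(float)

errs = rolling_origin_errors(y, naive)
mae_by_horizon = {h: e.mean() for h, e in errs.items()}
```

Passing `window=84` or `window=112` switches the same loop to a sliding window, which is the kind of variation the robustness analysis explores. Note that longer horizons yield fewer scored forecasts, because late origins have no observed target.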

Forecasting Models

The benchmark compared:

Simple Baselines:
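- Naive: Persistence of the last observation ($\hat{y}_{t+h} = y_t$).
- Seasonal Naive: Persistence of the most recent observed value from the same day of the week ($\hat{y}_{t+h} = y_{t+h-7k}$ with $k = \lceil h/7 \rceil$, which reduces to $y_{t+h-7}$ for $h \le 7$).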
- Drift: Extrapolation of the average historical trend.
Transformed Statistical Models (fitted on $\log(1+y_t)$):
- ARIMA(log1p): Order selected via AIC.
- ETS(log1p): Exponential Smoothing State Space models (trend/seasonality selected via AIC).
Reference Probabilistic Model:
- Prophet(log1p): Decomposed time series model (weekly seasonality enabled).
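The three simple baselines are short enough to write out in full. The sketch below uses the standard textbook definitions; in particular, the drift slope is taken as the average change between the first and last training observations, which is an assumption about what "average historical trend" means here. Function names and the toy series are illustrative.

```python
import math
import numpy as np

def naive_forecast(train, h):
    """Repeat the last observed value."""
    return float(train[-1])

def seasonal_naive_forecast(train, h, period=7):
    """Repeat the latest observed value that shares the target's day of the week."""
    k = math.ceil(h / period)                   # whole weeks to step back
    return float(train[len(train) - 1 + h - period * k])

def drift_forecast(train, h):
    """Extend the straight line through the first and last observations."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return float(train[-1] + h * slope)

# Toy series of ten daily counts, for illustration only.
train = np.array([10, 12, 11, 15, 18, 17, 20, 22, 24, 23], dtype=float)

forecasts = {
    "naive": naive_forecast(train, 3),
    "seasonal_naive": seasonal_naive_forecast(train, 3),
    "drift": drift_forecast(train, 3),
}
```

Stepping back by whole weeks matters at the 14-day horizon: looking back only seven days from the target would point at a day that has not been observed yet.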

Structural Change Analysis

Retrospective Regime Segmentation: Breakpoints were detected on the variance-stabilized series ($z_t$) using a cost-minimization approach (minimizing within-segment sum of squared deviations).
Usage: These breakpoints were used only to stratify forecast errors retrospectively, not for model training, to avoid information leakage.
Robustness Checks: The study tested sensitivity to segmentation settings, training window policies (expanding vs. sliding windows), coverage-stabilized subsamples (excluding early reporting expansion), and target definitions.
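To make the cost-minimization idea concrete, here is a small sketch that finds the split of a series into a fixed number of contiguous segments with the lowest total within-segment sum of squared deviations. The summary does not state which search algorithm or how many breakpoints the authors used, so the exact dynamic program, the `n_segments` and `min_size` parameters, and the step-shaped toy series are all assumptions.

```python
import numpy as np

def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean over z[i:j]."""
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n

def find_breakpoints(z, n_segments, min_size=1):
    """Exact dynamic program over all splits into n_segments contiguous pieces.

    Returns the start indices of the second and later segments.
    """
    z = np.asarray(z, dtype=float)
    T = len(z)
    prefix = np.concatenate(([0.0], np.cumsum(z)))
    prefix_sq = np.concatenate(([0.0], np.cumsum(z * z)))
    best = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k * min_size, T + 1):
            for i in range((k - 1) * min_size, j - min_size + 1):
                cost = best[k - 1, i] + segment_cost(prefix, prefix_sq, i, j)
                if cost < best[k, j]:
                    best[k, j] = cost
                    back[k, j] = i
    # Walk back from the end of the series to recover the boundaries.
    cuts, j = [], T
    for k in range(n_segments, 1, -1):
        j = back[k, j]
        cuts.append(int(j))
    return sorted(cuts)

# Step-shaped toy series with level shifts at positions 20 and 40.
z = np.array([0.0] * 20 + [3.0] * 20 + [6.0] * 20)
cuts = find_breakpoints(z, n_segments=3, min_size=5)
```

The search is quadratic in the series length for each segment, which is cheap for a series of 188 days. In keeping with the protocol above, the resulting boundaries would only be used to group forecast errors after the fact, never as model inputs.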

3. Key Contributions

Horizon-Wise Benchmarking: Demonstrated that forecast rankings are strongly horizon-dependent; no single model dominates across all timeframes.
Regime-Aware Evaluation: Showed that while structural phases exist, simple baselines remain competitive even when stratified by these phases.
Robustness to Design Choices: Validated that the core findings (horizon dependency and baseline competitiveness) are stable across different segmentation settings, training window lengths, and target constructions.
Prophet Diagnostics: Provided a critical analysis showing that high empirical prediction-interval coverage does not equate to useful forecasting if the intervals are excessively wide (poor sharpness).

4. Key Results

Main Benchmark Findings

Horizon Dependency:
- 1-day, 7-day, 14-day: The Drift model performed best (lowest MAE).
- 3-day: Seasonal Naive performed best, indicating that weekly reporting cycles retain predictive value even in a nonstationary global aggregate.
Statistical Models:
- ARIMA(log1p) was competitive at short horizons (1 and 3 days).
- ETS(log1p) became superior to ARIMA at longer horizons (7 and 14 days).
- Prophet(log1p) was uncompetitive for point forecasting, exhibiting significantly higher errors than all other models.
Statistical Significance (Diebold-Mariano Tests):
- Drift significantly outperformed ARIMA and ETS at 1, 7, and 14 days.
- ETS significantly outperformed ARIMA at 7 and 14 days.
- At 1 day, ETS and ARIMA were statistically indistinguishable.
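For readers unfamiliar with the Diebold-Mariano test, the sketch below shows one common form of it under absolute-error loss. The rectangular-kernel variance with lags up to $h-1$ and the normal approximation for the p-value are assumptions; the summary does not say whether the authors applied a small-sample correction or a different kernel. The error series are simulated, not taken from the paper.

```python
import math
import numpy as np

def diebold_mariano(errors_a, errors_b, h=1):
    """DM test on absolute-error loss; a negative statistic favours model A.

    Returns (statistic, two-sided p-value from the normal approximation).
    """
    loss_a = np.abs(np.asarray(errors_a, dtype=float))
    loss_b = np.abs(np.asarray(errors_b, dtype=float))
    d = loss_a - loss_b                         # loss differential per origin
    n = len(d)
    d_bar = d.mean()
    centered = d - d_bar
    # Long-run variance from autocovariances up to lag h-1.
    long_run_var = centered @ centered / n
    for lag in range(1, h):
        long_run_var += 2.0 * (centered[lag:] @ centered[:-lag]) / n
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; try another kernel")
    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value

# Simulated forecast errors: model A is clearly more accurate than model B.
rng = np.random.default_rng(0)
errors_a = rng.normal(0.0, 1.0, size=200)
errors_b = rng.normal(0.0, 3.0, size=200)
stat, p_value = diebold_mariano(errors_a, errors_b, h=1)
```

Multi-step forecast errors from overlapping windows are autocorrelated, which is why the variance estimate includes autocovariance terms when $h > 1$.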

Prophet Uncertainty Behavior

Prophet achieved high empirical coverage (e.g., 96.6% for an 80% nominal interval at $h=1$) but only by generating extremely wide prediction intervals (mean width ~428k cases vs. actual daily incidence of ~80k).
This indicates over-conservative uncertainty quantification: the intervals contain the observed value more often than their nominal level requires, but they lack sharpness, which makes them operationally useless for decision-making.
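The two diagnostics at play here, coverage and sharpness, are simple to compute side by side. The sketch below is illustrative: the function name and all numbers are made up and are not the paper's data.

```python
import numpy as np

def interval_diagnostics(y_true, lower, upper):
    """Empirical coverage and mean width of a set of prediction intervals."""
    y_true, lower, upper = (np.asarray(a, dtype=float)
                            for a in (y_true, lower, upper))
    covered = (y_true >= lower) & (y_true <= upper)
    return {"coverage": float(covered.mean()),
            "mean_width": float((upper - lower).mean())}

# Made-up observations and point forecasts.
y_true = np.array([100.0, 120.0, 90.0, 150.0, 110.0])
point = np.array([105.0, 110.0, 100.0, 130.0, 112.0])

narrow = interval_diagnostics(y_true, point - 15, point + 15)
wide = interval_diagnostics(y_true, point - 200, point + 200)
```

In this toy case the wide intervals cover every observation, but only because they are more than ten times wider than the narrow ones. Reporting coverage without the width hides exactly that trade-off.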

Robustness Analyses

Segmentation: Breakpoint locations were moderately stable; the main conclusion (horizon dependency) held regardless of the specific regime definition.
Training Windows: Sliding windows (e.g., 84 or 112 days) improved ETS performance at medium/long horizons compared to expanding windows, highlighting a trade-off between adaptation speed and estimation stability.
Coverage Stabilization: When restricting the sample to periods where reporting countries were stable ($\ge 180$), the short-horizon ranking shifted slightly (ETS became best at 3 days), but Drift remained the strongest at 7 and 14 days, confirming that simple baselines are robust even after reporting artifacts are minimized.
Target Definition: Using reconstructed incidence (cumulative differences) did not alter the model ranking structure.

5. Significance and Implications

Rejection of "One-Size-Fits-All": The study concludes that there is no single "best" model for epidemic forecasting. Evaluation must be horizon-specific. A model suitable for 1-day situational awareness (e.g., Naive/Drift) may differ from one suitable for 2-week capacity planning (e.g., Drift/ETS).
Value of Simple Baselines: In the presence of structural change and nonstationarity, simple baselines like Drift and Naive are highly competitive and should be treated as serious reference models, not just trivial benchmarks. They effectively capture the dominant trends and local continuity of global aggregates.
Evaluation Frameworks: Public health forecasting must use rolling-origin protocols and explicitly account for reporting changes (coverage expansion) to avoid spurious conclusions.
Probabilistic Forecasting: High coverage rates are insufficient; sharpness (interval width) is critical. Models that produce wide intervals to guarantee coverage may fail to provide actionable insights.

Conclusion: The paper establishes that under structural change, forecast rankings are dynamic and horizon-dependent. Simple trend-extrapolation models (Drift) often outperform complex statistical models in global aggregate data, and rigorous, horizon-specific evaluation is essential for credible benchmarking.