Imagine you are trying to predict the weather for the next week to decide when to wash your car, go to the beach, or stay indoors. You have three different "crystal balls" (models) to help you:
- The Lazy Neighbor (Persistence): This model assumes tomorrow will be exactly like today. If it's sunny now, it guesses it will be sunny tomorrow. It's simple, but often surprisingly accurate because weather tends to stick around for a bit.
- The Classic Statistician (SARIMA): This model uses old-school math to find patterns, like how the weather changes with the seasons. It's smart but sticks to the rules.
- The Super-Computer (XGBoost): This is a modern machine-learning model (gradient-boosted decision trees) that can spot incredibly complex, nonlinear patterns. It's the "smartest"-looking tool in the room.
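To make the "Lazy Neighbor" concrete: the persistence baseline is literally a one-line rule, which is why it makes such a good sanity check. A minimal sketch (the function name and the PM10 numbers are hypothetical, purely for illustration):

```python
def persistence_forecast(history, horizon=1):
    """The 'Lazy Neighbor': forecast every future day
    as a copy of the most recent observation."""
    last = history[-1]
    return [last] * horizon

# Hypothetical daily PM10 readings (illustrative numbers only)
pm10 = [42.0, 38.5, 40.1, 45.2]
print(persistence_forecast(pm10, horizon=3))  # [45.2, 45.2, 45.2]
```

Any model that can't beat this trivial rule isn't adding value, which is exactly the yardstick the paper uses later.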
The Big Mistake: The "One-Time Test"
For years, scientists tested these crystal balls by splitting their data into two piles: a "training" pile (to teach the models) and a "test" pile (to grade them). They did this once.
In this "One-Time Test," the Super-Computer (XGBoost) looked like the undisputed champion. It seemed to beat the Lazy Neighbor and the Classic Statistician every single day for a whole week. Everyone assumed the fancy AI was the best choice for real life.
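The "One-Time Test" is a single fixed split: train on the first chunk of the series, grade once on the last chunk. A minimal sketch, with hypothetical data and a deliberately trivial stand-in model (a real study would fit SARIMA or XGBoost on the training pile):

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between forecasts and reality."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical daily series; the first 80% teaches, the last 20% grades.
series = [40.0, 42.0, 41.0, 43.0, 45.0, 44.0, 46.0, 47.0, 46.5, 48.0]
split = int(len(series) * 0.8)
train, test = series[:split], series[split:]

# Stand-in "model": forecast the training mean for every test day.
forecast = [sum(train) / len(train)] * len(test)
print(mean_absolute_error(test, forecast))
```

The catch the paper identifies: this grades the model on one frozen slice of history, which is not how a live forecasting system is ever used.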
The Reality Check: The "Rolling Test"
The authors of this paper realized there was a flaw in that test. In the real world, you don't get to see the future data before you make a prediction. You have to update your model every day as new data comes in, like a coach adjusting a game plan after every match.
So, they ran a new test called Rolling-Origin Validation. Imagine this:
- You make a prediction for tomorrow using data up to today.
- Tomorrow comes, you get the real data, and you update your model.
- You make a prediction for the day after tomorrow using the new data.
- You repeat this process for months, simulating how a real forecast system would actually work.
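The steps above can be sketched as a simple loop: at each origin day t, the model sees only the data up to t, predicts day t+1, and the origin then rolls forward. (Hypothetical data; the stand-in "model" here is the persistence rule just to keep the sketch self-contained, whereas the paper refits SARIMA and XGBoost at each origin.)

```python
def rolling_origin_errors(series, min_train=3):
    """Rolling-origin validation for a one-step-ahead forecast:
    at each origin t, predict day t+1 using only data up to day t."""
    errors = []
    for t in range(min_train, len(series) - 1):
        history = series[:t + 1]     # everything known up to "today"
        prediction = history[-1]     # persistence stand-in model
        actual = series[t + 1]       # "tomorrow's" real observation
        errors.append(abs(actual - prediction))
    return errors

series = [40.0, 42.0, 41.0, 43.0, 45.0, 44.0]
print(rolling_origin_errors(series))  # [2.0, 1.0]
```

Because every prediction is made before its target value is available, the averaged errors reflect how the system would actually have performed in operation.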
The Shocking Result: The Rankings Flipped!
When they ran this realistic, "rolling" test, the results changed completely. It was like watching a race where the favorite runner suddenly stumbled.
- The Super-Computer (XGBoost) Crashed: In the real-world simulation, the fancy AI actually performed worse than the Lazy Neighbor for the first few days! It was so overconfident in its complex patterns that it got confused by the daily noise. It only started doing well after 5 or 6 days, but by then, the forecast was already too vague to be useful.
- The Classic Statistician (SARIMA) Won: The old-school math model didn't try to be too fancy. It stuck to the reliable patterns and consistently beat the Super-Computer every single day for the whole week.
- The Lazy Neighbor (Persistence): Even the simple "it will be like today" guess was often better than the AI in the short term.
The "Predictability Horizon" (H*)
The authors introduced a new way to measure success called the Predictability Horizon. Think of this as a "Usefulness Timer."
- The Question: "How many days ahead can I trust this model before it becomes no better than just guessing 'it will be like today'?"
- The Old View: The Super-Computer had a timer of 7 days (because it looked good in the one-time test).
- The New View: In the real world, the Super-Computer's timer was broken: for the first several days it couldn't even beat the simple "it will be like today" guess, so its effective horizon was essentially zero. The Classic Statistician, however, had a reliable 7-day timer.
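Under the rolling test, the "Usefulness Timer" can be read off directly: for each lead time h, compare the model's average error to the persistence baseline's, and H* is how many consecutive days (starting from day 1) the model still wins. A sketch with hypothetical per-horizon error values (not the paper's actual numbers):

```python
def predictability_horizon(model_mae, persistence_mae):
    """H*: the number of consecutive lead times (starting at day 1)
    where the model's error stays below the persistence baseline's."""
    h_star = 0
    for m, p in zip(model_mae, persistence_mae):
        if m < p:
            h_star += 1
        else:
            break
    return h_star

# Hypothetical mean absolute errors for days 1..7 ahead (illustrative only).
persistence = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
sarima      = [4.0, 5.0, 6.0, 7.0, 8.0,  9.0, 10.0]  # beats baseline all week
xgboost     = [6.0, 6.5, 7.5, 7.0, 8.0,  9.0, 10.0]  # loses on day 1

print(predictability_horizon(sarima, persistence))   # 7
print(predictability_horizon(xgboost, persistence))  # 0
```

A model that loses to persistence on day 1 gets H* = 0 no matter how well it does later, which is exactly the "broken timer" verdict described above.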
Why This Matters to You
This paper teaches us a very important lesson about technology and science: Just because a model looks amazing in a controlled lab test doesn't mean it will work in the messy real world.
- Don't be fooled by complexity: A fancy, expensive AI isn't always better than a simple, reliable method. Sometimes, "less is more."
- Test like you fly: If you are building a system to predict air pollution (PM10) to warn people about health risks, you can't just test it once. You have to test it the way you will actually use it—updating it daily and seeing if it stays reliable over time.
- The "Lazy" Baseline is a Hero: Always compare your smart models against the simplest possible guess. If your smart model can't beat the "Lazy Neighbor," it's not actually adding any value.
In short: The paper warns us that in the world of forecasting, the "flashiest" tool often loses the race when the starting line moves every day. The reliable, steady hand (SARIMA) wins over the overconfident genius (XGBoost) when the stakes are real.