Real-Time Long Horizon Air Quality Forecasting via Group-Relative Policy Optimization

The Big Problem: The "Crystal Ball" That Cracks

Imagine you are a mayor trying to protect your city from a toxic fog (pollution). You need to know 48 to 120 hours (2 to 5 days) in advance if the air will be safe or deadly.

Currently, the "super-forecasters" (global AI models like Aurora) are like a weatherman who has seen the whole world but never visited your specific town.

The Issue: They are great at general patterns but terrible at local details. In East Asia, where the terrain is complex and pollution is intense, these global models often miss the mark.
The Consequence: They either say "It's going to be a disaster!" when it's actually fine (causing panic and loss of trust), or they say "It's fine!" when a toxic cloud is actually rolling in (putting people's health at risk).

The Solution: FAKER-Air

The researchers built a new system called FAKER-Air. Think of it as training a local expert who knows your neighborhood better than anyone else, using a special two-step training method.

Step 1: The "Local Map" (The Dataset)

Before teaching the AI how to predict, they had to give it the right map.

The Old Map: Global models used data that was 5 days old and averaged out over huge areas. It was like trying to navigate a city using a map of the whole continent.
The New Map (CMAQ-OBS): The team created a brand new, high-definition map specifically for East Asia. They combined real-time ground sensors (like having a weather station on every street corner) with a physics-based simulation (a super-accurate computer model of how wind and pollution move).
The Result: This new map cuts regional error by about 60% compared with the old global data. It's like upgrading from a blurry satellite photo to a live 4K drone feed of your city.

Step 2: The "Two-Stage Training" (The Coach)

Now, they had to teach the AI how to use this map without making the same mistakes as the global models. They used a two-stage coaching method:

Stage 1: The "Drill Sergeant" (Supervised Fine-Tuning)

The Problem: If you teach a student to solve a math problem step-by-step, but only show them the correct answer at every step, they will fail when they have to solve it alone. In AI, this is called "Exposure Bias." If the AI makes a tiny mistake at hour 1, it gets worse and worse by hour 100.
The Fix: The researchers made the AI practice rolling out its own predictions. Instead of just looking at the correct answer, the AI had to predict hour 1, then use its own prediction to guess hour 2, and so on.
The Analogy: It's like teaching a driver not just to look at the road, but to steer the car, then look at where they steered, and then steer again. This stops the AI from falling apart after a few hours.

Stage 2: The "Smart Coach" (Group-Relative Policy Optimization, GRPO)

The Problem: Even with good practice, the AI was still too "nervous." It kept predicting "Disaster!" even when the air was clean. Why? Because the standard training math only measures how far off each number is and treats every error the same. In the real world the costs differ: missing a disaster is worse than a false alarm, but too many false alarms mean nobody listens.
The Fix: They introduced a new way of grading called GRPO.
- How it works: Instead of asking the AI to give one answer, they ask it to generate five different possible futures for the same day.
- The Comparison: The AI then compares these five futures. "Hey, Scenario A said it would be clean, and it was. Scenario B said it would be toxic, but it wasn't. Scenario B was a false alarm!"
- The Reward: The AI gets a "gold star" for the scenarios that matched reality and a "time-out" for the ones that caused false alarms.
- The Analogy: Imagine a coach watching a player practice penalty kicks. Instead of just saying "Good shot," the coach says, "You tried 5 shots. Three went wide (false alarms), one hit the post (missed event), and one went in (correct). Let's focus on hitting the target without wasting energy on shots that clearly miss."
- The Result: The AI learns to be confident but cautious. It stops crying wolf (false alarms) but still screams when a real tiger is coming (severe pollution).

The "Curriculum" (Learning to Walk Before Running)

One final trick they used was Curriculum Rollout.

The Idea: You wouldn't ask a baby to run a marathon on day one.
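The Method: They started by having the AI practice predicting just one step (6 hours) ahead. Once it got good at that, they lengthened the practice runs step by step, to 12 hours and then 24 hours (four steps). The trained model is then rolled forward to produce the full 120-hour forecast.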
Why: This prevents the AI from getting overwhelmed by the complexity of a 5-day forecast before it has mastered the basics.

The Final Scorecard

When they tested FAKER-Air against the global champion (Aurora):

False Alarms: Dropped by 47% compared with the model trained with Stage 1 alone. (The AI stopped crying wolf, so people actually listen when it does warn them.)
Accuracy: Improved significantly, especially for long-term forecasts (2 to 5 days out).
Reliability: It successfully predicted complex pollution events that the global models completely missed, including pollution traveling across borders.

In a Nutshell

The world's best global AI models are like general practitioners who know a little about everything but aren't great at your specific local problem. FAKER-Air is a specialist doctor who:

Has a local map (CMAQ-OBS dataset) specific to East Asia.
Practices self-correction (Temporal Accumulation) so small mistakes don't snowball.
Learns from comparing multiple guesses (GRPO) to understand that "crying wolf" is bad, but "missing a tiger" is worse.

This creates a system that is ready for real-world use, helping governments issue timely, trustworthy warnings to protect public health.

1. Problem Statement

Accurate long-horizon (48–120 hours) forecasting of particulate matter (PM2.5 and PM10) is critical for public health alerts and emission control in East Asia. However, existing solutions face three major challenges:

Regional Inaccuracy of Foundation Models: Global foundation models (e.g., Aurora, Pangu-Weather) trained on global reanalysis data (like ERA5 or CAMS) fail to capture East Asia's complex terrain, strong atmospheric dynamics, and specific emission patterns. They exhibit large systematic biases (e.g., CAMS has an average error of ~52.66 µg/m³ in the region) and often miss severe pollution events.
Operational Latency: Global datasets often suffer from multi-day update delays (e.g., 5 days for CAMS), making them unsuitable for real-time warning systems that require initialization within hours.
Decision-Cost Mismatch: Standard supervised fine-tuning (SFT) optimizes symmetric metrics like Mean Squared Error (MSE). In air quality operations, the costs are asymmetric: missing a severe pollution event (False Negative) endangers public health, while a false alarm (False Positive) erodes public trust. SFT models tend to over-predict to minimize MSE, leading to high False Alarm Rates (FAR).

2. Methodology: FAKER-Air Framework

The authors propose FAKER-Air (Forecast Alignment via Knowledge-guided Expected-Reward), a two-stage training framework designed to achieve both regional accuracy and operational reliability.

A. Dataset Construction: CMAQ–OBS

To address data scarcity and latency, the authors constructed and released a new regional dataset for East Asia (2016–2023):

Observations (OBS): Real-time ground measurements from 532 stations in Korea and 1,290–1,781 stations in China, aggregated to 6-hour intervals.
Reanalysis (CMAQ): High-resolution (27 km) Community Multiscale Air Quality (CMAQ) model outputs tailored to East Asian meteorology and emissions.
Advantage: CMAQ reduces regional error by 59.5% compared to global CAMS data and supports initialization within hours, enabling real-time forecasting.
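As a rough illustration of the observation preprocessing described above, the sketch below averages hourly station readings into 6-hour windows. The source only states that observations are aggregated to 6-hour intervals; the use of a mean, the handling of missing readings, and the name `aggregate_6h` are assumptions made for this example.

```python
from statistics import mean

def aggregate_6h(hourly, window=6):
    """Average consecutive hourly readings into fixed-size windows.

    Missing readings (None) are skipped. A window with no valid readings
    yields None, and a trailing partial window is dropped. Both choices
    are assumptions for this sketch, not taken from the source.
    """
    out = []
    for start in range(0, len(hourly) - window + 1, window):
        valid = [v for v in hourly[start:start + window] if v is not None]
        out.append(mean(valid) if valid else None)
    return out

# Twelve hourly PM2.5 readings (µg/m³) from one hypothetical station.
hourly_pm25 = [12.0, 14.0, None, 16.0, 18.0, 20.0,   # hours 0-5
               30.0, 30.0, 30.0, 30.0, 30.0, 30.0]   # hours 6-11

six_hourly = aggregate_6h(hourly_pm25)
print(six_hourly)
```

A real pipeline would also need quality control and alignment with the CMAQ grid, which this sketch leaves out.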

B. Stage 1: Supervised Fine-Tuning (SFT) with Temporal Accumulation Loss

Base Model: An Aurora-based 3D encoder-decoder architecture.
Temporal Accumulation (TA) Loss: Standard SFT uses "teacher forcing" (feeding ground truth at every step), causing exposure bias where the model fails during auto-regressive inference. The authors introduce a TA loss that supervises multi-step rollouts (up to $T=4$ steps). This penalizes errors along the entire trajectory, forcing the model to learn temporal consistency and reducing error accumulation over long horizons.
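The difference between teacher forcing and a rollout-supervised loss can be seen in a toy scalar example. The sketch below is not the authors' TA loss: the source does not give its formula, so the equal weighting of steps, the squared-error form, and the function names are assumptions.

```python
def teacher_forced_loss(step_fn, x0, targets):
    """Mean squared one-step error; every step restarts from ground truth."""
    inputs = [x0] + list(targets[:-1])
    errors = [(step_fn(x) - y) ** 2 for x, y in zip(inputs, targets)]
    return sum(errors) / len(errors)

def rollout_loss(step_fn, x0, targets):
    """Mean squared error along an autoregressive rollout.

    The model's own output is fed back as the next input, so an early
    error is carried forward and penalized again at later steps.
    """
    state, total = x0, 0.0
    for target in targets:
        state = step_fn(state)           # feed the model its own prediction
        total += (state - target) ** 2   # supervise every step of the rollout
    return total / len(targets)

def biased_step(x):
    """Toy 'model' that overshoots by 10% at each step."""
    return 1.1 * x

truth = [10.0, 10.0, 10.0, 10.0]   # T = 4 steps, constant true concentration

tf = teacher_forced_loss(biased_step, 10.0, truth)
ta = rollout_loss(biased_step, 10.0, truth)
print(f"teacher-forced: {tf:.3f}, rollout: {ta:.3f}")
```

In this toy case the one-step loss stays small while the rollout loss grows with each step, which is the compounding error that trajectory-level supervision is meant to expose during training.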

C. Stage 2: Group-Relative Policy Optimization (GRPO)

To align predictions with operational costs, the framework moves beyond regression to policy optimization:

Objective: Maximize a reward based on the Air Quality Index (AQI) classification rather than raw concentration values.
Class-wise Rewards: The reward function is asymmetric:
- False Alarms (Good/Moderate predicted as Bad): Heavily penalized to reduce FAR.
- Missed Events (Bad/VeryBad predicted as Good): Heavily penalized to maintain recall for severe events.
- True Positives: Rewarded.
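A minimal sketch of such a class-wise reward is shown below. The class thresholds, the reward magnitudes, and the zero reward for correct "no event" forecasts are all illustrative assumptions; the source states only which outcomes are rewarded or penalized. For simplicity, a miss here is any observed Bad/VeryBad case forecast below Bad.

```python
from bisect import bisect_left

CLASSES = ["Good", "Moderate", "Bad", "VeryBad"]
# Assumed PM2.5 upper bounds (µg/m³) for Good, Moderate, and Bad.
# These values are illustrative and are not taken from the source.
PM25_BOUNDS = [15.0, 35.0, 75.0]
BAD = CLASSES.index("Bad")

def aqi_class(conc):
    """Return the index into CLASSES for a concentration."""
    return bisect_left(PM25_BOUNDS, conc)

def class_reward(pred_conc, obs_conc, hit=1.0, false_alarm=-2.0, miss=-2.0):
    """Reward based on AQI class agreement (hypothetical magnitudes)."""
    pred_event = aqi_class(pred_conc) >= BAD
    obs_event = aqi_class(obs_conc) >= BAD
    if pred_event and obs_event:
        return hit            # true positive
    if pred_event:
        return false_alarm    # forecast Bad or worse, observed Good/Moderate
    if obs_event:
        return miss           # observed Bad or worse, forecast Good/Moderate
    return 0.0                # correct "no event" forecast (assumed neutral)

print(class_reward(80.0, 90.0), class_reward(50.0, 10.0), class_reward(10.0, 50.0))
```

Keeping `false_alarm` and `miss` as separate parameters lets the two penalties be tuned independently to match operational costs.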
GRPO Mechanism: Instead of using an absolute reward model, GRPO generates a group of $G$ trajectories (rollouts) for the same input. It ranks these trajectories based on their AQI rewards and updates the policy to increase the likelihood of higher-reward trajectories relative to the group average. This eliminates the need for a separate critic model and stabilizes training.
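The group-relative scoring step can be sketched as follows. The normalization by the group's standard deviation follows the commonly published GRPO formulation; the source itself mentions only the comparison with the group average, so treat that detail as an assumption. The policy-gradient update that consumes these advantages is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to its own group.

    advantage_i = (reward_i - group mean) / (group std + eps)
    Rollouts above the group average get a positive advantage and are
    made more likely; those below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for G = 5 rollouts generated from the same input (made-up values).
rewards = [1.0, -2.0, 0.0, 1.0, -2.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group's own mean, no learned critic is required, and a group in which every rollout earns the same reward contributes no gradient signal.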
Curriculum Rollout: Training starts with short horizons (1 step) and gradually extends to longer horizons (up to 4 steps) to reduce gradient variance and improve long-term credit assignment.
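One simple way to implement such a schedule is a step function of the training epoch. The sketch below is hypothetical: the source gives the start and end horizons (1 and 4 steps) but not the pacing, so `epochs_per_stage` and the linear progression are assumptions.

```python
def rollout_horizon(epoch, epochs_per_stage=5, max_steps=4):
    """Number of rollout steps to train with at a given epoch.

    Starts at 1 step and adds one step every `epochs_per_stage` epochs,
    capped at `max_steps`.
    """
    return min(1 + epoch // epochs_per_stage, max_steps)

schedule = [rollout_horizon(e) for e in range(0, 25, 5)]
print(schedule)
```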

3. Key Contributions

Regional Dataset: Release of the first CMAQ–OBS dataset for East Asia, reducing regional error by 59.5% compared to global baselines and enabling real-time initialization.
Two-Stage Training Framework: A novel combination of SFT with Temporal Accumulation Loss (to fix exposure bias) and GRPO (to align with asymmetric operational costs).
First Application of Policy Optimization in Spatio-Temporal Forecasting: Demonstrates that reinforcement learning techniques (GRPO) can effectively bridge the gap between numerical accuracy and decision reliability in climate forecasting.

4. Experimental Results

Evaluated on 120-hour forecasts for PM2.5 and PM10:

Accuracy Improvement: The FAKER-Air model achieves a 3.5× improvement in F1-score over the Aurora baseline (e.g., PM2.5 F1: 16.06 $\to$ 59.90, roughly a 3.7× gain for that pollutant).
Operational Reliability:
- False Alarm Rate (FAR): Reduced by 47.3% (from 32.86% to 17.32%) compared to the SFT-only baseline.
- Bias: Achieved a near-ideal frequency bias (0.96, where 1.0 means events are forecast as often as they occur), correcting the over-prediction tendency of SFT models.
- CSI (Critical Success Index): Improved significantly, indicating better balance between detecting severe events and avoiding false alarms.
Qualitative Performance: Visualizations show that FAKER-Air maintains coherent spatial structures and transboundary pollution transport patterns up to 96–120 hours, whereas Aurora collapses to uniform predictions or fragmented noise.
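For readers unfamiliar with the categorical scores reported above, the sketch below computes them from contingency-table counts using the standard forecast-verification definitions. It assumes FAR denotes the false alarm ratio (false alarms divided by all event forecasts); the source does not spell out its definition. The counts are made up for illustration.

```python
def categorical_scores(hits, false_alarms, misses):
    """Standard categorical verification scores for a yes/no event."""
    forecast_yes = hits + false_alarms
    observed_yes = hits + misses
    precision = hits / forecast_yes
    recall = hits / observed_yes            # probability of detection
    return {
        "far": false_alarms / forecast_yes,         # false alarm ratio
        "bias": forecast_yes / observed_yes,        # 1.0 is ideal
        "csi": hits / (hits + false_alarms + misses),
        "f1": 2 * precision * recall / (precision + recall),
    }

scores = categorical_scores(hits=80, false_alarms=20, misses=20)
print(scores)

# Relative FAR reduction implied by the reported values (32.86% to 17.32%).
relative_reduction = (32.86 - 17.32) / 32.86
print(f"{100 * relative_reduction:.1f}%")
```

A bias above 1.0 indicates over-forecasting of events and a bias below 1.0 indicates under-forecasting, which is why a value of 0.96 is described as near-ideal.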

5. Significance

This work addresses a critical gap in environmental AI by moving beyond "accuracy at all costs" to decision-grade forecasting.

Public Health Impact: By reducing false alarms while maintaining high recall for severe pollution, the system builds public trust and ensures timely protective actions (e.g., school closures, traffic restrictions) without unnecessary panic.
Methodological Shift: It shows that integrating physics-informed reanalysis (CMAQ) with decision-aware policy optimization (GRPO) is a viable path for solving long-horizon forecasting problems in regions with complex dynamics.
Open Science: The release of the dataset and code provides a new benchmark for regional air quality research, overcoming the limitations of global foundation models in specific high-pollution regions.