EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes

Imagine you are trying to predict when the next big wave will crash on the shore. Earthquake forecasting is a similar problem, and for decades seismologists (earthquake scientists) have tackled it with a specific, well-tested recipe called ETAS. It's like a seasoned chef who knows exactly how ingredients interact: if a big wave hits, smaller waves will follow for a while. It's not perfect, but it's the gold standard.

Recently, a new generation of "AI chefs" arrived, armed with Neural Point Processes (NPPs). These are fancy machine learning models that claim to be more flexible and powerful than the old recipe. They promise to learn complex patterns directly from data without needing a human to write the rules.

But here's the problem: The previous "cooking competitions" used to test these AI chefs were badly flawed. They used old, messy data, left out the biggest waves (the 2011 Tohoku earthquake), and even let the chefs peek at the answers before they started cooking.

Enter "EarthquakeNPP": The New, Fair Cooking Competition.

This paper introduces a new, fair benchmark called EarthquakeNPP. Think of it as a high-stakes cooking show where the judges are the actual earthquake experts, and the ingredients are messy, real-world earthquake data from California spanning 50 years.

Here is what the paper found, explained simply:

1. The Setup: A Fair Fight

The researchers gathered five of the most popular AI models (the "Neural Point Processes") and pitted them against the old-school ETAS model. They used five different datasets, ranging from huge areas of California to specific fault lines, covering everything from tiny tremors to massive quakes.

They tested the models using two types of judging:

The Math Test (Log-Likelihood): How well does the model calculate the probability of an earthquake happening?
The Simulation Test (CSEP): If the model runs 10,000 simulations of the future, do they actually look like what happened in the real world? This is the "real-world" test.

2. The Results: The AI Chefs Lost

The verdict was surprising but clear: None of the AI models beat the old-school ETAS model.

The "Big Wave" Problem: When a massive earthquake happened (like the 2010 El Mayor-Cucapah quake), the AI models got confused. They couldn't predict the swarm of aftershocks that followed. The ETAS model, however, handled these big events beautifully.
- Analogy: Imagine a weather app that is great at predicting sunny days but completely fails when a hurricane hits. The AI models are like that app; they are good at "background noise" but fail when the real drama starts.
The Missing Ingredient: The secret sauce of the ETAS model is that it explicitly knows that bigger earthquakes cause more aftershocks. The AI models were trying to learn this on their own, but they weren't "told" to pay attention to the size of the earthquake. They were like chefs trying to guess the recipe without knowing that salt is the most important ingredient.

3. Why Did the AI Fail?

The paper suggests the AI models are missing three key things:

They ignore the "Size" of the event: They treat a magnitude 3 quake and a magnitude 7 quake too similarly. In reality, a magnitude 7 is a game-changer that triggers a chain reaction.
They have "Short Memories": To save computer power, some of the AI models only look at a short window of recent events (the last 20 earthquakes). But earthquakes can trigger events years later or hundreds of miles away. The old ETAS model remembers everything.
They are bad at "Long-Term Planning": The AI models are great at predicting the next event, but terrible at simulating a whole month of future earthquakes. It's like a GPS that tells you the next turn perfectly but gets lost if you ask it to plan a whole road trip.

4. The Silver Lining

It's not all bad news! The AI models did show promise in "boring" times. When there were no big earthquakes happening, the AI models kept pace with the old model. They are flexible and can learn complex, messy background noise.

The Bottom Line

The paper concludes that while Neural Point Processes are exciting and powerful, they aren't ready for prime time yet. They cannot replace the trusted ETAS model for predicting dangerous earthquakes because they struggle with the biggest, most dangerous events.

What's Next?
The authors aren't saying "AI is useless." They are saying, "We need to build better AI." They suggest future models should:

Be explicitly told to care about earthquake size.
Have longer memories.
Be trained to simulate whole sequences, not just the next event.

EarthquakeNPP is now open for everyone to use. It's a public playground where scientists can come to build better AI models, test them fairly, and hopefully, one day, create a system that can save lives by predicting the next big shake.

Here is a detailed technical summary of the paper "EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes."

1. Problem Statement

The paper addresses the gap between rapid advances in Neural Point Processes (NPPs) in the machine learning community and their practical use for operational earthquake forecasting in the seismology community.

Current Limitations of NPP Benchmarks: Existing benchmarks (e.g., Chen et al., 2021) suffer from severe methodological flaws:
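- Data Leakage: They use non-chronological train-test splits (alternating segments), so some training data postdates the test data. Because earthquakes trigger later earthquakes, models can exploit information from the future of a test segment, which artificially inflates performance.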
- Omission of Critical Events: They exclude major earthquake sequences (e.g., the 2011 Tohoku earthquake), which are the most damaging and crucial for forecasting.
- Incomplete Baselines: They fail to compare NPPs against the state-of-the-art (SOTA) parametric model used in seismology, the Epidemic-Type Aftershock Sequence (ETAS) model.
- Data Bias: They often rely on incomplete catalogs that miss smaller-magnitude events or are affected by changes in data-collection practice over time.
The Core Challenge: While NPPs promise greater flexibility than classical models, it remains unclear if they can outperform ETAS in realistic, operational forecasting scenarios that require handling long-term dependencies, magnitude scaling, and complex spatio-temporal clustering.

2. Methodology: The EarthquakeNPP Platform

The authors introduce EarthquakeNPP, a comprehensive benchmarking platform designed to standardize evaluation and bridge the gap between ML and seismology.

A. Dataset Curation

The platform curates and standardizes public earthquake catalogs from California (1971–2021), covering diverse regions and detection methodologies:

ComCat: USGS global catalog (standard operational data).
SCEDC: Southern California Earthquake Data Center catalog, recorded by the Southern California Seismic Network (high resolution, multiple magnitude thresholds).
White: Enhanced catalog for the San Jacinto Fault Zone (dense network, detecting events down to $M_w\,0.6$).
QTM: Template-matching catalogs (Ross et al., 2019) for San Jacinto and Salton Sea.
Preprocessing: Datasets are truncated at the magnitude of completeness ($M_c$), the threshold above which events are reliably detected, to ensure data quality. Splits are chronological (Burn-in $\to$ Train $\to$ Validation $\to$ Test) to prevent data leakage.
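
As a rough illustration of the preprocessing step, the sketch below truncates a catalog at $M_c$ and cuts it into chronological segments. The function name `truncate_and_split`, its arguments, and the toy values are assumptions made for this example; they are not taken from the platform's actual pipeline.

```python
import numpy as np

def truncate_and_split(times, mags, m_c, t_train, t_val, t_test):
    """Drop events below m_c, sort by time, and split chronologically.

    t_train, t_val, t_test are the (hypothetical) start times of the
    Train, Validation and Test segments; everything before t_train is burn-in.
    """
    times = np.asarray(times, dtype=float)
    mags = np.asarray(mags, dtype=float)

    # Keep only events at or above the magnitude of completeness.
    keep = mags >= m_c
    times, mags = times[keep], mags[keep]

    # Sort chronologically so that every split respects the arrow of time.
    order = np.argsort(times, kind="stable")
    times, mags = times[order], mags[order]

    # Index positions where each segment begins.
    edges = np.searchsorted(times, [t_train, t_val, t_test], side="left")
    names = ["burn_in", "train", "validation", "test"]
    splits = dict(zip(names, np.split(np.arange(times.size), edges)))
    return times, mags, splits

# Toy catalog: times in arbitrary units, one event below m_c = 2.0.
times, mags, splits = truncate_and_split(
    times=[5, 1, 9, 3, 7, 11],
    mags=[2.5, 3.0, 1.0, 2.0, 4.0, 2.2],
    m_c=2.0, t_train=2, t_val=6, t_test=10,
)
for name, idx in splits.items():
    print(name, times[idx])
```

Because the segments are contiguous in time, no training event can occur after a test event, which is the property that alternating-segment splits violate.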

B. Benchmark Models

The study evaluates five distinct NPP architectures against the ETAS model and a homogeneous Poisson process:

NSTPP: A probability density function (PDF) based model that uses Continuous Normalizing Flows (CNFs).
DeepSTPP: A conditional intensity function based model that uses deep latent processes.
AutoSTPP: Jointly models the intensity and its derivative using a dual-network approach.
DSTPP: A generative model using diffusion processes (no explicit likelihood).
SMASH: A generative model using score-matching for pseudolikelihood estimation (no explicit likelihood).

C. Evaluation Protocols

The platform employs two rigorous evaluation frameworks:

Log-Likelihood Metrics: Standard temporal and spatial log-likelihood scores to measure predictive density.
CSEP Consistency Tests: The platform adopts the protocol of the Collaboratory for the Study of Earthquake Predictability (CSEP). This involves:
- Generating 10,000 simulated earthquake sequences for 24-hour forecasts.
- Comparing observed data against the distribution of simulations using four test statistics: Number (Temporal), Spatial, Pseudo-Likelihood, and Magnitude.
- Calculating pass rates and Kolmogorov-Smirnov (KS) statistics to assess calibration.
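
To make the simulation-based protocol more concrete, here is a minimal sketch of a Number test for one forecast day, followed by a pass rate and a KS statistic over several days. It assumes the common catalog-based formulation in which the observed count is compared with the empirical distribution of simulated counts. The function names, the significance level `alpha = 0.05`, and the Poisson toy data are assumptions for illustration, not the benchmark's exact settings.

```python
import numpy as np

def number_test(sim_counts, n_obs, alpha=0.05):
    """Compare an observed event count with simulated counts.

    delta_1: fraction of simulations with at least n_obs events.
    delta_2: fraction of simulations with at most n_obs events.
    The day passes if n_obs is not in either tail (assumed two-sided rule).
    """
    sim_counts = np.asarray(sim_counts)
    delta_1 = float(np.mean(sim_counts >= n_obs))
    delta_2 = float(np.mean(sim_counts <= n_obs))
    passed = bool(min(delta_1, delta_2) >= alpha / 2)
    return delta_1, delta_2, passed

def ks_uniform(scores):
    """One-sample KS statistic of scores against Uniform(0, 1).

    For a well-calibrated forecast, quantile scores collected over many
    days should be roughly uniform, so a smaller statistic is better.
    """
    x = np.sort(np.asarray(scores, dtype=float))
    n = x.size
    upper = np.arange(1, n + 1) / n - x   # ECDF above the diagonal
    lower = x - np.arange(0, n) / n       # ECDF below the diagonal
    return float(max(upper.max(), lower.max()))

# Toy experiment: 30 forecast days, 10,000 simulated counts per day.
rng = np.random.default_rng(0)
observed = rng.poisson(12.0, size=30)
results = [number_test(rng.poisson(12.0, size=10_000), n) for n in observed]

pass_rate = np.mean([passed for _, _, passed in results])
ks = ks_uniform([delta_2 for _, delta_2, _ in results])
print(f"pass rate = {pass_rate:.2f}, KS statistic = {ks:.3f}")
```

The Spatial, Pseudo-Likelihood, and Magnitude tests follow the same pattern with a different statistic computed on each simulated catalog.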

3. Key Results

A. Performance Comparison (Log-Likelihood)

ETAS Dominance: The ETAS model consistently achieved the highest temporal and spatial log-likelihood across all datasets.
NPP Performance: None of the five NPPs outperformed ETAS.
- Temporal: NPPs performed comparably to ETAS during "background" periods (low activity) but struggled significantly during large earthquake sequences (e.g., 2010 El Mayor-Cucapah, 2019 Ridgecrest).
- Spatial: ETAS significantly outperformed NPPs in spatial likelihood, particularly in highly clustered regions near major faults. NPPs performed slightly better in spatially complex or diffuse seismicity but failed to capture the tight clustering of aftershocks.
Magnitude Dependence: The performance gap widened when magnitude thresholds were lowered, suggesting NPPs benefit from small events but lack the explicit magnitude-scaling mechanisms that allow ETAS to model large aftershock cascades effectively.

B. CSEP Consistency Tests

Calibration: ETAS demonstrated the highest pass rates and lowest KS statistics, indicating superior calibration.
Generative Models (DSTPP/SMASH):
- SMASH: Showed moderate performance but exhibited highly variable, "spiky" daily rate forecasts, leading to frequent over- and under-prediction.
- DSTPP: Produced smooth forecasts but systematically underestimated seismicity rates across both background and active periods, resulting in very low pass rates (e.g., 0.5% pass rate on the White dataset).
Operational Feasibility: NSTPP, DeepSTPP, and AutoSTPP could not be evaluated via CSEP tests because their architectures do not support efficient simulation of long event sequences (sampling is prohibitively slow).

C. Computational Efficiency

Training: ETAS training scales as $O(n^2)$ due to summation over all past events, making it slow for large datasets. NPPs (DeepSTPP, AutoSTPP) utilizing sliding windows ($k=20$) trained significantly faster.
Inference/Simulation: ETAS simulation is efficient ($O(n \log n)$) via branching process sampling. In contrast, many NPPs require solving ODEs or running diffusion steps, making real-time operational forecasting infeasible for current implementations.
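
The quadratic training cost can be seen in a small sketch of an ETAS log-likelihood. The version below is a temporal-only simplification with a normalized Omori-type kernel and an exponential productivity law, which is one common textbook parameterization; the benchmark's ETAS model is spatio-temporal. The function name and the parameter names (`mu`, `k`, `alpha`, `c`, `p`) are assumptions for this example.

```python
import numpy as np

def etas_temporal_loglik(times, mags, T, mu, k, alpha, c, p, m_c):
    """Log-likelihood of a temporal ETAS model on the window [0, T].

    Assumed intensity:
        lambda(t) = mu + sum_{t_j < t} k * exp(alpha * (m_j - m_c)) * g(t - t_j)
    with g(s) = (p - 1) * c**(p - 1) * (s + c)**(-p), which integrates to 1.
    Events before time 0 are ignored in this sketch.
    """
    times = np.asarray(times, dtype=float)
    mags = np.asarray(mags, dtype=float)

    # Expected number of direct aftershocks of each event.
    productivity = k * np.exp(alpha * (mags - m_c))

    # Pairwise lags dt[i, j] = t_i - t_j: this n-by-n array is the O(n^2) cost.
    dt = times[:, None] - times[None, :]
    past = dt > 0
    kernel = np.zeros_like(dt)
    kernel[past] = (p - 1) * c ** (p - 1) * (dt[past] + c) ** (-p)

    # Intensity at each event time: background plus contributions of all earlier events.
    lam = mu + kernel @ productivity

    # Integral of the intensity over [0, T] (the compensator), in closed form.
    triggered = productivity * (1.0 - (c / (T - times + c)) ** (p - 1))
    compensator = mu * T + triggered.sum()

    return float(np.log(lam).sum() - compensator)

# Toy usage with made-up parameter values.
ll = etas_temporal_loglik(
    times=[0.5, 1.0, 1.2, 4.0], mags=[4.5, 3.1, 3.0, 3.4],
    T=5.0, mu=0.2, k=0.3, alpha=1.5, c=0.01, p=1.2, m_c=3.0,
)
print(f"log-likelihood = {ll:.3f}")
```

Setting `k = 0` recovers the homogeneous Poisson baseline, whose log-likelihood is $n \log \mu - \mu T$.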

4. Key Contributions

EarthquakeNPP Benchmark: A standardized, open-source platform providing curated datasets, preprocessing pipelines, and evaluation protocols specifically designed for earthquake forecasting.
Rigorous Baseline: The first comprehensive comparison of modern NPPs against the operational SOTA (ETAS) using seismologically accepted metrics (CSEP tests).
Identification of Architectural Gaps: The paper pinpoints specific reasons for NPP underperformance:
- Lack of explicit magnitude dependence (ETAS uses magnitude to scale triggering; NPPs generally do not).
- Limited long-term memory (NPPs truncate history due to computational limits; ETAS uses the full history).
- Mismatch between training objectives (next-event prediction) and operational evaluation (sequence simulation).
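
The first two gaps can be illustrated with a toy calculation. The sketch below uses an ETAS-style triggering term (exponential productivity in magnitude, Omori-type decay in time) and compares the aftershock rate computed from the full history with the rate computed from only the most recent 20 events. The functional form, parameter values, and the synthetic sequence are assumptions chosen for illustration; they are not fitted values from the paper.

```python
import numpy as np

def productivity(m, k, alpha, m_c):
    """Expected number of direct aftershocks of a magnitude-m event (assumed form)."""
    return k * np.exp(alpha * (np.asarray(m, dtype=float) - m_c))

def triggered_rate(t, times, mags, k, alpha, c, p, m_c, window=None):
    """Triggered part of an ETAS-style intensity at time t.

    times must be sorted in increasing order. If window is a positive
    integer, only the last `window` events before t are used, mimicking
    a sliding-window history.
    """
    times = np.asarray(times, dtype=float)
    mags = np.asarray(mags, dtype=float)
    before = times < t
    times, mags = times[before], mags[before]
    if window is not None:
        times, mags = times[-window:], mags[-window:]
    kernel = (p - 1) * c ** (p - 1) * (t - times + c) ** (-p)
    return float(np.sum(productivity(mags, k, alpha, m_c) * kernel))

params = dict(k=0.1, alpha=2.0, c=0.01, p=1.2, m_c=3.0)

# Synthetic sequence: one magnitude 7 mainshock followed by 30 magnitude 3 events.
times = np.concatenate(([0.0], np.linspace(0.1, 3.0, 30)))
mags = np.concatenate(([7.0], np.full(30, 3.0)))

full = triggered_rate(3.5, times, mags, **params)
truncated = triggered_rate(3.5, times, mags, window=20, **params)
ratio = productivity(7.0, 0.1, 2.0, 3.0) / productivity(3.0, 0.1, 2.0, 3.0)

print(f"productivity ratio M7 / M3: {ratio:.0f}")
print(f"triggered rate, full history: {full:.3f}")
print(f"triggered rate, last 20 events: {truncated:.3f}")
```

In this toy setup the mainshock has dropped out of the 20-event window by the evaluation time, so the truncated history loses the single event that contributes most of the triggered rate.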

5. Significance and Future Directions

The paper concludes that current NPP implementations are not yet suitable for practical earthquake forecasting, as they fail to outperform the classical ETAS model in realistic scenarios.

However, the study lays out actionable directions for future research:

Encode Magnitude Dependence: Integrate magnitude-aware mechanisms (e.g., hierarchical encodings) into NPPs.
Scalable Long-Term Memory: Develop architectures (e.g., sparse attention, memory compression) that can process the full event history without truncation.
Align Training and Evaluation: Train generative models using objectives that optimize for long-horizon sequence simulation rather than just next-event prediction.
Hybrid Architectures: Combine the flexibility of neural networks with the physically motivated scaling laws (e.g., power-law kernels) found in ETAS.

Conclusion: EarthquakeNPP serves as a vital tool to foster collaboration between the machine learning and seismology communities, ensuring that future model developments are directly relevant to operational risk reduction and government forecasting agencies.