
Predicting COVID-19 incidence from seroprevalence and population-based cohort data using interpretable machine learning with differential privacy analysis

This study demonstrates that integrating interpretable machine learning with differential privacy on aggregated seroprevalence and cohort data from Germany enables accurate prediction of local COVID-19 incidence and the identification of key behavioral and immunological transmission drivers, offering a valuable complement to routine surveillance for public health decision-making.

Original authors: Krepel, J., Binkyte, R., Kerkouche, R., Harries, M., Klett-Tammen, C. J., Fritz, M., Kesselheim, S., Kuehn, M., Bazarova, A., Lange, B.

Published 2026-04-02




Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to predict the weather. Traditionally, meteorologists look at the thermometer and barometer (the "official" data). But what if you could also ask thousands of people, "Did you feel a chill yesterday?" or "Did you wear a coat?" or "Did you check your temperature?"

This paper is about doing exactly that for the COVID-19 pandemic, but with a high-tech twist.

The Big Idea: The "Community Weather Report"

During the pandemic, governments relied on official case counts (like a thermometer) to decide when to lock down or open up. But official counts are often incomplete: they only capture infections that were actually tested and reported.

The researchers used a massive study called MuSPAD, which is like a giant, ongoing survey of thousands of regular people in Germany. These people gave blood samples (to check for antibodies) and filled out questionnaires about their lives: Did you lose your job? Did you wear a mask at a restaurant? Did you get tested?

The team asked: "Can we use this 'community weather report' to predict where the virus is going next, even before the official numbers catch up?"

The Tools: The Crystal Ball vs. The Time Machine

To answer this, they built several "crystal balls" (machine learning models) to predict the local 7-day incidence rate, the standard measure of how many new cases a region has reported over the past week.

The Snapshot Models (LASSO & MLP): These models look at a single day's data and try to guess the future. It's like looking at a single photo of a storm cloud and guessing if it will rain tomorrow.
The Time-Travel Models (LSTM & VAR): These models are smarter. They don't just look at today; they remember the last week, two weeks, or three weeks. It's like watching a movie of the storm clouds moving, rather than just looking at one frame. They understand that a storm doesn't just appear; it builds up.

The Result: A "Time-Travel" model, the LSTM, performed best. With the survey and antibody data added, it predicted incidence more accurately than models that used only the date or past case numbers.

The Clues: What Actually Predicts the Virus?

The researchers didn't just want a prediction; they wanted to know why the models made those predictions. They used "X-Ray glasses" (Explainable AI) to see which clues mattered most.

Here are the top clues they found:

The "Restaurant Risk" Signal: One of the strongest clues was whether people were wearing masks at restaurants, especially during high-incidence periods. If many people said, "We aren't wearing masks at dinner," the model predicted higher case numbers. It's like seeing a dry forest and predicting a fire.
The "Job Change" Signal: Surprisingly, changes in employment were a huge predictor. When people lost jobs or changed work situations, it often signaled a shift in how the virus was spreading. It's like noticing that the traffic patterns changed, which tells you something about the city's mood.
The "Testing" Signal: The models noticed that when people didn't report their test results, it often meant the virus was spreading quietly. Missing data was actually a loud signal!

The Privacy Shield: The "Blurred Photo" Experiment

Here is the most distinctive part of the paper. The researchers knew that asking people about their health is sensitive. They wanted to make sure no one could figure out who specifically said what.

They used a technique called Differential Privacy. Imagine taking a photo of a crowd and blurring every face just enough so you can't identify anyone, but you can still see if the crowd is angry or happy.

The Trade-off: They tested how much they could blur the faces (add "noise" while training the models) before the crystal ball stopped working.
The Finding: With a moderate blur, the models still worked well, and sometimes even did slightly better on held-out data. Only the heaviest blur (the strictest privacy setting) made the predictions noticeably worse.
The Catch: The "X-Ray glasses" (Explainable AI) got a bit fuzzy. One type of glasses (SHAP) stayed clear enough to read, but the other (LIME) got too blurry to trust when the privacy was too strict.

The Takeaway

This paper suggests that counting sick people is not enough to understand a pandemic. We also need to listen to the "pulse" of the community: what people are doing, where they are working, and how they are behaving.

In simple terms:
If you want to know where the virus is going, don't just look at the hospital reports. Look at the people. Are they wearing masks? Did they lose their jobs? Are they getting tested? If you combine those answers with smart computer models, you can forecast the epidemic more accurately, even if you have to blur the faces to protect people's privacy.

This approach could give public health officials a clearer view of the otherwise invisible spread of the virus and help them make better decisions to keep everyone safe.

1. Problem Statement

During the COVID-19 pandemic, public health surveillance relied heavily on reported case incidence. However, these routine data often lack insight into the underlying behavioral, immunological, and socioeconomic drivers of transmission. While population-based seroprevalence studies (like the MuSPAD study in Germany) offer rich, individual-level data on antibodies, behavior, and exposure, they are rarely utilized to predict population-level disease dynamics.

The study addresses three core challenges:

Prediction Gap: Can aggregated individual-level cohort data (serology + surveys) predict local 7-day COVID-19 incidence rates better than routine time-series data alone?
Interpretability: Which specific behavioral and demographic factors drive these predictions, and can they be identified using Explainable AI (XAI)?
Privacy-Utility Trade-off: How does applying Differential Privacy (DP) to protect individual data affect both the predictive accuracy and the stability of the model's interpretability?

2. Methodology

Data Source and Pre-processing

Dataset: The study used data from the MuSPAD study (Multilocal and Serial Prevalence Study on Antibodies against SARS-CoV-2) in Germany (2020–2022), covering >32,000 participants across eight regions.
Features: Individual-level data included serological measurements (antibodies) and questionnaire responses (household structure, mask-wearing, employment changes, testing history). Daily 7-day incidence rates from the Robert Koch Institute (RKI) served as the target label.
Aggregation: Individual records were aggregated to the daily population level. Categorical variables were one-hot encoded, and missing values were handled via missForest imputation. The feature space was reduced from 704 variables to 77, which expanded to 122 after encoding.
Privacy: The authors implemented Differentially Private Stochastic Gradient Descent (DP-SGD) using Rényi Differential Privacy (RDP) to train models with varying privacy budgets ($\epsilon$).
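The aggregation step described above can be illustrated with a minimal sketch. The record fields (`date`, `seropositive`, `mask_restaurant`) and the sample values are hypothetical, not taken from the MuSPAD data. For simplicity, a missing answer is kept as its own indicator column here, whereas the paper imputes missing values with missForest.

```python
from collections import defaultdict

# Hypothetical individual-level records; field names and values are illustrative only.
records = [
    {"date": "2021-04-01", "seropositive": 1, "mask_restaurant": "no"},
    {"date": "2021-04-01", "seropositive": 0, "mask_restaurant": "yes"},
    {"date": "2021-04-01", "seropositive": 0, "mask_restaurant": None},  # unanswered
    {"date": "2021-04-02", "seropositive": 1, "mask_restaurant": "yes"},
]

MASK_CATEGORIES = ["yes", "no", "missing"]


def encode(record):
    """One-hot encode the categorical answer; keep the numeric field as-is."""
    # Simplification (assumption): a missing answer becomes its own category.
    answer = record["mask_restaurant"] or "missing"
    row = {"seropositive": float(record["seropositive"])}
    for category in MASK_CATEGORIES:
        row[f"mask_restaurant={category}"] = 1.0 if answer == category else 0.0
    return row


def aggregate_daily(records):
    """Average the encoded features over all participants sampled on the same day."""
    by_day = defaultdict(list)
    for record in records:
        by_day[record["date"]].append(encode(record))
    return {
        day: {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}
        for day, rows in by_day.items()
    }


daily_features = aggregate_daily(records)
for day, features in sorted(daily_features.items()):
    print(day, features)
```

Each day then yields a single feature vector of population shares (for example, the share of that day's participants who reported no mask at restaurants), which is the kind of input a model can pair with that day's incidence.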

Machine Learning Models

The authors evaluated two categories of models:

Time-Agnostic Models: Treat each day's features independently.
- LASSO: Regularized linear regression for feature selection and interpretability.
- MLP (Multilayer Perceptron): A fully connected neural network (4 hidden layers).
Time-Aware Models: Explicitly leverage temporal dependencies.
- VAR (Vector Autoregression): Sparse estimation with hierarchical lagged structures.
- LSTM (Long Short-Term Memory): A recurrent deep learning model whose hidden states persist across time steps, allowing it to capture long-term temporal dependencies.
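The difference between the two model categories comes down to how training samples are built from the daily feature vectors. The sketch below shows one plausible construction. The function names and the exact alignment of window and target are assumptions made for illustration, not the authors' pipeline.

```python
def make_snapshot_samples(features, target):
    """Time-agnostic view: each day's feature vector is paired with that day's target."""
    return [(features[t], target[t]) for t in range(len(features))]


def make_window_samples(features, target, lag):
    """Time-aware view: the last `lag` days of features (ending at day t) predict target[t]."""
    samples = []
    for t in range(lag - 1, len(features)):
        window = features[t - lag + 1 : t + 1]  # days t-lag+1 .. t, inclusive
        samples.append((window, target[t]))
    return samples


# Hypothetical toy data: one aggregated feature per day and a made-up incidence series.
features = [[0.1], [0.2], [0.3], [0.4], [0.5]]
target = [10, 12, 15, 19, 24]

snapshots = make_snapshot_samples(features, target)
windows = make_window_samples(features, target, lag=3)
print(len(snapshots), len(windows))
```

With a lag of 7, 14, or 21 days the window view corresponds to the "last week, two weeks, or three weeks" of history that the time-aware models can draw on. The cost is that the first `lag - 1` days cannot form a complete window, so fewer training samples remain.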

Explainability (XAI)

To interpret "black-box" models and validate linear ones, the study employed:

Regression Coefficients: For LASSO and VAR.
LIME (Local Interpretable Model-agnostic Explanations): Fits local surrogate models.
SHAP (SHapley Additive exPlanations): Uses game-theoretic values for global feature attribution.
Clustering: Data was clustered into high- and low-incidence groups to analyze feature importance within specific epidemiological regimes.
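To make the game-theoretic idea behind SHAP concrete, the sketch below computes exact Shapley values for a tiny model by averaging each feature's marginal contribution over all feature orderings. This is a brute-force illustration against a single baseline point, not the approximation algorithms of the SHAP library, and the toy model and its weights are invented for the example.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings.

    Features that have not yet been "revealed" are held at their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]  # reveal feature i
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [value / len(orderings) for value in phi]


def toy_model(z):
    """Hypothetical incidence model with one interaction term (weights are made up)."""
    no_mask, job_change, missing_pcr = z
    return 5.0 + 3.0 * no_mask + 2.0 * job_change + 4.0 * no_mask * missing_pcr


x = [1.0, 0.5, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction at `x` and at the baseline, and the interaction term is split evenly between the two features involved. Enumerating all orderings scales factorially, which is why practical tools rely on sampling or model-specific shortcuts when there are 122 features.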

3. Key Results

Predictive Performance

Superiority of Cohort Data: Incorporating MuSPAD features significantly improved prediction accuracy compared to baselines using only time or past incidence.
Best Performer: The LSTM model with MuSPAD features achieved the lowest test error (RMSE: 4.36, SMAPE: 0.37), outperforming both time-agnostic models and time-aware baselines. It successfully captured major waves (e.g., April 2021) that baselines missed.
Time-Agnostic Performance: LASSO and MLP also showed strong performance. LASSO slightly outperformed MLP on the test set, which the authors attribute to stronger regularization against overfitting on the limited dataset.
VAR Performance: VAR models showed lag-dependent performance; models with longer lags ($p = 14, 21$) performed better than those with a short lag ($p = 7$).
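For readers unfamiliar with the two error metrics reported above, here is a minimal sketch of how they are commonly computed. SMAPE has several conventions; the variant below (mean absolute error divided by the mean of the absolute true and predicted values, on a 0 to 2 scale) is an assumption, since this summary does not state which one the paper uses. The sample numbers are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error on a 0-2 scale (one common variant)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denominator = (abs(t) + abs(p)) / 2
        # Convention (assumption): a pair where both values are zero contributes no error.
        total += abs(t - p) / denominator if denominator else 0.0
    return total / len(y_true)


# Hypothetical incidence values and predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]
print(rmse(y_true, y_pred), smape(y_true, y_pred))
```

RMSE penalizes large misses during waves more heavily, while SMAPE is scale-free and so weights errors in low-incidence periods comparably to those in high-incidence periods. Reporting both gives a more balanced picture of model quality.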

Key Predictors (Interpretability)

Across all models and explainability methods, the following factors were consistently identified as top drivers:

Testing History: "Tested PCR positive," "Serology status Infected," and "Missing PCR information" were critical. Notably, missing reporting data correlated with higher predicted incidence, suggesting reporting behavior itself is a signal.
Mask-Wearing: "No mask at restaurant" was a strong positive predictor of incidence in high-incidence clusters. Conversely, general mask-wearing variables often showed complex patterns, likely reflecting behavioral responses to rising cases rather than direct risk.
Employment: Changes in employment status (e.g., job loss, leave) were significant predictors, acting as proxies for socioeconomic disruption and Non-Pharmaceutical Intervention (NPI) impacts.
Immunity:
- Non-temporal models (LASSO/MLP): Antibody presence generally showed a negative association with incidence (protective effect).
- Temporal models (VAR/LSTM): Immunity variables often showed positive associations or lower importance, suggesting these models captured the correlation between past high incidence (leading to immunity) and current incidence, rather than a direct protective causal link.

Differential Privacy Analysis

Performance Impact: As the privacy budget $\epsilon$ decreased (stricter privacy), training RMSE increased monotonically. However, moderate privacy budgets ($\epsilon = 4, 8$) showed a regularization effect, sometimes yielding validation errors lower than the non-private baseline. Severe privacy constraints ($\epsilon = 1$) caused significant performance degradation.
Interpretability Stability:
- SHAP: Feature importance values remained relatively stable across privacy budgets. The global averaging mechanism of SHAP smoothed out DP-induced noise.
- LIME: Feature importance became increasingly unstable and declined as privacy noise increased. LIME's reliance on local perturbations made it highly sensitive to the noise injected by DP.
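The noise discussed in this section originates in the DP-SGD training step. A minimal sketch of one such step is shown below: each example's gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. The function name, parameter names, and numbers are illustrative assumptions. Converting a noise multiplier into a concrete $\epsilon$ requires a privacy accountant (Rényi DP in the paper), which is not implemented here.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    dim = len(weights)
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(v * v for v in grad))
        # Scale the gradient down only if its L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += grad[j] * scale
    batch_size = len(per_example_grads)
    sigma = noise_multiplier * clip_norm  # noise std is tied to the clipping bound
    noisy_mean = [(summed[j] + rng.gauss(0.0, sigma)) / batch_size for j in range(dim)]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


# Hypothetical per-example gradients for a two-parameter model.
rng = random.Random(0)
weights = [0.0, 0.0]
grads = [[3.0, 4.0], [0.3, 0.4]]
updated = dp_sgd_step(weights, grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(updated)
```

A stricter privacy budget (smaller $\epsilon$) corresponds to a larger noise multiplier for the same number of training steps. Clipping bounds how much any single record can move the model, which is also a plausible source of the mild regularization effect seen at moderate budgets.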

4. Key Contributions

Novel Data Integration: Demonstrated that aggregated seroprevalence cohort data, typically used for retrospective analysis, can effectively predict future incidence rates when combined with ML.
Behavioral Signal Extraction: Identified that behavioral signals (masking, testing habits, employment changes) and reporting completeness are as predictive as serological status, filling a gap left by routine surveillance.
Privacy-Interpretability Framework: Provided a systematic analysis of how DP affects XAI. The finding that SHAP is more robust to DP noise than LIME offers a critical guideline for deploying privacy-preserving models in digital epidemiology.
Model Comparison: Highlighted the distinct ways time-aware vs. time-agnostic models interpret immunity, cautioning against the misinterpretation of correlations in temporal models.

5. Significance and Implications

Public Health Surveillance: This study validates the utility of integrating population-based cohort surveys into routine surveillance systems. It suggests that "soft" data (behavior, testing compliance) can provide early warnings or better context for hard incidence data.
Policy Design: The identification of specific drivers (e.g., restaurant masking, employment shifts) offers actionable insights for targeted interventions.
Ethical AI: The work indicates that high-utility predictive models can be trained on sensitive health data with strong privacy guarantees (DP) without completely sacrificing interpretability, provided the right XAI methods (like SHAP) are chosen.
Future Directions: The authors suggest extending this framework to include spatial information and exploring other privacy-preserving techniques to further enhance digital epidemiology.

Limitations: The study relies on cross-sectional sampling rather than longitudinal follow-up of individuals, which limits the disentanglement of cohort vs. temporal effects. Additionally, the findings are predictive correlations, not causal proofs.