A bootstrap particle filter for viral Rt inference and forecasting using wastewater data

This paper presents a lightweight, statistically rigorous bootstrap particle filter framework that integrates wastewater, case incidence, and serological data within a state-space model to accurately infer and forecast time-varying effective reproduction numbers (Rt) while overcoming challenges related to missing data, irregular sampling, and parameter unidentifiability.

Original authors: Xiao, W. F., Wang, Y., Goel, N., Wolfe, M., Koelle, K.

Published 2026-03-06

CC BY 4.0

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to figure out how fast a fire is spreading through a forest, but you can't see the trees. You only have two clues:

Smoke signals: People calling in to say they see smoke (this is like case data).
Ash in the river: You find ash floating downstream from the forest (this is like wastewater data).

The authors of this paper (Xiao, Wang, Goel, Wolfe, and Koelle) have built a new "detective tool" (a Bootstrap Particle Filter) to figure out the speed of the fire (the Effective Reproduction Number, or $R_t$) using these clues.

Here is the breakdown of their work in simple terms:

1. The Problem: Missing Pieces and Foggy Clues

Scientists have been using wastewater to track viruses (like SARS-CoV-2) because it's like a "community nose" that smells the virus before people get sick enough to go to the doctor. However, turning that smell into a number (how fast the virus is spreading) is hard.

The Missing Data Problem: Sometimes the river is dry, or the phone lines are down. Old methods often had to guess (impute) the missing numbers, which can be messy.
The Foggy Clue Problem: If you only look at the ash in the river, you don't know if there is a tiny fire with a lot of ash, or a huge fire with very little ash. You can't tell the difference. This is called "unidentifiability."

2. The Solution: A "Guess-and-Check" Simulator

The authors created a digital simulation of the virus spreading. To solve the mystery, they use a method called a Bootstrap Particle Filter.

The Analogy: The Army of Explorers
Imagine you have 1,000 tiny explorers (particles). Each explorer has a slightly different theory about how the fire is spreading:

Explorer #1 thinks the fire is small but spreading fast.
Explorer #2 thinks the fire is huge but spreading slowly.
Explorer #3 thinks the wind is blowing the ash in a weird direction.

Every time a new piece of data comes in (a new wastewater sample or a new case report), the explorers check their theories against the new evidence.

If an explorer's theory matches the new data, they get a "high score" (weight).
If an explorer's theory is way off, they get a "low score."

The computer then drops most of the low-scoring explorers and makes copies of the high-scoring ones (this step is called resampling). Over time, the army of explorers converges on the most likely reality. When a data point is missing, the explorers simply keep moving forward under their own theories until the next clue arrives, so nobody has to invent (impute) the missing numbers.

3. The Big Discovery: The "Wind" in the River

When they tested this on real data from Zurich, Switzerland, they hit a snag. The wastewater data was too "noisy." It jumped up and down wildly from day to day, even when the number of sick people wasn't changing that much.

The Analogy: The Rainy Day
They realized the river wasn't just carrying ash; it was being buffeted by the wind and rain. Sometimes it rained, washing out more ash than usual. Sometimes it was dry, and the ash settled.

The Fix: They added a "Wind Factor" (environmental noise) to their model. This allowed the model to say, "Hey, the ash concentration jumped today, but it's probably just because it rained, not because the fire suddenly got 100 times bigger."
The Result: Once they accounted for the "wind," the model could finally see the true shape of the fire (the virus spread) clearly.

4. The Final Piece: The Serology Puzzle

Even with the wind factor, there was still a mystery. The model could tell them how fast the virus was spreading, but it couldn't tell them how many people were actually infected, because it couldn't separate the true number of infections from the fraction of them that got reported.

The Analogy: It's like knowing the fire is spreading at 5 mph, but not knowing if the forest has 10 trees or 10,000 trees.

To solve this, they brought in Serological Data (blood tests that show who has ever been infected).

The Analogy: Imagine finding a map of the forest that shows exactly how many trees were burned last month.
The Result: By comparing their "explorers" to this map, they could finally estimate the real numbers: how many people were actually infected, and how many of those never showed up in the case counts (unreported cases).

5. Why This Matters

This tool is like a crystal ball for public health.

It's Fast: It runs in seconds.
It's Flexible: It works whether you have perfect data or messy, missing data.
It Predicts: Once the model understands the current situation, it can look 10 days into the future and say, for example, "If things stay this way, we expect between 50 and 150 new cases per day over the next 10 days."

In a nutshell: The authors built a smart, adaptable computer program that combines wastewater data, case reports, and blood tests to see the invisible spread of viruses. They figured out how to ignore the "noise" (like rain washing away ash) so public health officials can make better decisions to stop the fire before it burns out of control.

1. Problem Statement

Wastewater-based epidemiology (WBE) has emerged as a critical tool for infectious disease surveillance, yet existing statistical methods for inferring epidemiological dynamics from wastewater data face several limitations:

Data Integration: Few approaches allow for the systematic integration of multiple data streams (e.g., combining wastewater data with case incidence or serological data).
Data Handling: Many existing methods require data imputation to handle missing values and struggle with datasets containing irregular sampling intervals.
Computational Efficiency: Some methods are computationally intensive, limiting their utility for real-time forecasting.
Parameter Identifiability: There is a fundamental challenge in reconstructing underlying infection dynamics (e.g., true infection prevalence) and estimating specific parameters (like reporting rates or shedding loads) solely from wastewater or case data due to structural unidentifiability.

The authors aim to develop a statistically rigorous, lightweight approach to infer and forecast the time-varying effective reproduction number ($R_t$) using wastewater data, either alone or in combination with other data streams, while addressing missing data and parameter identifiability issues.

2. Methodology

The authors propose a State-Space Modeling framework coupled with a Bootstrap Particle Filter.

A. The Process Model (Semi-Mechanistic)

The model simulates the dynamics of infected individuals, cumulative case incidence, and wastewater virus concentrations.

Compartmentalization: Infected individuals are partitioned into $n$ sequential compartments ($I_0$ to $I_{n-1}$). Individuals move from each compartment to the next at rate $n\nu$, so the total time spent across all compartments is Erlang-distributed with mean $1/\nu$, and an individual's compartment index acts as a proxy for time since infection.
Dynamics:
- Infection Incidence ($\Psi(t)$): Driven by the time-varying effective reproduction number ($R_t$) and an infectivity profile ($f_i$).
- Case Incidence ($C(t)$): Modeled as a function of the reporting rate ($\rho$) and a case detection profile ($c_i$).
- Wastewater Concentration ($W(t)$): Modeled as a function of the shedding load scaling constant ($\lambda$), a shedding profile ($\omega_i$), and a viral decay/outflow rate ($\delta$).
$R_t$ Evolution: $R_t$ is modeled as a separate state variable evolving via Brownian motion ($dR_t = \sigma_B \, dB(t)$), allowing for stochastic fluctuations over time.
Environmental Noise: To address high variability in wastewater data, the authors introduced a stochastic component to the viral outflow rate ($\delta$), adding an environmental noise term ($\epsilon$) to the wastewater differential equation.
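
The process model above can be sketched as a simple Euler-Maruyama simulation. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not give the exact equations, so the form of $\Psi(t)$, the multiplicative outflow noise, the flooring of $R_t$ and $W$ at zero, and all names (`step`, `simulate`, `sigma_E`) and parameter values are hypothetical.

```python
import math
import random

def step(state, p, dt, rng):
    """Advance the sketch model by one Euler-Maruyama step of length dt."""
    I, R, W = state["I"], state["R"], state["W"]
    n = len(I)
    rate = n * p["nu"]  # exit rate n*nu from each infected compartment

    # ASSUMED form: incidence scales with R_t and an infectivity profile f_i
    # (weights summing to 1), so that R_t = 1 keeps a flat epidemic flat.
    psi = R * rate * sum(f * x for f, x in zip(p["f"], I))

    # Linear chain: I_0 is fed by new infections, I_i by I_{i-1}.
    new_I = [I[0] + (psi - rate * I[0]) * dt]
    for i in range(1, n):
        new_I.append(I[i] + rate * (I[i - 1] - I[i]) * dt)

    # dR_t = sigma_B dB(t); the floor at zero is an assumption of this sketch.
    new_R = max(0.0, R + p["sigma_B"] * math.sqrt(dt) * rng.gauss(0.0, 1.0))

    # Wastewater: shedding in, decay/outflow out. The outflow rate delta is
    # perturbed by a noise term eps (ASSUMED multiplicative form).
    shed = p["lam"] * sum(w * x for w, x in zip(p["omega"], I))
    eps = p["sigma_E"] * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    new_W = max(0.0, W + shed * dt - W * (p["delta"] * dt + eps))

    return {"I": new_I, "R": new_R, "W": new_W}

def simulate(state, p, dt, n_steps, seed=0):
    """Return the list of states visited, starting from `state`."""
    rng = random.Random(seed)
    path = [state]
    for _ in range(n_steps):
        path.append(step(path[-1], p, dt, rng))
    return path

# Illustrative values only; none are taken from the paper.
params = {"nu": 0.2, "f": [0.25] * 4, "omega": [0.25] * 4,
          "lam": 1.0, "delta": 0.5, "sigma_B": 0.05, "sigma_E": 0.1}
start = {"I": [10.0] * 4, "R": 1.0, "W": 0.0}
path = simulate(start, params, dt=0.1, n_steps=300)
```

Setting `sigma_E` to zero in this sketch removes the environmental noise, which makes it easy to compare how jagged the simulated wastewater signal is with and without it.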

B. The Observation Models

Case Data: Modeled using a Negative Binomial distribution to account for overdispersion in reported cases.
Wastewater Data: Modeled using a Gamma distribution, which restricts measured virus concentrations to positive, continuous values and accommodates right-skewed measurement error.
Missing Data: The particle filter naturally handles missing data points by skipping likelihood calculations for those specific time steps, eliminating the need for imputation.
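
A minimal sketch of how these observation models could be combined into one per-time-step log-likelihood is shown below. The mean/dispersion and mean/shape parameterizations, the default values of `r` and `shape`, and the function names are assumptions made for illustration; the summary does not state how the authors parameterize either distribution. A missing observation is represented by `None` and simply contributes nothing.

```python
import math

def nbinom_logpmf(k, mean, r):
    """Log PMF of a negative binomial with the given mean and dispersion r
    (variance = mean + mean**2 / r)."""
    mean = max(mean, 1e-9)  # guard against log(0) when the prediction is zero
    p = r / (r + mean)
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log1p(-p))

def gamma_logpdf(x, mean, shape):
    """Log density of a Gamma with the given mean and shape
    (scale = mean / shape). Requires x > 0 and mean > 0."""
    scale = mean / shape
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def observation_loglik(obs_cases, obs_ww, pred_cases, pred_ww,
                       r=10.0, shape=5.0):
    """Sum the log-likelihoods of whichever data streams were observed.
    A stream passed as None (missing) is skipped, so no imputation is needed."""
    total = 0.0
    if obs_cases is not None:
        total += nbinom_logpmf(obs_cases, pred_cases, r)
    if obs_ww is not None:
        total += gamma_logpdf(obs_ww, pred_ww, shape)
    return total

# A day with a case count but no wastewater sample:
ll_cases_only = observation_loglik(42, None, pred_cases=40.0, pred_ww=3.0)
```

Because the two streams enter as independent additive terms, the same function covers days with cases only, wastewater only, both, or neither.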

C. The Bootstrap Particle Filter

Mechanism: The filter maintains a set of "particles" (simulations of the state space model). At each observation time, particles are weighted based on the likelihood of the observed data given the particle's predicted state.
Resampling: Particles are resampled with replacement based on their weights to focus computational effort on high-probability trajectories.
Inference: The marginal log-likelihood is calculated to evaluate parameter sets. The filter reconstructs latent variables (like $R_t$ and infection prevalence) by drawing from the final set of particles.
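
The weighting, resampling, and marginal-likelihood steps described above can be sketched with a generic bootstrap particle filter. To keep the example short and self-contained it is applied to a toy problem (a scalar random walk observed with Gaussian noise), not to the epidemic model; the function names, the toy model, and all numeric values are assumptions for illustration only.

```python
import math
import random

def bootstrap_filter(observations, n_particles, init, propagate, loglik, rng):
    """Generic bootstrap particle filter for a scalar state.
    Returns (marginal log-likelihood, list of filtered means).
    Observations equal to None are treated as missing."""
    particles = [init(rng) for _ in range(n_particles)]
    log_marginal = 0.0
    means = []
    for y in observations:
        # 1. Propagate every particle through the process model.
        particles = [propagate(x, rng) for x in particles]
        if y is not None:
            # 2. Weight by the likelihood of the observation (log-sum-exp trick).
            logw = [loglik(y, x) for x in particles]
            m = max(logw)
            w = [math.exp(lw - m) for lw in logw]
            log_marginal += m + math.log(sum(w) / n_particles)
            # 3. Resample with replacement in proportion to the weights.
            particles = rng.choices(particles, weights=w, k=n_particles)
        # Missing data: no weighting or resampling, particles just move on.
        means.append(sum(particles) / n_particles)
    return log_marginal, means

def normal_logpdf(y, mu, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((y - mu) / sd) ** 2

# Toy problem: latent random walk, noisy observations, two missing days.
rng = random.Random(1)
obs = [0.1, None, 0.4, 0.3, None, 0.9]
log_marginal, means = bootstrap_filter(
    obs, 500,
    init=lambda rng: rng.gauss(0.0, 1.0),
    propagate=lambda x, rng: x + rng.gauss(0.0, 0.2),
    loglik=lambda y, x: normal_logpdf(y, x, 0.3),
    rng=rng,
)
```

In a setup like the one described in the text, `propagate` would be the process model, `loglik` the observation model, and `log_marginal` the quantity compared across candidate parameter sets.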

D. Data Sources

Mock Datasets: Simulated data with known ground-truth parameters to test the filter's ability to recover $R_t$ and identify parameters.
Real-World Data: SARS-CoV-2 data from Zurich, Switzerland (Sept 2020 – Jan 2021), including daily case incidence and twice-weekly wastewater RNA loads (N1 and N2 assays).
Serological Data: Seroprevalence estimates from blood donors in Zurich used to resolve parameter unidentifiability.

3. Key Contributions

Unified Framework: Development of a lightweight state-space model that jointly infers $R_t$ from wastewater, case, and serological data without requiring data imputation.
Handling Environmental Stochasticity: Demonstration that incorporating environmental noise into the viral outflow process significantly improves model fit for wastewater data, which often exhibits high-frequency variability not explained by transmission dynamics alone.
Solving Identifiability: Showing that while $R_t$ can be estimated from single data streams, underlying infection dynamics and parameters (reporting rate $\rho$, shedding constant $\lambda$) are structurally unidentifiable. The authors demonstrate that serological data can break this unidentifiability, allowing for the reconstruction of true infection dynamics.
Forecasting Capability: The method enables short-term forecasting (10-day horizon) of $R_t$, case incidence, and wastewater concentrations.

4. Results

Mock Dataset Analysis

Single Data Streams: When using only case data, the reporting rate ($\rho$) was unidentifiable (a ridge of likelihoods existed across different $\rho$ values), yet $R_t$ was accurately reconstructed. Similarly, using only wastewater data, the shedding constant ($\lambda$) was unidentifiable, but $R_t$ was recovered.
Joint Data Streams: Jointly fitting case and wastewater data constrained the parameter space (only specific combinations of high $\rho$/high $\lambda$ or low $\rho$/low $\lambda$ yielded high likelihoods) but did not fully resolve the unidentifiability of the individual parameters or the underlying infection prevalence.
Brownian Motion Strength ($\sigma_B$): An optimal intermediate value for $\sigma_B$ was required to capture the amplitude of $R_t$ fluctuations without introducing excessive jaggedness.
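
One way to see where the likelihood ridge comes from is a simplified scaling argument. Assume (this is a simplification, not an equation from the summary) that expected case counts are proportional to $\rho$ times a latent infection signal $X(t)$, and expected wastewater concentrations are proportional to $\lambda$ times the same signal:

$$
\mathbb{E}[C(t)] \propto \rho \, X(t), \qquad \mathbb{E}[W(t)] \propto \lambda \, X(t)
$$

For any constant $k > 0$, rescaling the infection signal and both parameters together leaves the two expectations unchanged:

$$
(\rho,\ \lambda,\ X) \;\to\; (\rho / k,\ \lambda / k,\ k X) \quad\Rightarrow\quad \rho X \ \text{and} \ \lambda X \ \text{are unchanged}
$$

Under this assumption, case data alone cannot separate $\rho$ from the size of $X$, and wastewater data alone cannot separate $\lambda$ from it. Fitting both streams constrains only the ratio $\rho / \lambda$, which matches the ridge running from low $\rho$/low $\lambda$ to high $\rho$/high $\lambda$. Multiplying $X(t)$ by a constant does not change its growth rate, which is why $R_t$ can still be recovered.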

Zurich SARS-CoV-2 Analysis

Environmental Noise: Initial models without environmental noise failed to capture the high-frequency variability in wastewater data, resulting in unrealistic $R_t$ estimates (excessively jagged). Adding the environmental noise term ($\epsilon$) to the viral outflow rate improved the fit and aligned $R_t$ estimates with those derived from case data.
Parameter Identifiability: Joint fitting of case and wastewater data still resulted in a ridge of high-likelihood parameter combinations for $\rho$ and $\lambda$.
Role of Serology: By incorporating seroprevalence changes (increase of ~4.3% between Sept and Dec 2020) as a constraint, the authors successfully identified the specific parameter combination: $\rho \approx 0.28$ (28% reporting rate) and $\lambda \approx 3.1 \times 10^{10}$ copies/mL. This resolved the underlying infection dynamics.
Forecasting: The model successfully forecasted $R_t$, daily cases, and wastewater concentrations for a 10-day period. The observed case counts fell within the forecasted bounds, demonstrating the utility of the approach for public health planning.
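
A common way to forecast from a particle filter, sketched below, is to take the final particle set and keep propagating it through the process model with no further weighting, then summarize simulated observations with quantiles. Whether the authors do exactly this is not stated in the summary. The toy growth rule, the 5-day generation time, the noise model, and all names here are hypothetical stand-ins for the real process and observation models.

```python
import math
import random

def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list."""
    return sorted_values[round(q * (len(sorted_values) - 1))]

def forecast(particles, propagate, observe, horizon, rng):
    """Propagate particles `horizon` steps ahead without reweighting and
    return one (lower, median, upper) band of simulated observations per step."""
    bands = []
    current = list(particles)
    for _ in range(horizon):
        current = [propagate(x, rng) for x in current]
        draws = sorted(observe(x, rng) for x in current)
        bands.append((quantile(draws, 0.025),
                      quantile(draws, 0.5),
                      quantile(draws, 0.975)))
    return bands

# Hypothetical toy model: each particle is (R_t, expected daily cases).
GENERATION_TIME = 5.0  # days; illustrative value only

def toy_propagate(x, rng):
    r, mu = x
    r = max(0.0, r + 0.05 * rng.gauss(0.0, 1.0))     # R_t random walk
    mu = mu * math.exp((r - 1.0) / GENERATION_TIME)  # crude growth rule
    return (r, mu)

def toy_observe(x, rng):
    _, mu = x
    return max(0.0, rng.gauss(mu, math.sqrt(mu)))    # noisy case count

rng = random.Random(7)
final_particles = [(rng.gauss(1.0, 0.1), 100.0) for _ in range(1000)]
bands = forecast(final_particles, toy_propagate, toy_observe, horizon=10, rng=rng)
```

Because uncertainty in $R_t$ compounds as particles move forward, the bands from a sketch like this widen with the forecast horizon.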

5. Significance and Implications

Public Health Utility: The approach provides a robust, computationally efficient tool for public health practitioners to monitor disease transmission and forecast trends using wastewater data, even when case reporting is incomplete or irregular.
Methodological Advancement: By moving away from discrete-time convolution methods (which struggle with missing data) to a particle filter framework, the authors offer a more flexible solution that can integrate heterogeneous data sources.
Noise Modeling: The study highlights the critical importance of distinguishing between transmission-driven noise (modeled by $R_t$ ) and environmental noise (modeled by wastewater outflow variability) to avoid misinterpreting wastewater fluctuations as changes in transmission dynamics.
Data Integration: The work underscores that while wastewater data is powerful, combining it with serological data is often necessary to fully reconstruct the "true" state of an epidemic, particularly for estimating reporting rates and total infection prevalence.
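
As a rough illustration of why serology helps, consider a back-of-envelope relation. It assumes, as a simplification not taken from the paper, that the rise in seroprevalence $\Delta S$ over a time window multiplied by the population size $N$ approximates the number of new infections in that window, ignoring seroconversion delays and antibody waning:

$$
\rho \;\approx\; \frac{\text{reported cases in the window}}{\Delta S \cdot N}
$$

Here $\Delta S$ and $N$ are symbols introduced for this sketch. Once the absolute number of infections is anchored in this way, $\rho$ is fixed, and the ridge of jointly plausible $\rho$ and $\lambda$ values collapses to a single combination.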

Limitations: The model assumes a constant shedding profile and does not account for demographic stochasticity or variant-specific differences in shedding. It also relies on Brownian motion for $R_t$ evolution, which may not capture abrupt changes caused by immediate policy interventions.