
Uncovering identifiability of epidemiological models: basic reproduction number and complementary data streams

This study shows that even when individual parameters in an epidemiological model are not uniquely identifiable, the basic reproduction number often is. It also shows that a minimal amount of complementary data can make an otherwise non-identifiable model globally identifiable. Together, these results shift the focus of public health surveillance from complete model identifiability to the identifiability of decision-relevant quantities.

Original authors: Pant, B., Saucedo, O., Pogudin, G.

Published 2026-01-25


CC BY 4.0


Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: How fast is a disease spreading, and how dangerous is it?

To solve this, you build a mathematical "simulation" of the disease. You feed it data you can see, like the number of people reporting to the hospital each day. But here's the catch: the data you see is just the tip of the iceberg. You can't see everyone who is sick, how many are immune, or exactly how many people are walking around carrying the virus.

This paper asks a fundamental question: Even if we had perfect, crystal-clear data (no errors, no missing numbers), could our mathematical model actually figure out the true rules of the disease?

In the world of math, this is called structural identifiability. If a model isn't "identifiable," it's like a lock with two different keys that open it perfectly. You can't tell which key is the "real" one, so you can't know the true rules of the disease.

Here is what the authors discovered, explained through simple analogies:

1. The "Blind Spot" Problem

It is often assumed that if you can't figure out every single number in a model (like the exact speed of transmission or the exact population size), the whole model is useless.

The authors say: Not so fast.

Think of a recipe for a cake. If you only taste the frosting, you might not know exactly how much sugar or flour went into the batter. You can't identify the individual ingredients. However, you can still know exactly how sweet the cake is.

The paper shows that even when a model is "broken" and can't tell you the exact values of every single parameter, it can often still tell you the Basic Reproduction Number ( $R_0$ ).

What is $R_0$ ? It's the average number of people one sick person infects. It's the most important number for deciding if an outbreak will explode or die out.
The Finding: In almost every type of disease model they tested (from simple flu models to complex mosquito-borne diseases), the model could correctly identify $R_0$ even if it couldn't figure out the individual ingredients (like the exact transmission rate or population size).
The Takeaway: You don't need to solve the entire puzzle to know if the fire is going to spread. You just need to know the "spread factor."

2. The "One Clue" Miracle

What happens if the model is stuck and can't figure out the spread factor? The authors found a surprisingly simple fix.

Imagine you are trying to guess the height of a tree, but you only have a blurry photo of its shadow. You can't tell the height. But if someone hands you one single, perfect measurement of the tree's trunk at a specific moment, you can instantly calculate the exact height.

The paper proves that for many complex models, adding a single data point from a different source can unlock the whole mystery.

Example: If you only track daily deaths, the model might be confused about whether the disease is highly contagious but mild, or not very contagious but deadly.
The Fix: If you add a single measurement of how many people are currently sick (or recovered) at a specific time, the model becomes "globally identifiable": every parameter can then be pinned down uniquely.
The Takeaway: Public health officials don't need to spend millions tracking five different data streams constantly. They might get better results by spending resources to get one high-quality, perfect measurement from a different angle (like a blood test survey) at a key moment.

3. The "Shape" Matters

The authors also found that the shape of the model matters.

Most Models: Whether it's a simple flu model, a cholera model with water transmission, or a mosquito model, the "spread factor" ( $R_0$ ) is usually easy to find, even if the rest is hard.
The Exception: There was one tricky model where people change their behavior (like wearing masks when they see many sick people). In this specific case, the "spread factor" was hard to pin down perfectly. It's like a chameleon that changes color so fast you can't get a clear picture of its true color.

Summary of the Paper's Message

The paper challenges the old way of thinking. Instead of asking, "Is our whole model perfect and identifiable?" we should ask, "Can we identify the specific number that matters for our decision?"

Good News: Even if the model is "broken" regarding individual details, it often still works perfectly for the most important decision-making number ( $R_0$ ).
Better Strategy: If the model is stuck, you don't need more data everywhere. Often a single accurate piece of extra information from a different source is enough to fix the whole system.

In short: You don't need to see the whole forest to know if it's on fire; sometimes, just seeing one spark is enough to tell you everything you need to know.

Technical Summary: Uncovering Identifiability of Epidemiological Models

Problem Statement
Mathematical models of infectious disease dynamics are routinely fitted to surveillance data to estimate epidemiological parameters and inform public health decisions. However, there is a fundamental challenge: the observed data often capture only a partial view of the epidemic process. Even under ideal conditions (perfect, noise-free, continuous observations), the model structure and available measurements may not contain enough information to uniquely determine all parameters. This limitation is formalized by the concept of structural identifiability: a model that is not structurally identifiable cannot yield unique parameter estimates regardless of data quality or quantity.

While structural identifiability is a necessary precondition for reliable inference, current epidemiological practice often overlooks two critical nuances:

Composite Quantities: A model may be structurally non-identifiable with respect to its individual parameters, yet still yield identifiable composite quantities (e.g., the basic reproduction number, $R_0$ ) or observable state variables that are crucial for decision-making.
Minimal Complementary Data: There is a lack of systematic understanding regarding the minimal amount of additional data required from complementary streams to resolve structural non-identifiability.

Methodology
The authors employ a systematic approach combining differential algebra, numerical optimization, and theoretical proofs to investigate structural identifiability across diverse classes of compartmental epidemic models.

Differential Algebra Approach: This is the primary analytical tool, implemented in the software package StructuralIdentifiability.jl. The method involves:
1. Eliminating state variables to derive input-output equations involving only observations and parameters.
2. Checking if the coefficients of these equations form a complete set of identifiable parameter combinations.
3. Determining if individual parameters or state variables can be expressed in terms of these combinations.
4. Using Lie derivatives to assess the identifiability of state variables.
  Note: The authors acknowledge that while some steps in the software are randomized, they provide rigorous mathematical proofs or use deterministic tools (like DAISY) for verification where necessary.
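As an illustration of steps 1 and 2, the elimination can be carried out by hand for a small case. This is a sketch under assumptions that are not taken from the paper: a closed SIR model with standard incidence, and prevalence $y = I$ as the only observed output. Primes denote time derivatives.

$$
\begin{aligned}
S' &= -\tfrac{\beta}{N}\, S I, \qquad I' = \tfrac{\beta}{N}\, S I - \gamma I, \qquad y = I \\
S &= \frac{y' + \gamma y}{(\beta/N)\, y} \qquad \text{(solve the second equation for } S\text{)} \\
S' &= \frac{y\,y'' - (y')^2}{(\beta/N)\, y^2} = -\left(y' + \gamma y\right) \qquad \text{(differentiate, then use the first equation)} \\
0 &= y\,y'' - (y')^2 + \tfrac{\beta}{N}\, y^2 y' + \tfrac{\beta}{N}\gamma\, y^3
\end{aligned}
$$

With the equation normalized so that $y\,y''$ has coefficient 1, the remaining coefficients are $\beta/N$ and $\gamma\beta/N$. Under these assumptions $\beta/N$ and $\gamma$ are identifiable, while $\beta$ and $N$ enter only through their ratio, which is consistent with the SIR result reported under Key Contributions and Results.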
Numerical Visualization: To complement theoretical analysis, the authors utilize multi-start local optimization in MATLAB. They fit models to simulated, noise-free complete outbreak data from multiple initial parameter guesses. Non-uniqueness in the optimization landscape (multiple distinct parameter sets yielding equivalent fits) serves as a visual indicator of non-identifiability.
Model Classes: The study covers a broad spectrum of epidemiological models, including:
- Standard SIR models with standard incidence.
- Models with gamma-distributed latent (SEnIR) and infectious (SInR) periods.
- Models with fractional and additive mortality (SIRD).
- Models with multiple transmission pathways (Cholera: human-to-human and waterborne).
- Stratified population models with aggregated observations.
- Host-vector dynamics (Ross-Macdonald model).
- Behavioral response models with nonlinear incidence.
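The multi-start fitting described under Numerical Visualization can be sketched as follows. This is not the authors' MATLAB code: it is a minimal pure-Python stand-in that swaps in a hand-written compass (pattern) search as the local optimizer, fits only $\beta$ and $N$ of an SIR model to noise-free prevalence data, and treats $\gamma$ and the initial state as known. All numerical values and function names are hypothetical.

```python
import math

GAMMA = 0.2            # recovery rate, treated as known (simplifying assumption)
S0, I0 = 990.0, 10.0   # initial state, treated as known (simplifying assumption)
DAYS, DT = 60, 0.25    # observation window in days and RK4 step size

def simulate_prevalence(beta, n_pop):
    """Integrate the SIR model with RK4 and return I(t) at days 0..DAYS."""
    def rhs(s, i):
        new_inf = beta * s * i / n_pop
        return -new_inf, new_inf - GAMMA * i

    s, i, daily = S0, I0, [I0]
    for _ in range(DAYS):
        for _ in range(round(1 / DT)):
            k1 = rhs(s, i)
            k2 = rhs(s + 0.5 * DT * k1[0], i + 0.5 * DT * k1[1])
            k3 = rhs(s + 0.5 * DT * k2[0], i + 0.5 * DT * k2[1])
            k4 = rhs(s + DT * k3[0], i + DT * k3[1])
            s += DT * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
            i += DT * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
        daily.append(i)
    return daily

TRUE_BETA, TRUE_N = 0.5, 1000.0                 # hypothetical "true" values
DATA = simulate_prevalence(TRUE_BETA, TRUE_N)   # noise-free synthetic observations

def cost(log_params):
    """Sum of squared errors; parameters are searched on a log scale."""
    beta, n_pop = math.exp(log_params[0]), math.exp(log_params[1])
    return sum((m - d) ** 2 for m, d in zip(simulate_prevalence(beta, n_pop), DATA))

def compass_search(start, step=0.2, tol=1e-7):
    """Derivative-free local search: try +/- step per coordinate, halve on failure."""
    x, fx = list(start), cost(start)
    while step > tol:
        improved = False
        for k in range(len(x)):
            for sign in (1, -1):
                trial = list(x)
                trial[k] += sign * step
                f_trial = cost(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step *= 0.5
    return [math.exp(v) for v in x], fx

# Several initial guesses (beta, N); each one is refined by the local search.
STARTS = [(0.3, 800.0), (0.6, 900.0), (0.8, 2000.0), (0.4, 600.0)]
fits = [compass_search([math.log(b), math.log(n)]) for b, n in STARTS]

for (beta, n_pop), sse in fits:
    print(f"beta={beta:.4f}  N={n_pop:8.2f}  beta/N={beta / n_pop:.6e}  SSE={sse:.2e}")
```

If the starts end at different $(\beta, N)$ pairs that fit equally well while sharing the same ratio $\beta/N$, that is the kind of flat direction in the optimization landscape the text describes as a visual indicator of non-identifiability.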

Key Contributions and Results

1. Identifiability of Composite Quantities ( $R_0$ ) Despite Non-Identifiable Parameters
The study demonstrates that structural non-identifiability of a model does not preclude the identifiability of epidemiologically relevant composite quantities.

SIR Model: While the transmission rate ( $\beta$ ) and population size ( $N$ ) are non-identifiable (only their ratio $\beta/N$ is identifiable), the basic reproduction number $R_0$ and the effective reproduction number $R_e(t)$ are globally identifiable.
SIRD Model: In models with infection fatality ratios, individual parameters like transmission rate and fatality ratio are non-identifiable, but $R_0$ remains globally identifiable.
Cholera and Host-Vector Models: Despite complex transmission pathways and non-identifiable individual parameters (e.g., specific transmission rates or population sizes), $R_0$ is shown to be globally structurally identifiable because it decomposes into products and quotients of identifiable combinations.
Exception: The authors identify a counterexample in behavioral response models with a saturating incidence function ( $k=1$ ). In this specific case, $R_0$ is only locally identifiable, not globally, challenging the assumption that $R_0$ is always identifiable.
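A short worked example may help show how a composite quantity can be identifiable when its ingredients are not. It assumes a closed SIR model with standard incidence in which prevalence $I(t)$ is observed and $\gamma$ is identifiable; these assumptions are illustrative and may differ from the paper's exact setup.

$$
R_e(t) = \frac{\beta}{\gamma}\,\frac{S(t)}{N}, \qquad I'(t) = \frac{\beta}{N}\, S(t)\, I(t) - \gamma I(t) \;\;\Longrightarrow\;\; R_e(t) = 1 + \frac{I'(t)}{\gamma\, I(t)}
$$

The right-hand side involves only the observed curve and $\gamma$, so $R_e(t)$ is determined even though $\beta$ and $N$ are not determined separately. When the outbreak starts in an almost fully susceptible population, $S(0) \approx N$ and $R_e(0) \approx \beta/\gamma = R_0$.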

2. Minimal Complementary Data for Global Identifiability
The paper proves that incorporating minimal complementary data can resolve structural non-identifiability, often requiring as little as a single time-point measurement from an additional state variable.

SIR Model: Adding a single noise-free data point for the recovered population $R(t)$ renders the entire model globally identifiable.
SIRD Model: A single observation of $S(t)$ , $I(t)$ , or $R(t)$ resolves the non-identifiability caused by the correlation between transmission rate and fatality ratio.
Cholera Model: The data requirements depend on the observation type. Observing pathogen concentration ( $W$ ) requires two complementary data points from distinct compartments to achieve global identifiability, whereas observing pathogen inflow ( $\alpha I$ ) requires only a single data point.
Behavioral Models: A single observation of $S(t)$ or $R(t)$ converts the locally identifiable behavioral model into a globally identifiable one.
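For the SIR case, one plausible mechanism behind the single-snapshot result can be sketched numerically. The assumptions here are illustrative and not taken from the paper: prevalence $I(t)$ is the observed output, $\beta/N$ and $\gamma$ are already identified from it, and the population is closed so that $N = S + I + R$. The parameter values and variable names are hypothetical.

```python
GAMMA, BETA, N_POP = 0.2, 0.5, 1000.0   # "true" values, used only to generate data
DT = 0.01                               # RK4 step size in days

def rk4_sir(s, i, r, steps):
    """Advance the closed SIR model by `steps` RK4 steps of size DT."""
    def rhs(s, i, r):
        inf = BETA * s * i / N_POP
        return (-inf, inf - GAMMA * i, GAMMA * i)

    for _ in range(steps):
        k1 = rhs(s, i, r)
        k2 = rhs(*(v + 0.5 * DT * k for v, k in zip((s, i, r), k1)))
        k3 = rhs(*(v + 0.5 * DT * k for v, k in zip((s, i, r), k2)))
        k4 = rhs(*(v + DT * k for v, k in zip((s, i, r), k3)))
        s, i, r = (v + DT * (a + 2 * b + 2 * c + d) / 6
                   for v, a, b, c, d in zip((s, i, r), k1, k2, k3, k4))
    return s, i, r

# Noise-free states just before, at, and just after t* = 20 days.
T_STAR_STEPS = 2000
state_before = rk4_sir(990.0, 10.0, 0.0, T_STAR_STEPS - 1)
state_at = rk4_sir(*state_before, 1)
state_after = rk4_sir(*state_at, 1)

i_before, i_at, i_after = state_before[1], state_at[1], state_after[1]
r_snapshot = state_at[2]   # the single complementary data point R(t*)

# Quantities assumed to be identified already from the prevalence curve alone.
ratio = BETA / N_POP       # beta / N
gamma = GAMMA

# Reconstruct S(t*) from the observed curve using I' = (beta/N) S I - gamma I.
di_dt = (i_after - i_before) / (2 * DT)          # central difference
s_at = (di_dt + gamma * i_at) / (ratio * i_at)

# Closed population: N = S + I + R, which then separates beta from beta/N.
n_est = s_at + i_at + r_snapshot
beta_est = ratio * n_est
print(f"N estimate: {n_est:.3f}, beta estimate: {beta_est:.5f}")
```

The design point is that the snapshot of $R$ is used only once, to close the balance $N = S + I + R$; everything else comes from the prevalence curve.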

3. State Observability vs. Parameter Identifiability
The study highlights that global identifiability of parameters does not guarantee global identifiability of state variables. For instance, in a stratified SIR model where two subpopulations share identical parameters, the aggregate parameters are identifiable, but the individual state variables of the subpopulations are not; only their sums are identifiable.
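A small numerical sketch of this effect follows. The model is a hypothetical two-group SIR with homogeneous mixing and shared $\beta$ and $\gamma$, chosen for illustration; it is not necessarily the stratified model analyzed in the paper. Two different ways of splitting the same totals between the groups are simulated and compared.

```python
BETA, GAMMA, N_POP, DT = 0.5, 0.2, 1000.0, 0.05   # hypothetical values

def simulate(s1, i1, s2, i2, days=20):
    """RK4 for two fully mixed groups sharing BETA and GAMMA; returns the final state."""
    def rhs(y):
        s1, i1, s2, i2 = y
        force = BETA * (i1 + i2) / N_POP          # shared force of infection
        return (-force * s1, force * s1 - GAMMA * i1,
                -force * s2, force * s2 - GAMMA * i2)

    y = (s1, i1, s2, i2)
    for _ in range(round(days / DT)):
        k1 = rhs(y)
        k2 = rhs(tuple(v + 0.5 * DT * k for v, k in zip(y, k1)))
        k3 = rhs(tuple(v + 0.5 * DT * k for v, k in zip(y, k2)))
        k4 = rhs(tuple(v + DT * k for v, k in zip(y, k3)))
        y = tuple(v + DT * (a + 2 * b + 2 * c + d) / 6
                  for v, a, b, c, d in zip(y, k1, k2, k3, k4))
    return y

# Two different splits of the same totals: S = 990 and I = 10.
split_a = simulate(700.0, 4.0, 290.0, 6.0)
split_b = simulate(300.0, 9.0, 690.0, 1.0)

total_i_a = split_a[1] + split_a[3]
total_i_b = split_b[1] + split_b[3]
print(f"aggregate I: {total_i_a:.6f} vs {total_i_b:.6f}")
print(f"group-1 I:   {split_a[1]:.6f} vs {split_b[1]:.6f}")
```

In this setup the aggregate prevalence agrees between the two runs while the group-level states differ, so aggregated observations cannot tell the two splits apart.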

Significance and Claims
The authors frame their findings as a shift in perspective for epidemiological modeling and surveillance design:

Reframing the Identifiability Question: The central question should move from "Is the model identifiable?" to "Are the quantities that matter for decision-making identifiable?" A structurally non-identifiable model may still be "adequate for purpose" if the specific composite quantities required for public health decisions (like $R_0$ ) are identifiable.
Surveillance Design Implications: Rather than allocating limited resources to frequent monitoring of multiple data streams or relying on uncertain external parameter estimates, public health systems can strategically prioritize collecting high-quality, sparse complementary measurements. The study suggests that even a single, noise-free data point from a complementary stream (e.g., a seroprevalence survey providing a snapshot of the susceptible population) can achieve global identifiability.
Theoretical Limits: The paper acknowledges the difficulty of deriving general identifiability results for nonlinear models. While the authors establish global identifiability for models with gamma-distributed latent periods, they note that patterns for behavioral response models remain elusive and sensitive to structural details (e.g., the exponent $k$ ).

In conclusion, the study provides a methodological foundation for assessing what can be reliably inferred from epidemiological models, demonstrating that useful inference is possible even when full model identifiability is unattainable, and that minimal complementary data can often restore full identifiability.
