A capture-recapture hidden Markov model framework for register-based inference of population size and dynamics

This paper proposes a scalable hidden Markov model framework based on capture-recapture principles to accurately infer population size and dynamics from incomplete register data by simultaneously accounting for both false negative and false positive observation errors.

Lucy Y Brown, Eleni Matechou, Bruno Santos, Eleonora Mussino

Published 2026-03-27

Imagine you are trying to count the number of people living in a bustling city, but you don't have a census taker going door-to-door. Instead, you have to guess the population size by looking at a pile of scattered receipts, utility bills, and library cards left behind by the residents.

This is the challenge faced by governments using administrative registers (digital records of things like taxes, jobs, and marriages) instead of traditional censuses. The problem? These records are messy. Sometimes people are in the city but leave no paper trail (a false negative). Other times, a person has moved away, but their name stays on a bill because their spouse is still paying it, making it look like they are still there (a false positive).

This paper presents a new, super-smart way to solve this puzzle using a statistical framework called a Capture-Recapture Hidden Markov Model. Here is how it works, explained with some everyday analogies.

1. The "Ghost in the Machine" Problem

Think of the population as a group of actors on a stage.

  • The Actors: The real people.
  • The Spotlight: The administrative registers (tax records, job records, etc.).
  • The Problem: Sometimes an actor is on stage, but the spotlight misses them (False Negative). Sometimes the spotlight shines on an empty spot because a prop was left behind (False Positive).

Traditional methods often just count who is under the spotlight: if someone isn't lit up, the method assumes they aren't on the stage at all. This leads to wrong population counts.

2. The New Solution: A Detective's Notebook (Hidden Markov Models)

The authors propose treating this like a detective story. Instead of just looking at who is under the spotlight right now, they build a model that guesses what the actors are doing behind the scenes.

They use a Hidden Markov Model (HMM). Imagine a detective who knows the rules of the game:

  • If a person is "Present," they usually show up at work or pay taxes.
  • If a person is "Abroad," they usually don't show up, unless they left a bill behind (the false positive).
  • If a person is "Dead," they stop showing up entirely.

The model doesn't just look at the current evidence; it looks at the history. Did this person appear last year? Did they disappear suddenly? Did they reappear? By connecting the dots over time, the model can deduce: "Ah, this person hasn't paid taxes for two years, but their name is still on the family electricity bill. They probably moved away but didn't cancel the account. They are actually 'Abroad,' not 'Present'."
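The detective's reasoning above can be sketched in code. Below is a toy Viterbi decoder for a three-state HMM. The states match the analogy, but every number (transition probabilities, the 20% "leftover bill" false-positive rate, the 10% false-negative rate) is invented for illustration; this is not the paper's actual model or its estimates.

```python
STATES = ["Present", "Abroad", "Dead"]

# Transition probabilities between hidden states from one year to the next
# (made-up numbers; "Dead" is an absorbing state).
TRANS = {
    "Present": {"Present": 0.90, "Abroad": 0.08, "Dead": 0.02},
    "Abroad":  {"Present": 0.13, "Abroad": 0.85, "Dead": 0.02},
    "Dead":    {"Present": 0.00, "Abroad": 0.00, "Dead": 1.00},
}

# Probability of showing up in a register, given the hidden state.
# "Abroad" still leaves a signal 20% of the time (the leftover bill:
# a false positive); "Present" is missed 10% of the time (a false negative).
P_SEEN = {"Present": 0.9, "Abroad": 0.2, "Dead": 0.0}

def emit(state, seen):
    return P_SEEN[state] if seen else 1.0 - P_SEEN[state]

def viterbi(obs):
    """Most likely sequence of hidden states given yearly 0/1 sightings."""
    # Everyone starts "Present" in this toy example.
    delta = {s: (1.0 if s == "Present" else 0.0) * emit(s, obs[0])
             for s in STATES}
    back = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            prev, score = max(((p, delta[p] * TRANS[p][s]) for p in STATES),
                              key=lambda x: x[1])
            new_delta[s] = score * emit(s, o)
            pointers[s] = prev
        delta = new_delta
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(delta, key=delta.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Seen in year 1, missing for two years, then one last sighting (the bill):
print(viterbi([1, 0, 0, 1]))
# → ['Present', 'Abroad', 'Abroad', 'Abroad']
```

With these numbers, the model decides the final sighting is more plausibly a false positive from someone abroad than a genuine return, which is exactly the "ghost in the machine" deduction described above.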

3. The "Shape-Shifting" Crowd (Individual Heterogeneity)

Not everyone leaves the same paper trail. A student might show up in "University Records" but not "Tax Records." A retiree might show up in "Pension Records" but not "Job Records."

The paper introduces a Finite Mixture Model. Think of this as sorting the crowd into different "personas" or "types."

  • Type A: The "Active Worker" who leaves a trail in almost every register.
  • Type B: The "Quiet Resident" who only shows up in family income records.

The model learns that just because someone isn't in the "Job" register, it doesn't mean they are gone; they might just be "Type B." This prevents the model from accidentally counting active workers as missing people.
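The persona-sorting logic is just Bayes' rule. Here is a minimal sketch: the two types, their population shares, and the detection probabilities are invented numbers for illustration, not estimates from the paper.

```python
WEIGHTS = {"A": 0.7, "B": 0.3}  # assumed prior share of each persona

# Probability of appearing in each register, given the person is present.
DETECT = {
    "A": {"tax": 0.95, "job": 0.90},  # "Active Worker": visible everywhere
    "B": {"tax": 0.80, "job": 0.05},  # "Quiet Resident": rarely in job data
}

def type_posterior(pattern):
    """P(persona | observed 0/1 register pattern), via Bayes' rule."""
    score = {}
    for t, probs in DETECT.items():
        lik = 1.0
        for reg, seen in pattern.items():
            lik *= probs[reg] if seen else 1.0 - probs[reg]
        score[t] = WEIGHTS[t] * lik
    total = sum(score.values())
    return {t: s / total for t, s in score.items()}

# Seen in the tax register, missing from the job register:
post = type_posterior({"tax": 1, "job": 0})
print(post)  # post["B"] ≈ 0.77
```

Even though "Quiet Residents" are the minority here (30% prior), being absent from the job register flips the posterior toward Type B, so the model does not mistake a present-but-quiet person for someone who has left.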

4. The "Super-Computer" Trick (Bag of Little Bootstraps)

The dataset is massive—over 700,000 people tracked over 14 years. Running complex math on this much data usually takes forever, like trying to count every grain of sand on a beach one by one.

The authors use a technique called Bag of Little Bootstraps (BLB).

  • The Analogy: Imagine you want to know the average weight of all the apples in a giant warehouse. Instead of weighing every single apple (which takes days), you grab a few small handfuls (subsets), weigh them, and then use a clever math trick to simulate what would happen if you weighed the whole warehouse a thousand times.
  • This allows them to calculate the "margin of error" (how confident they are in their numbers) without needing a supercomputer to run for a month.
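The handfuls-of-apples procedure can be sketched as follows: take a few small subsets, and within each one, reweight the points so they stand in for the full dataset. All sizes and settings below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=20_000)  # the "warehouse"
n = len(data)
b = int(n ** 0.6)  # handful size: a few hundred apples instead of 20,000

lows, highs = [], []
for _ in range(8):                       # a few small handfuls (subsets)
    handful = rng.choice(data, size=b, replace=False)
    means = []
    for _ in range(50):                  # cheap resampling within a handful:
        # draw multinomial counts so the b points act like n points --
        # we never have to touch the full dataset again
        counts = rng.multinomial(n, np.full(b, 1.0 / b))
        means.append(counts @ handful / n)
    lows.append(np.percentile(means, 2.5))
    highs.append(np.percentile(means, 97.5))

# Average the interval endpoints across handfuls -> final 95% interval.
ci = (float(np.mean(lows)), float(np.mean(highs)))
print(ci)
```

The key trick is in the multinomial step: each tiny resample is weighted to behave like a full n-sized bootstrap sample, so the margin of error reflects the whole "warehouse" while the arithmetic only ever touches a handful.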

5. The Swedish Case Study: The "Over-Count" Mystery

The authors tested this on Swedish data. Sweden has an excellent system: everyone living in the country must register. But many people move away and forget to "de-register."

  • The Old Way: Count everyone still on the list. Result: You think there are more people living there than there actually are (Overcoverage).
  • The New Way: The model spots the "ghosts." It sees people who haven't worked or paid taxes in years but are still on the family income list. It correctly identifies them as "Abroad."

The Result: The model found that the "Overcoverage" (people on the list who aren't actually there) was higher than previously thought, especially for certain groups like people from Denmark or Norway who move back and forth frequently.

Why Does This Matter?

If a government thinks there are 100,000 people in a city but there are actually only 90,000, they might build too many schools or hospitals, wasting money. Or, if they think there are fewer people than there are, they might not provide enough resources.

This paper gives statisticians a powerful, flexible lens to see through the noise of administrative data. It separates the real people from the "ghosts" left behind by bureaucracy, giving us a much clearer picture of how populations really move, grow, and shrink.

In short: It's like upgrading from a blurry, static photo of a crowd to a high-definition, 3D movie that tracks every individual's movement, even when they try to hide in the shadows.