A capture-recapture hidden Markov model framework for register-based inference of population size and dynamics

This paper proposes a scalable hidden Markov model framework based on capture-recapture principles to accurately infer population size and dynamics from incomplete register data by simultaneously accounting for both false negative and false positive observation errors.

Lucy Y Brown, Eleni Matechou, Bruno Santos, Eleonora Mussino

Published 2026-03-27

Imagine you are trying to count the number of people living in a bustling city, but you don't have a census taker going door-to-door. Instead, you have to guess the population size by looking at a pile of scattered receipts, utility bills, and library cards left behind by the residents.

This is the challenge faced by governments using administrative registers (digital records of things like taxes, jobs, and marriages) instead of traditional censuses. The problem? These records are messy. Sometimes people are in the city but leave no paper trail (a false negative). Other times, a person has moved away, but their name stays on a bill because their spouse is still paying it, making it look like they are still there (a false positive).

This paper presents a new, super-smart way to solve this puzzle using a statistical framework called a Capture-Recapture Hidden Markov Model. Here is how it works, explained with some everyday analogies.

1. The "Ghost in the Machine" Problem

Think of the population as a group of actors on a stage.

  • The Actors: The real people.
  • The Spotlight: The administrative registers (tax records, job records, etc.).
  • The Problem: Sometimes an actor is on stage, but the spotlight misses them (False Negative). Sometimes the spotlight shines on an empty spot because a prop was left behind (False Positive).

Traditional methods often just count who is under the spotlight: if someone isn't lit up, the method assumes they aren't on the stage at all. This leads to wrong population counts.

2. The New Solution: A Detective's Notebook (Hidden Markov Models)

The authors propose treating this like a detective story. Instead of just looking at who is under the spotlight right now, they build a model that guesses what the actors are doing behind the scenes.

They use a Hidden Markov Model (HMM). Imagine a detective who knows the rules of the game:

  • If a person is "Present," they usually show up at work or pay taxes.
  • If a person is "Abroad," they usually don't show up, unless they left a bill behind (the false positive).
  • If a person is "Dead," they stop showing up entirely.

The model doesn't just look at the current evidence; it looks at the history. Did this person appear last year? Did they disappear suddenly? Did they reappear? By connecting the dots over time, the model can deduce: "Ah, this person hasn't paid taxes for two years, but their name is still on the family electricity bill. They probably moved away but didn't cancel the account. They are actually 'Abroad,' not 'Present'."
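The detective's reasoning above can be sketched in code. Below is a toy Viterbi decoder for a three-state HMM. The states match the analogy, but every number (transition probabilities, the 20% "leftover bill" false-positive rate, the 10% false-negative rate) is invented for illustration; this is not the paper's actual model or its estimates.

```python
STATES = ["Present", "Abroad", "Dead"]

# Transition probabilities between hidden states from one year to the next
# (made-up numbers; "Dead" is an absorbing state).
TRANS = {
    "Present": {"Present": 0.90, "Abroad": 0.08, "Dead": 0.02},
    "Abroad":  {"Present": 0.13, "Abroad": 0.85, "Dead": 0.02},
    "Dead":    {"Present": 0.00, "Abroad": 0.00, "Dead": 1.00},
}

# Probability of showing up in a register, given the hidden state.
# "Abroad" still leaves a signal 20% of the time (the leftover bill:
# a false positive); "Present" is missed 10% of the time (a false negative).
P_SEEN = {"Present": 0.9, "Abroad": 0.2, "Dead": 0.0}

def emit(state, seen):
    return P_SEEN[state] if seen else 1.0 - P_SEEN[state]

def viterbi(obs):
    """Most likely sequence of hidden states given yearly 0/1 sightings."""
    # Everyone starts "Present" in this toy example.
    delta = {s: (1.0 if s == "Present" else 0.0) * emit(s, obs[0])
             for s in STATES}
    back = []
    for o in obs[1:]:
        new_delta, pointers = {}, {}
        for s in STATES:
            prev, score = max(((p, delta[p] * TRANS[p][s]) for p in STATES),
                              key=lambda x: x[1])
            new_delta[s] = score * emit(s, o)
            pointers[s] = prev
        delta = new_delta
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(delta, key=delta.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Seen in year 1, missing for two years, then one last sighting (the bill):
print(viterbi([1, 0, 0, 1]))
# → ['Present', 'Abroad', 'Abroad', 'Abroad']
```

With these numbers, the model decides the final sighting is more plausibly a false positive from someone abroad than a genuine return, which is exactly the "ghost in the machine" deduction described above.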

3. The "Shape-Shifting" Crowd (Individual Heterogeneity)

Not everyone leaves the same paper trail. A student might show up in "University Records" but not "Tax Records." A retiree might show up in "Pension Records" but not "Job Records."

The paper introduces a Finite Mixture Model. Think of this as sorting the crowd into different "personas" or "types."

  • Type A: The "Active Worker" who leaves a trail in almost every register.
  • Type B: The "Quiet Resident" who only shows up in family income records.

The model learns that just because someone isn't in the "Job" register, it doesn't mean they are gone; they might just be "Type B." This prevents the model from accidentally counting active workers as missing people.
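The persona-sorting logic is just Bayes' rule. Here is a minimal sketch: the two types, their population shares, and the detection probabilities are invented numbers for illustration, not estimates from the paper.

```python
WEIGHTS = {"A": 0.7, "B": 0.3}  # assumed prior share of each persona

# Probability of appearing in each register, given the person is present.
DETECT = {
    "A": {"tax": 0.95, "job": 0.90},  # "Active Worker": visible everywhere
    "B": {"tax": 0.80, "job": 0.05},  # "Quiet Resident": rarely in job data
}

def type_posterior(pattern):
    """P(persona | observed 0/1 register pattern), via Bayes' rule."""
    score = {}
    for t, probs in DETECT.items():
        lik = 1.0
        for reg, seen in pattern.items():
            lik *= probs[reg] if seen else 1.0 - probs[reg]
        score[t] = WEIGHTS[t] * lik
    total = sum(score.values())
    return {t: s / total for t, s in score.items()}

# Seen in the tax register, missing from the job register:
post = type_posterior({"tax": 1, "job": 0})
print(post)  # post["B"] ≈ 0.77
```

Even though "Quiet Residents" are the minority here (30% prior), being absent from the job register flips the posterior toward Type B, so the model does not mistake a present-but-quiet person for someone who has left.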

4. The "Super-Computer" Trick (Bag of Little Bootstraps)

The dataset is massive—over 700,000 people tracked over 14 years. Running complex math on this much data usually takes forever, like trying to count every grain of sand on a beach one by one.

The authors use a technique called Bag of Little Bootstraps (BLB).

  • The Analogy: Imagine you want to know the average weight of all the apples in a giant warehouse. Instead of weighing every single apple (which takes days), you grab a few small handfuls (subsets), weigh them, and then use a clever math trick to simulate what would happen if you weighed the whole warehouse a thousand times.
  • This allows them to calculate the "margin of error" (how confident they are in their numbers) without needing a supercomputer to run for a month.
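The handfuls-of-apples procedure can be sketched as follows: take a few small subsets, and within each one, reweight the points so they stand in for the full dataset. All sizes and settings below are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=20_000)  # the "warehouse"
n = len(data)
b = int(n ** 0.6)  # handful size: a few hundred apples instead of 20,000

lows, highs = [], []
for _ in range(8):                       # a few small handfuls (subsets)
    handful = rng.choice(data, size=b, replace=False)
    means = []
    for _ in range(50):                  # cheap resampling within a handful:
        # draw multinomial counts so the b points act like n points --
        # we never have to touch the full dataset again
        counts = rng.multinomial(n, np.full(b, 1.0 / b))
        means.append(counts @ handful / n)
    lows.append(np.percentile(means, 2.5))
    highs.append(np.percentile(means, 97.5))

# Average the interval endpoints across handfuls -> final 95% interval.
ci = (float(np.mean(lows)), float(np.mean(highs)))
print(ci)
```

The key trick is in the multinomial step: each tiny resample is weighted to behave like a full n-sized bootstrap sample, so the margin of error reflects the whole "warehouse" while the arithmetic only ever touches a handful.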

5. The Swedish Case Study: The "Over-Count" Mystery

The authors tested this on Swedish data. Sweden has an excellent system: everyone living in the country must register. But many people move away and forget to "de-register."

  • The Old Way: Count everyone still on the list. Result: You think there are more people living there than there actually are (Overcoverage).
  • The New Way: The model spots the "ghosts." It sees people who haven't worked or paid taxes in years but are still on the family income list. It correctly identifies them as "Abroad."

The Result: The model found that the "Overcoverage" (people on the list who aren't actually there) was higher than previously thought, especially for certain groups like people from Denmark or Norway who move back and forth frequently.

Why Does This Matter?

If a government thinks there are 100,000 people in a city but there are actually only 90,000, they might build too many schools or hospitals, wasting money. Or, if they think there are fewer people than there are, they might not provide enough resources.

This paper gives statisticians a powerful, flexible lens to see through the noise of administrative data. It separates the real people from the "ghosts" left behind by bureaucracy, giving us a much clearer picture of how populations really move, grow, and shrink.

In short: It's like upgrading from a blurry, static photo of a crowd to a high-definition, 3D movie that tracks every individual's movement, even when they try to hide in the shadows.