Handling onset age inconsistencies in longitudinal healthcare survey data

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to build a massive, detailed map of everyone's health history using a survey. You ask people, "When did you first get sick with Condition X?" You ask them again five or ten years later.

Ideally, they should say the same thing both times. But in reality, human memory is like a leaky bucket. Sometimes people forget, sometimes they guess, and sometimes they just get the date wrong. One person might say, "I got diabetes at 45," and five years later, they say, "Oh, I think it was actually 52."

This creates a mess for scientists. If you throw away all the answers that don't match, you lose a large share of your data. If you keep them, your map is full of holes and errors.

This paper is about two clever ways to fix this "leaky bucket" problem without throwing away the water.

The Problem: The "Fuzzy Memory" Effect

The researchers looked at a huge Canadian health study (CanPath) with nearly 100,000 people. They found that 57% of participants gave at least one inconsistent answer about when a condition started when asked again later. It's like asking someone, "What year did you graduate high school?" and getting a different answer the second time you ask.

Solution 1: The "Trust Score" (Stratification)

The Analogy: Sorting the Witnesses

Imagine you are a judge trying to solve a mystery. You have 100 witnesses. Some witnesses are known to be very sharp and consistent; others are known to be forgetful or prone to exaggeration.

Instead of listening to everyone equally, you decide to split the witnesses into two groups:

The "Sharp" Group: People who gave consistent answers across all their health questions.
The "Fuzzy" Group: People whose answers jumped around a lot.

How they did it:
The authors created a "Reliability Score" for every single person. They looked at how much a person's answers changed over time. If your answers were all over the place, you got a low score. If you were consistent, you got a high score.

The Result:
When the researchers looked only at the "Sharp" group, the patterns in the data became crystal clear.

Better Connections: They could see that diseases like high blood pressure and heart attacks were more tightly linked in the "Sharp" group than in the "Fuzzy" group.
Clearer Clusters: It was like looking at a blurry photo and suddenly putting on glasses. Diseases that belong together (like different types of gut issues) started grouping together naturally, whereas in the "Fuzzy" group they were more scattered.
Better Predictions: When they tried to predict the age at which a condition starts, the models trained on the "Sharp" group were more accurate. For yes/no predictions the results were mixed: physical conditions such as diabetes improved, but mental health conditions such as depression sometimes did not.

When to use this: Use this if you have a huge crowd of people and you can afford to ignore the "fuzzy" ones to get a cleaner picture.

Solution 2: The "Time-Traveling Detective" (Bayesian Adjustment)

The Analogy: The Detective's Best Guess

Now, imagine you can't throw anyone away. Maybe you only have a small group of people, or you need every single data point. What do you do with the person who said "45" the first time and "52" the second time?

Instead of picking one or the other, the researchers acted like a detective using a special math formula (Bayesian statistics). They treated the "true" age of onset as a secret that is hidden behind two noisy clues.

Clue A: The answer given at the start (Age 45).
Clue B: The answer given later (Age 52).

The detective knows that memory gets worse as people get older and as more time passes. So, the math formula weighs the clues. It asks: "Given that the person is older now, and given that memory fades, what is the most likely true age?"

It doesn't just pick 45 or 52. It calculates a "corrected" age somewhere in between (maybe 47) that leans toward the answer judged more reliable, which is usually the earlier one. It essentially smooths out the bumps in the road.

The Result:

Stronger Links: Just like the "Trust Score" method, this method made the connections between related diseases stronger and more logical.
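Supercharged Predictions: When they used these "corrected" ages to predict things like the age at which diabetes starts, the predictions improved. The more variables they fixed at once, the better the results became: correcting the onset ages for both high blood pressure and high cholesterol cut the average error by about 18%.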

When to use this: Use this if you have a small group of people and can't afford to lose any data, or if you need to keep the uncertainty in your calculations (like a detective noting how sure they are about their guess).

The Big Picture: Which Tool to Use?

The authors give a simple guide for doctors and researchers:

Use the "Trust Score" (Solution 1) if: You have a massive dataset (like a stadium full of people). You can afford to set aside the "fuzzy" answers to get a super-clear view of the "sharp" ones. It's fast and easy to explain.
Use the "Time-Traveling Detective" (Solution 2) if: You have a smaller group, or you need to keep every single person in the study. It's more complex math, but it saves your data and gives you a "corrected" version of the truth.

Why This Matters

Health surveys are the backbone of modern medicine. If the data is messy, our understanding of disease is messy. By fixing these inconsistencies, this paper helps scientists draw a clearer, more accurate map of human health. It's the difference between navigating with a blurry, torn map versus a high-definition GPS.

1. Problem Statement

Longitudinal healthcare surveys, such as the Canadian Partnership for Tomorrow's Health (CanPath), rely heavily on self-reported data regarding the age of onset for various medical conditions. A critical challenge in these datasets is onset age inconsistency, where participants report different ages for the same condition across different survey waves (e.g., enrollment vs. follow-up).

Causes: These inconsistencies stem from measurement errors, including memory lapses, recall bias, and careless responding.
Current Limitations: Existing approaches to handle this data are insufficient:
- Discarding data: Removing inconsistent records leads to substantial data loss.
- Deterministic rules: Simple rule-based adjudication (e.g., always taking the earliest or latest date) fails to quantify uncertainty or account for the magnitude of error.
- Lack of granularity: Previous studies quantify reliability at the disease level but fail to provide participant-level metrics or statistically principled adjustments for existing datasets.
Impact: Inconsistencies introduce measurement error that attenuates effect estimates, weakens correlations between biologically related conditions, and degrades predictive model performance.
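
The attenuation effect described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the onset ages, the noise levels, and the `report` helper are hypothetical choices made for illustration and do not come from the paper or from CanPath.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical "true" onset ages for two related conditions (illustrative values only).
true_a = rng.normal(50, 8, n)
true_b = true_a + rng.normal(3, 4, n)  # condition B tends to follow condition A


def report(true_age, recall_sd):
    """Simulate a self-report: the true age plus zero-mean recall noise."""
    return true_age + rng.normal(0, recall_sd, true_age.shape)


# Correlation between the true ages versus between noisy self-reports.
corr_true = np.corrcoef(true_a, true_b)[0, 1]
corr_noisy = np.corrcoef(report(true_a, 6), report(true_b, 6))[0, 1]

print(f"correlation of true ages:     {corr_true:.2f}")
print(f"correlation of reported ages: {corr_noisy:.2f}")
```

Because the recall noise adds variance to each reported age without adding any shared signal, the correlation between the reported ages comes out weaker than the correlation between the true ages.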

2. Methodology

The authors propose two distinct methods to address these inconsistencies using data from 97,408 CanPath participants (57.1% of whom exhibited at least one onset age inconsistency).

Method A: Reliability Score-Based Stratification

This approach aggregates inconsistency patterns to create a participant-level metric for data quality, allowing researchers to filter or stratify the cohort.

Data Construction: An age difference matrix $D$ is constructed where $D_{ij} = X^{(f)}_{ij} - X^{(e)}_{ij}$ (follow-up age minus enrollment age), with $i$ indexing participants and $j$ indexing conditions.
Matrix Completion: Missing values in $D$ are imputed using SoftImpute.
Dimension Reduction: Principal Component Analysis (PCA) is applied to the absolute difference matrix to capture the dominant patterns of inconsistency. Taking absolute values reflects the assumption that reliability depends on the magnitude of the discrepancy, not the direction (over- vs. under-reporting).
Score Construction: A raw reliability score ($r_i$) is computed as a weighted sum of the absolute PCA component scores. Higher raw scores indicate greater deviation from consistent patterns.
Normalization & Stratification: Scores are quantile-normalized to $[0, 1]$ (inverted so higher = more reliable). Participants are stratified into high-reliability and low-reliability cohorts (typically split at the median) for downstream analysis.
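
The five steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the basic soft-thresholded SVD loop stands in for SoftImpute, and the function names, the penalty `lam`, the number of components, and the use of explained-variance ratios as weights are all assumptions, since this summary does not specify them. The demo data are synthetic.

```python
import numpy as np


def soft_impute(M, lam=1.0, n_iter=100, tol=1e-5):
    """Fill NaNs in M by iterative soft-thresholded SVD (a basic SoftImpute-style loop)."""
    mask = ~np.isnan(M)
    Z = np.where(mask, M, 0.0)  # start missing cells at zero
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = (U * np.maximum(s - lam, 0.0)) @ Vt  # shrink singular values
        Z_new = np.where(mask, M, low_rank)  # keep observed cells, update missing ones
        converged = np.linalg.norm(Z_new - Z) <= tol * (np.linalg.norm(Z) + 1e-12)
        Z = Z_new
        if converged:
            break
    return Z


def reliability_scores(x_enroll, x_follow, n_components=2, lam=1.0):
    """Participant-level reliability in [0, 1]; higher means more consistent."""
    D = x_follow - x_enroll  # D_ij: follow-up age minus enrollment age
    A = np.abs(soft_impute(D, lam=lam))  # magnitude of discrepancy, direction dropped
    A_centered = A - A.mean(axis=0)
    U, s, _ = np.linalg.svd(A_centered, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # PCA component scores
    weights = s[:n_components] ** 2 / np.sum(s ** 2)  # assumed: explained-variance ratios
    raw = np.abs(scores) @ weights  # r_i: higher = less consistent
    ranks = raw.argsort().argsort()  # 0 .. n-1
    quantile = ranks / (len(raw) - 1)  # map ranks onto [0, 1]
    return 1.0 - quantile  # invert so higher = more reliable


# Synthetic demo: 170 consistent reporters followed by 30 inconsistent ones.
rng = np.random.default_rng(1)
n, p = 200, 5
x_e = rng.normal(45, 10, (n, p))
noise_sd = np.where(np.arange(n) < 170, 0.5, 6.0)[:, None]
x_f = x_e + rng.normal(0, 1, (n, p)) * noise_sd
x_f[rng.random((n, p)) < 0.2] = np.nan  # conditions with no follow-up report

score = reliability_scores(x_e, x_f)
high_reliability = score >= np.median(score)  # median split into two cohorts
```

In this toy setup the inconsistent reporters end up with the lowest scores, so the median split places them in the low-reliability cohort. Real cohorts would need a proper SoftImpute implementation and a considered choice of penalty and component count.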

Method B: Bayesian Adjustment

This approach models the inconsistency as a measurement error problem to produce adjusted estimates for the "true" latent onset age.

Latent Variable Model: The true onset age ($X^*_{ij}$) is treated as a latent variable. The observed enrollment ($X^{(e)}_{ij}$) and follow-up ($X^{(f)}_{ij}$) ages are modeled as noisy observations:
$X^{(e)}_{ij} \sim \mathcal{N}\left(X^*_{ij}, \left(\sigma^{(e)}_j\right)^2\right)$
$X^{(f)}_{ij} \sim \mathcal{N}\left(X^*_{ij}, \left(\sigma^{(f)}_j\right)^2\right)$
Variance Parameterization: The model explicitly accounts for age-dependent and inter-survey time effects:
- Variance increases with the participant's age at enrollment ($\alpha_{j1} \geq 0$).
- Variance increases with the time gap between surveys ($\delta_{j1} \geq 0$).
- Follow-up variance is constrained to be higher than enrollment variance.
Parameter Estimation: Variance parameters are estimated by maximizing the log-likelihood of the observed age differences ($D_{ij}$).
Posterior Imputation: Assuming a diffuse prior, the posterior distribution of the true onset age is derived. The adjusted value is the precision-weighted average of the two observations, giving more weight to the observation with lower estimated variance (typically the enrollment report).
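
For two Gaussian observations of the same latent value under a flat prior, the posterior is itself Gaussian, with mean $\left(X^{(e)}/\sigma_e^2 + X^{(f)}/\sigma_f^2\right) / \left(1/\sigma_e^2 + 1/\sigma_f^2\right)$ and variance $1 / \left(1/\sigma_e^2 + 1/\sigma_f^2\right)$. If the two errors are also independent, the difference $D_{ij}$ has mean zero and variance $\sigma_e^2 + \sigma_f^2$, which is what makes a likelihood on the differences possible. The sketch below assumes exactly that setup. The function names are hypothetical, and the variances are passed in as plain numbers because this summary does not give the paper's exact age- and time-dependent parameterization.

```python
import numpy as np


def adjust_onset_age(x_enroll, x_follow, var_enroll, var_follow):
    """Posterior mean and variance of the latent onset age under a flat prior."""
    prec_e = 1.0 / np.asarray(var_enroll, dtype=float)  # precision = 1 / variance
    prec_f = 1.0 / np.asarray(var_follow, dtype=float)
    post_var = 1.0 / (prec_e + prec_f)
    post_mean = post_var * (prec_e * x_enroll + prec_f * x_follow)
    return post_mean, post_var


def diff_log_likelihood(d, var_enroll, var_follow):
    """Gaussian log-likelihood of differences D = X_f - X_e, assuming independent errors."""
    d = np.asarray(d, dtype=float)
    total_var = np.asarray(var_enroll, dtype=float) + np.asarray(var_follow, dtype=float)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * total_var) + d ** 2 / total_var)))


# Illustrative values only: reports of 45 and 52, with the follow-up report
# assumed to be noisier (variance 16) than the enrollment report (variance 4).
post_mean, post_var = adjust_onset_age(45.0, 52.0, var_enroll=4.0, var_follow=16.0)
print(f"adjusted onset age: {post_mean:.1f}, posterior sd: {post_var ** 0.5:.2f}")
```

With these illustrative variances the adjusted age is 46.4, closer to the enrollment report, and the posterior variance (3.2) is smaller than either input variance. That posterior variance is the quantity a downstream analysis would carry forward if it needs to propagate uncertainty.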

3. Key Contributions

Participant-Level Reliability Metrics: Unlike prior work focusing on disease-level reliability, this paper introduces a scalable procedure to generate individual reliability scores, enabling cohort stratification.
Statistically Principled Adjustment: The proposed Bayesian framework moves beyond deterministic rules by modeling the specific structure of recall error (age and time dependence) and providing uncertainty-aware adjusted estimates.
Comprehensive Evaluation: Both methods are rigorously tested on association discovery (correlations, clustering) and predictive modeling (classification, regression) tasks.
Practical Guidance: The authors provide clear criteria for practitioners to choose between stratification (for large datasets) and Bayesian adjustment (for limited samples or complex variable interactions).

4. Results

The methods were evaluated on the CanPath dataset across multiple tasks:

A. Association Discovery & Disease Clustering

Stronger Correlations: In the high-reliability cohort (Method A), pairwise correlations between biologically related conditions (e.g., asthma and high blood pressure) were consistently stronger than in the low-reliability cohort (differences ranging from 0.02 to 0.72).
Coherent Clustering: Disease onset networks constructed from the high-reliability cohort showed better biological coherence on two metrics:
- Dominant-category proportion: The proportion of diseases in the dominant medical category within a cluster increased from 30.9% (low-reliability) to 43.8% (high-reliability).
- Cluster entropy: Entropy decreased from 2.23 to 1.86, indicating tighter grouping of related conditions (e.g., gastrointestinal diseases clustered together only in the high-reliability group).
Bayesian Adjustment: Applying Bayesian adjustments to inconsistent variables also strengthened correlations between biologically linked pairs (e.g., Anxiety/Depression, High BP/Heart Attack) compared to using raw enrollment or follow-up data.
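
The two cluster coherence metrics reported above can be computed along the following lines. This is a hedged sketch: the summary does not state how the paper aggregates across clusters or which logarithm base it uses, so averaging over clusters and measuring entropy in bits are assumptions, and `cluster_coherence` and the toy labels are hypothetical.

```python
from collections import Counter

import numpy as np


def cluster_coherence(cluster_ids, categories):
    """Return (mean dominant-category proportion, mean Shannon entropy in bits)."""
    by_cluster = {}
    for cluster, category in zip(cluster_ids, categories):
        by_cluster.setdefault(cluster, []).append(category)

    dominant, entropy = [], []
    for members in by_cluster.values():
        # Share of each medical category among the diseases in this cluster.
        p = np.array(list(Counter(members).values()), dtype=float) / len(members)
        dominant.append(p.max())
        entropy.append(float(-(p * np.log2(p)).sum()))
    return float(np.mean(dominant)), float(np.mean(entropy))


# Toy example: six diseases assigned to two clusters.
clusters = [0, 0, 0, 0, 1, 1]
cats = ["GI", "GI", "GI", "cardio", "mental", "cardio"]
dom, ent = cluster_coherence(clusters, cats)
print(f"dominant-category proportion: {dom:.3f}, entropy: {ent:.3f} bits")
```

A higher dominant-category proportion and a lower entropy both indicate that clusters are made up mostly of diseases from a single medical category.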

B. Predictive Modeling

Regression Tasks: The high-reliability cohort consistently yielded lower prediction errors (MAE and RMSE) for predicting onset ages (e.g., depression, hearing loss).
Classification Tasks: Results were mixed. While high-reliability cohorts improved predictions for physical conditions (diabetes, high blood sugar), they sometimes underperformed for mental health variables (depression), suggesting mental health recall patterns differ from physical conditions.
Bayesian Adjustment Benefits:
- Consistently improved predictive performance across all tasks.
- Compounding Effect: The greatest gains were observed when multiple inconsistent variables were adjusted simultaneously. For example, predicting diabetes onset age saw an 18% reduction in MAE (from 5.65 to 4.63 years) when both high blood pressure and high cholesterol onset ages were adjusted.
- The uncertainty introduced by the adjustment (measured by confidence interval widening) was modest relative to the gains in point estimate accuracy.
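
For reference, the error metrics used in these comparisons and the arithmetic behind the reported 18% figure can be checked in a few lines. The helper names are hypothetical; the only numbers taken from the text are the two MAE values (5.65 and 4.63 years).

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def relative_reduction(before, after):
    """Fractional drop in an error metric, e.g. 0.18 for an 18% reduction."""
    return (before - after) / before


# MAE for diabetes onset age before and after adjusting both predictors.
reduction = relative_reduction(5.65, 4.63)
print(f"MAE reduction: {reduction:.1%}")
```

The drop of 1.02 years from a baseline of 5.65 years works out to roughly 18%, matching the figure quoted above.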

5. Significance and Conclusion

This paper addresses a pervasive but often overlooked issue in longitudinal health research: the degradation of data quality due to inconsistent self-reports.

Scientific Impact: By filtering for high-reliability participants or adjusting for measurement error, researchers can recover stronger biological signals, leading to more accurate disease clustering and association discovery.
Methodological Innovation: The work bridges the gap between simple data cleaning and complex statistical modeling, offering a flexible toolkit for handling measurement error in survey data.
Practical Utility:
- Stratification is recommended for large-scale studies where excluding low-quality data is feasible and ease of deployment is prioritized.
- Bayesian Adjustment is preferred for smaller sample sizes, when uncertainty propagation is required, or when dealing with mental health variables where exclusion might introduce bias.

The authors conclude that these methods substantially strengthen the utility of longitudinal healthcare surveys, enabling more robust epidemiological insights and predictive modeling. Future work aims to extend these methods to multiple survey waves and other types of longitudinal inconsistencies (e.g., status changes from "yes" to "no").