Evaluating differential item functioning in the EQ-5D-5L in acute ischemic stroke


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a judge trying to decide which of two doctors is better at helping patients recover from a stroke. To do this fairly, you ask the patients to rate their own health-related quality of life using a specific checklist called the EQ-5D-5L. This checklist asks five simple questions covering walking, washing and dressing yourself, doing usual activities such as chores, feeling pain, and feeling anxious.

But here's the catch: What if the checklist itself is biased?

What if an 80-year-old and a 60-year-old with the exact same level of health recovery answer the questions differently, not because one is actually sicker, but because they interpret the words differently? For example, an 80-year-old might think, "Well, I can't wash my back as fast as I used to, so I'll mark 'some problems,'" while a 60-year-old with the same limitation might think, "I can still wash myself, so I'll mark 'no problems'."

If this happens, the checklist is "broken" for comparing groups. In the world of science, this is called Differential Item Functioning (DIF). It's like using a ruler that stretches differently depending on who is holding it.

The Study: Checking the Ruler

The researchers in this paper took a huge group of stroke patients (over 1,200 people) who were part of a major trial comparing two clot-busting drugs (Alteplase vs. Tenecteplase). They wanted to see if the EQ-5D-5L checklist was fair when comparing:

Men vs. Women
Drug A vs. Drug B
Younger patients (<80) vs. Older patients (≥80)

They used a sophisticated statistical "microscope" (called Item Response Theory) to look closely at how people answered.

The Findings: The Verdict

1. Men vs. Women & Drug A vs. Drug B: No Detectable Stretch.
The study found no statistically significant evidence that the checklist works differently for men and women (p = 0.063) or for people taking either drug (p = 0.379). In other words, a man and a woman with the same underlying health tend to give the same answers. This supports trusting the trial results that compare the two drugs.

2. Young vs. Old: A Tiny Stretch.
This is where it gets interesting. The study found that the checklist did behave slightly differently for older people (80+) compared to younger people.

The "Stretch": Older people tended to report more trouble with self-care (like washing/dressing) and usual activities (like housework) than younger people with the same actual health level.
The Analogy: Imagine a ruler whose markings sit slightly closer together when an older person holds it, so the same object reads a little longer. On these two questions, older patients' answers show a bit more difficulty than their overall health would predict.
The Reality Check: However, the researchers measured how much this actually mattered. They found that even though the "stretch" was statistically detectable on individual questions, its effect on the overall score was tiny in the real world.
- Scores calculated with and without a correction for the "stretch" were almost perfectly correlated (r = 0.98).
- It's like measuring a marathon runner's time with a watch that is off by 0.1 seconds. Technically, it's not perfect, but it doesn't change who won the race.

Why This Matters

In the past, scientists might have panicked and said, "Oh no, the test is biased! We can't use these results!"

But this paper says: "Relax."
Even though older people interpret the questions slightly differently (perhaps because they have lower expectations for what they should be able to do), the overall score they get is still a very accurate reflection of their health.

The Bottom Line

The EQ-5D-5L checklist is a fair and reliable tool for stroke trials.

You can compare men and women without worry.
You can compare different drugs without worry.
You can compare young and old patients without needing to do complex math corrections.

The "bias" found in older patients is so small that it doesn't distort the big picture. It's a minor quirk in the ruler, not a broken tool. This gives doctors and researchers confidence that when they use this checklist to measure recovery, they are seeing the truth about the patients' health, not just a measurement error.

1. Problem Statement

Health-related quality of life (HRQOL) is a critical secondary endpoint in stroke clinical trials, with the EQ-5D-5L (five-level EuroQol questionnaire) being the most widely adopted Patient-Reported Outcome Measure (PROM). However, a significant psychometric gap exists: PROMs in stroke trials are rarely evaluated for Differential Item Functioning (DIF).

Definition of DIF: DIF occurs when individuals with the same underlying latent trait (true HRQOL) interpret and respond to questionnaire items differently based on group characteristics (e.g., age, sex, treatment).
The Risk: If DIF is present, observed differences in HRQOL scores between subgroups may reflect measurement artifacts rather than true health differences. This can lead to biased treatment comparisons, reduced statistical power, and erroneous conclusions in randomized controlled trials (RCTs).
Knowledge Gap: While DIF in the EQ-5D has been studied in observational settings, there is limited evidence regarding its impact on treatment effect conclusions in acute ischemic stroke RCTs, specifically across age, sex, and treatment arms.
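The definition of DIF above can be written compactly. This is a standard textbook formulation rather than notation taken from the preprint: the symbols $X_i$ (response to item $i$), $k$ (response level), $\theta$ (latent HRQOL), and $G$ (group membership) are introduced here for illustration only. Item $i$ is free of DIF with respect to $G$ when

$$
P(X_i = k \mid \theta, G = g) = P(X_i = k \mid \theta) \quad \text{for every level } k, \text{ every group } g, \text{ and every } \theta .
$$

DIF is present when this equality fails for at least one group at some level of $\theta$. A difference in the distribution of $\theta$ between groups is not DIF by itself; it is a true difference in health status.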

2. Methodology

The study utilized data from the AcT trial (Alteplase Compared to Tenecteplase), a pragmatic, multicenter, registry-linked RCT comparing intravenous tenecteplase vs. alteplase in acute ischemic stroke patients.

Data Source: 1,264 patients with complete EQ-5D-5L data at 90 days post-stroke (from an initial cohort of 1,577).
Subgroups Analyzed:
- Age: <80 years vs. ≥80 years.
- Sex: Male vs. Female.
- Treatment: Alteplase vs. Tenecteplase.
Statistical Approach:
- Model: Graded Response Model (GRM), an Item Response Theory (IRT) approach suitable for ordinal data.
- Fit Indices: Model adequacy was assessed using Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), and Standardized Root Mean Square Residual (SRMSR).
- DIF Detection: A multigroup Wald-based sweep procedure with structural parameter adjustment. This advanced method accounts for true differences in underlying health status (latent trait distributions) between groups, preventing the confounding of group mean differences with item bias.
- Effect Size Quantification: Used Signed Weighted Area Between Curves (sWABC) to measure the practical magnitude of DIF.
  - Thresholds: |sWABC| < 0.10 (negligible), 0.10–0.29 (small), 0.30–0.49 (moderate), ≥0.50 (large).
- Scale-Level Analysis: Calculated Differential Test Functioning (DTF) to assess bias at the total score level.
- Software: R version 4.5.1 using the mirt package.
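To make the modelling idea concrete, here is a minimal Python sketch of a Graded Response Model item and of what DIF looks like in it. It is not the authors' R/mirt analysis. All parameter values (`A_HYP`, `REF_THRESHOLDS`, `FOCAL_THRESHOLDS`) are hypothetical, the latent trait is oriented so that higher values mean more impairment (an assumption made for readability), and `signed_weighted_gap` is a simplified stand-in in the spirit of a signed weighted area between expected-score curves, not the paper's exact sWABC computation or sign convention.

```python
import math

# Hypothetical parameters for one 5-level item (scores 0..4, higher = more problems).
A_HYP = 1.8                               # discrimination, shared by both groups
REF_THRESHOLDS = [-0.5, 0.5, 1.5, 2.5]    # reference group (e.g. <80 years)
FOCAL_THRESHOLDS = [-0.9, 0.1, 1.1, 2.1]  # focal group: thresholds shifted lower (DIF)


def grm_category_probs(theta, a, thresholds):
    """Graded Response Model: P(X = k | theta) for k = 0..len(thresholds)."""
    # Cumulative curves P(X >= k); P(X >= 0) = 1 and P(X >= K + 1) = 0 by definition.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def expected_score(theta, a, thresholds):
    """Expected item score at a given latent trait value."""
    probs = grm_category_probs(theta, a, thresholds)
    return sum(k * p for k, p in enumerate(probs))


def normal_pdf(x):
    """Standard normal density, used as the weight over the latent trait."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def signed_weighted_gap(a, ref_thresholds, focal_thresholds,
                        lo=-4.0, hi=4.0, n=801):
    """Density-weighted signed gap (focal minus reference) between
    expected-score curves, approximated by a Riemann sum on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        theta = lo + i * step
        diff = (expected_score(theta, a, focal_thresholds)
                - expected_score(theta, a, ref_thresholds))
        total += diff * normal_pdf(theta) * step
    return total


if __name__ == "__main__":
    # Same latent trait value, different expected answers: that is DIF.
    print(expected_score(0.0, A_HYP, REF_THRESHOLDS))
    print(expected_score(0.0, A_HYP, FOCAL_THRESHOLDS))
    print(signed_weighted_gap(A_HYP, REF_THRESHOLDS, FOCAL_THRESHOLDS))
```

In this toy setup the two groups share the same latent trait value but differ in item thresholds, so the gap is driven purely by how the item behaves, which is the situation the structural adjustment described above is meant to separate from true group differences.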

3. Key Contributions

Rigorous DIF Framework: The study applies a sophisticated IRT-based DIF detection method (Wald-based sweep with structural adjustment) specifically tailored for short instruments (5 items) in a stroke RCT context, addressing the limitation of simpler methods that conflate group means with item bias.
Differentiation of Statistical vs. Practical Significance: The authors emphasize the critical distinction between statistically significant DIF (driven by large sample sizes) and practically meaningful DIF (measured by effect size), providing a nuanced interpretation of psychometric data.
Validation for Economic Evaluation: By showing that the detected DIF has little practical effect on scores, the study supports the use of EQ-5D-5L for calculating Quality-Adjusted Life Years (QALYs) in cost-effectiveness analyses for stroke interventions.

4. Key Results

Model Fit: The GRM demonstrated acceptable fit (CFI = 0.97, TLI = 0.93, SRMSR = 0.07). Local dependence was detected but was negligible in magnitude.
Omnibus DIF Testing:
- Age: Significant DIF detected ($\chi^2 = 86.9$, $p < 0.001$).
- Sex: No significant DIF ($\chi^2 = 31.7$, $p = 0.063$).
- Treatment: No significant DIF ($\chi^2 = 22.4$, $p = 0.379$).
Item-Level Findings (Age):
- Four items showed statistically significant DIF: Self-care, Usual activities, Pain/discomfort, and Anxiety/depression.
- Effect Sizes:
  - Moderate DIF: Self-care (sWABC = -0.46) and Usual activities (sWABC = -0.34). Older adults (≥80) reported greater difficulty on these items than younger adults with equivalent underlying HRQOL.
  - Negligible DIF: Pain/discomfort (sWABC = -0.002) and Anxiety/depression (sWABC = 0.09).
Impact on Scores:
- Correlation: Factor scores from DIF-adjusted and unadjusted models were highly correlated ($r = 0.98$).
- Mean Difference: The unadjusted model slightly overestimated impairment in older patients by a mean difference of 0.37 points (on a 0–20 scale).
- DTF: Signed DTF was -1.06 (5.3% of scale range), indicating a modest systematic bias favoring younger patients, but not large enough to invalidate group comparisons.
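The effect-size labels and the DTF percentage reported above can be checked with a few lines of arithmetic. The sketch below applies the |sWABC| cut-offs quoted in the methodology to the four reported values and recomputes the DTF as a share of the 0–20 scale. The function name `classify_swabc` is hypothetical, and treating values between 0.29 and 0.30 as "small" is an assumption, since the quoted bands leave that gap unspecified.

```python
def classify_swabc(value):
    """Label an sWABC value by magnitude using the quoted cut-offs."""
    magnitude = abs(value)
    if magnitude < 0.10:
        return "negligible"
    if magnitude < 0.30:   # assumption: the 0.29-0.30 gap is treated as "small"
        return "small"
    if magnitude < 0.50:
        return "moderate"
    return "large"


# Values as reported for the age comparison.
reported_swabc = {
    "Self-care": -0.46,
    "Usual activities": -0.34,
    "Pain/discomfort": -0.002,
    "Anxiety/depression": 0.09,
}
labels = {item: classify_swabc(v) for item, v in reported_swabc.items()}

# Signed DTF expressed as a percentage of the 0-20 scale range.
signed_dtf = -1.06
scale_range = 20
dtf_percent = abs(signed_dtf) / scale_range * 100

if __name__ == "__main__":
    for item, label in labels.items():
        print(f"{item}: {label}")
    print(f"DTF as share of scale range: {dtf_percent:.1f}%")
```

The labels agree with the text: two moderate items, two negligible ones, and a DTF of about 5.3% of the scale range.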

5. Significance and Conclusion

Clinical Trial Validity: The EQ-5D-5L functions equivalently across sex and treatment groups in acute ischemic stroke. Observed differences in HRQOL between alteplase and tenecteplase arms can be confidently attributed to true treatment effects rather than measurement bias.
Age Considerations: While statistically detectable age-related DIF exists in physical functioning items (self-care, usual activities), the practical impact is minimal. The high correlation between adjusted and unadjusted scores suggests that group-specific scoring adjustments are unnecessary for clinical trials or health economic evaluations.
Interpretation of DIF: The findings suggest that age-related differences in responses to physical items may reflect genuine shifts in patient expectations or adaptation to functional norms rather than instrument bias.
Recommendation: The EQ-5D-5L remains a valid, robust instrument for HRQOL assessment in heterogeneous stroke populations, supporting its continued use as a key secondary endpoint in future stroke trials without the need for complex DIF corrections.
