Investigations of Heterogeneity in Diagnostic Test Accuracy Meta-Analysis: A Methodological Review

This methodological review of 100 diagnostic test accuracy meta-analyses published in 2024 reveals that while investigations of heterogeneity are common and more frequent with larger numbers of primary studies, they often suffer from limited data support for subgroups, unclear reporting of statistical models, and insufficient prespecification in protocols.

Lukas Mischinger, Angela Ernst, Bernhard Haller, Alexey Formenko, Zekeriya Aktuerk, Alexander Hapfelmeier


Imagine you are a detective trying to solve a mystery: "Does this new medical test actually work?"

To get the best answer, you don't just look at one police report; you gather hundreds of reports from different cities, different times, and different detectives. You combine them all into one big "Meta-Analysis" to get the most accurate picture possible.

But here's the catch: Not all police reports are the same. Some detectives work in rainy cities, others in sunny ones. Some use high-tech cameras, others use old flashlights. Sometimes the test works great for young people but poorly for the elderly. This difference is called Heterogeneity.

This paper is a "report on the reports." The authors looked at 100 recent studies that tried to combine medical test results. They wanted to see: Are the detectives (researchers) actually investigating why the results are different, or are they just guessing?

Here is the breakdown of their findings, using some everyday analogies:

1. The "Data Buffet" Problem

The researchers found that when a meta-analysis had a huge buffet of data (lots of primary studies), its authors were much more likely to start investigating the differences.

  • The Analogy: If you only have three apples, you probably won't try to sort them by color, size, and sweetness. But if you have 500 apples, you'll definitely start sorting them!
  • The Finding: Studies with more data were more likely to do the "sorting" (investigating heterogeneity). However, even with a big buffet, the "portions" were often too small: on average, each investigation rested on data from only about six primary studies. That's like trying to judge the quality of a whole orchard by tasting just six apples. It's a bit risky, as the toy simulation after this list shows.
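
To get a feel for why six is thin ice, here is a toy simulation (mine, not the paper's; the study size and the 85% "true" sensitivity are made-up numbers for illustration). Every "subgroup" has exactly the same underlying accuracy, yet pooling only six studies per subgroup still produces averages that drift apart by chance alone:

```python
import random
import statistics

random.seed(1)

TRUE_SENSITIVITY = 0.85   # same underlying accuracy in every subgroup (assumed)
PATIENTS_PER_STUDY = 50   # diseased patients per primary study (assumed)
STUDIES_PER_SUBGROUP = 6  # roughly the data support the review describes

def observed_sensitivity() -> float:
    """One simulated primary study: share of diseased patients the test detects."""
    detected = sum(random.random() < TRUE_SENSITIVITY for _ in range(PATIENTS_PER_STUDY))
    return detected / PATIENTS_PER_STUDY

# Pool six studies per "subgroup" and watch the averages scatter,
# even though no subgroup is truly different from any other.
for subgroup in "ABCD":
    pooled = statistics.mean(observed_sensitivity() for _ in range(STUDIES_PER_SUBGROUP))
    print(f"Subgroup {subgroup}: pooled sensitivity = {pooled:.2f}")
```

Run it a few times: the "subgroups" can easily land several percentage points apart, which is exactly the kind of gap a reader might mistake for a real difference.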

2. The "Recipe" Confusion (Statistical Models)

When combining these test results, researchers use complex mathematical "recipes" (statistical models) to make sure the numbers add up correctly.

  • The Analogy: Imagine everyone is trying to bake a cake. Some are using a fancy, modern oven that bakes the top and bottom evenly (the Bivariate Model, which pools sensitivity and specificity together). Others are using two separate toasters (Univariate Models that handle sensitivity and specificity separately), which might burn one side and leave the other raw.
  • The Finding: Most researchers used the fancy oven (64%), which is good! But a huge chunk (32%) were still using the two separate toasters. Even worse, many didn't even write down which recipe they used; they just said, "I used a computer program." That's like saying, "I baked a cake," without telling us whether it was a sponge cake or a brick. (The sketch after this list spells out the two recipes.)
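
For readers who want to peek inside the ovens: the "fancy oven" is the standard bivariate random-effects model, which pools sensitivity and specificity together and keeps track of how they trade off against each other across studies. A compact sketch in the usual notation (mine, not the paper's) looks like this:

```latex
% Sketch of the bivariate random-effects model: for each primary study i,
% the true logit-sensitivity and logit-specificity are drawn jointly,
% so their correlation is modelled instead of ignored.
\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu_{Se} \\ \mu_{Sp} \end{pmatrix},
  \begin{pmatrix} \sigma_{Se}^{2} & \sigma_{SeSp} \\ \sigma_{SeSp} & \sigma_{Sp}^{2} \end{pmatrix}
\right)
```

The "two toasters" amount to fitting each row of that pair on its own, which drops the off-diagonal term, i.e. the link between sensitivity and specificity, and that is why the two recipes can give different answers.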

3. The "Fishing Expedition" (Too Many Questions)

This is the most critical warning in the paper. When researchers look for differences, they can start asking too many questions.

  • The Analogy: Imagine you are fishing in a lake. If you cast your net once, you might catch a fish. If you cast your net 100 times in different spots, you are guaranteed to catch something, even if it's just a boot or a soda can. You might think, "Wow, I found a boot! That's a pattern!" But it's just a fluke.
  • The Finding: The researchers found that the studies reporting "significant" differences (like the boot) were the ones that asked the most questions. They were fishing so hard that they inevitably turned up something that looked interesting but was really just random noise. This is called a spurious finding. (The quick simulation below shows how fast the odds stack up.)
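
How fast do those odds stack up? A quick back-of-the-envelope simulation (again mine, not the paper's): even if none of the questions has a real answer, every extra question is another cast of the net at the usual 5% significance level.

```python
import random

random.seed(7)

ALPHA = 0.05      # conventional threshold for calling a result "significant"
TRIALS = 20_000   # simulated meta-analyses with NO real subgroup differences

def chance_of_false_catch(questions: int) -> float:
    """Chance that at least one of `questions` null tests looks "significant".

    Under the null hypothesis a p-value is uniform on [0, 1], so each
    question independently dips below ALPHA with probability ALPHA.
    """
    hits = sum(
        any(random.random() < ALPHA for _ in range(questions))
        for _ in range(TRIALS)
    )
    return hits / TRIALS

for questions in (1, 5, 10, 20):
    simulated = chance_of_false_catch(questions)
    exact = 1 - (1 - ALPHA) ** questions
    print(f"{questions:>2} questions: ~{simulated:.0%} chance of a spurious catch (exact: {exact:.0%})")
```

Ask one question and you are fooled about 5% of the time; ask twenty and you haul up a "boot" roughly two times out of three.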

4. The "Plan vs. Panic" (Prespecification)

Good science requires a plan. You should decide what you are going to investigate before you start looking.

  • The Analogy: A Prespecified investigation is like a chef who decides, "I will taste the soup for salt," before the soup is even cooked. A Post-hoc investigation is like tasting the soup, realizing it's bland, and then saying, "Oh, I was planning to check the salt all along!"
  • The Finding: Only 44% of the studies had a real plan (prespecification). The rest were "panicking" after seeing the results and deciding to investigate whatever looked interesting. This is dangerous because it leads to false conclusions.

The Bottom Line

The authors conclude that while researchers are trying to understand why medical tests work differently in different groups, they are often flying blind.

  • They are often looking at too few data points to be sure.
  • They are using the wrong "recipes" (math models) too often.
  • They are fishing for patterns without a plan, which leads to false alarms.

The Recommendation:
To fix this, researchers need to:

  1. Write a plan first: Decide what you are investigating before you look at the data.
  2. Be honest about the math: Clearly state which "recipe" (model) you used.
  3. Don't over-fish: Don't ask 100 questions just to find one answer. Stick to the important ones.

If they do this, we can trust that when a doctor says, "This test works for Group A but not Group B," it's a real discovery, not just a lucky guess.