The false positive paradox: Examining real-world clinical predictive performance of FDA-authorized AI devices for radiology using clinical prevalence

This study analyzes FDA-authorized radiology AI devices to demonstrate how low disease prevalence creates a false positive paradox that undermines positive predictive value, arguing for the mandatory disclosure of false discovery and omission rates to guide clinically and ethically appropriate AI selection.

Sparnon, E., Stevens, K., Song, E., Harris, R. J., Strong, B. W., Bruno, M. A., Baird, G. L.

Published 2026-03-27

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: The "False Alarm" Trap

Imagine you buy a super-smart security system for your house. The company tells you it is 99% accurate. They say, "If a burglar is there, it will almost certainly catch them! And if no one is there, it will almost certainly stay quiet!"

You feel safe. But then, you move into a neighborhood where burglaries are incredibly rare—maybe only 1 in 1,000 houses gets broken into in a whole year.

Suddenly, your alarm starts going off every single day.

You rush outside, but there is no burglar. It was just a cat, a falling branch, or a gust of wind. You check your phone, and the app says, "Intruder Detected!" 365 times a year. Even though the system is "99% accurate," virtually all of those alarms are false, because real break-ins are so rare that even a tiny false-alarm rate swamps them.

This is the False Positive Paradox described in the paper. It happens when a test is very good at finding a disease, but the disease is so rare that the test ends up screaming "DANGER!" at healthy people far more often than it actually finds sick people.

What the Paper Actually Did

The authors looked at 38 different AI tools used by doctors to read X-rays, CT scans, and MRIs. These tools are authorized by the FDA (the US government agency that regulates medical devices).

The companies selling these tools brag about their Sensitivity (how good they are at catching the disease when it is there) and Specificity (how good they are at staying quiet when it is not). They say things like, "We are 95% accurate!"

The Problem:
The paper found that these companies often leave out the most important number: How common is the disease in the real world?

  • In the lab (The "Enriched" Test): The companies tested their AI on a curated pile of scans deliberately stacked with far more sick patients than a real clinic would ever see (like putting 50 burglars in a neighborhood of 100 houses just to test the alarm). In this enriched world, the AI looks amazing.
  • In the hospital (The Real World): The AI is now used on regular patients. Most people are healthy. The disease is rare.

Because the disease is rare, the AI's "high accuracy" numbers turn into a nightmare of False Positives.
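
To see how the same device performance plays out in both settings, here is a minimal sketch using Bayes' theorem; the 95% sensitivity and specificity and the two prevalence values are assumed for illustration, not taken from the paper:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: P(disease is really there | AI raises a flag)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Same assumed device performance in both settings.
sens, spec = 0.95, 0.95

# "Enriched" validation set: half the scans contain the finding.
print(f"Enriched study (50% prevalence): PPV = {ppv(sens, spec, 0.50):.0%}")  # ~95%

# Typical clinical population: the finding is rare.
print(f"Real-world use (1% prevalence):  PPV = {ppv(sens, spec, 0.01):.0%}")  # ~16%
```

With nothing changed but the prevalence, the same device goes from "almost every alarm is real" to "roughly five out of six alarms are false."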

The "Cry Wolf" Effect in Radiology

The paper explains that when an AI flags a healthy patient as "sick," it causes a chain reaction:

  1. The Doctor's Dilemma: The doctor sees the AI flag. Even if the doctor thinks, "That looks normal," they are terrified of missing a real case. If they ignore the AI and the patient does have the disease, the doctor could get sued.
  2. Defensive Medicine: To be safe, the doctor orders more tests, more scans, or sends the patient for a biopsy.
  3. The Cost:
    • For the Patient: Unnecessary anxiety, radiation exposure, and invasive procedures for something that wasn't there.
    • For the System: Wasted money and time.
    • For the Doctor: They start to lose trust in the AI because it's "crying wolf" too much.

The "Magic Math" Solution

The paper doesn't just complain; it offers a solution. It says doctors and hospitals don't need to wait for the AI companies to fix their reports. They can do the math themselves using a simple formula (Bayes' Theorem).

The Analogy:
Think of the AI's accuracy as a recipe.

  • The AI company gives you the ingredients (Sensitivity and Specificity).
  • But they forget to tell you the serving size (Prevalence).

If you bake a cake for 100 people using a recipe meant for 10, it's going to be a mess. The paper says: "Don't just look at the ingredients; look at the serving size!"

If a hospital knows that only 1% of their patients have a specific type of brain bleed, they can plug that number into the formula. They will instantly see that even with a "95% accurate" AI, roughly 7 or more out of every 10 alarms will be false, depending on the device's exact sensitivity and specificity.
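
A minimal sketch of that do-it-yourself calculation is below. The 95% sensitivity, 95% specificity, and 1% prevalence are assumed values for illustration; with those particular numbers the false-alarm fraction comes out even higher than 7 in 10:

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float) -> dict:
    """Turn study-style metrics into clinic-facing rates via Bayes' theorem."""
    tp = sensitivity * prevalence              # real cases the AI flags
    fn = (1 - sensitivity) * prevalence        # real cases the AI misses
    fp = (1 - specificity) * (1 - prevalence)  # healthy patients the AI flags
    tn = specificity * (1 - prevalence)        # healthy patients the AI clears
    ppv = tp / (tp + fp)   # P(disease | AI flags it)
    npv = tn / (tn + fn)   # P(healthy | AI stays quiet)
    return {
        "false discovery rate (false alarms among flags)": 1 - ppv,
        "false omission rate (missed cases among 'all clear')": 1 - npv,
    }

# Assumed example: a "95% accurate" device, brain bleed present in 1% of scans.
for name, value in predictive_values(0.95, 0.95, 0.01).items():
    print(f"{name}: {value:.2%}")
# false discovery rate: ~83.9%  -> most alarms are false
# false omission rate:  ~0.05%  -> a quiet result is almost always right
```

The asymmetry is the point: at low prevalence the AI is excellent at reassurance but poor at alarm, which is exactly what the sensitivity and specificity numbers alone never tell you.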

The Authors' Recommendations

The authors are asking the FDA and the AI companies to stop hiding the ball. They want:

  1. Honest Reporting: Companies must report how common the disease is in the test data. If they tested on a "super-sick" group, they need to say so.
  2. Real-World Math: Reports should show what the "False Alarm Rate" (False Discovery Rate) and the "Missed Case Rate" (False Omission Rate) would be in a normal hospital, not just in a lab.
  3. Choice: Doctors should be able to choose an AI setting that fits their needs. Sometimes you want to catch every possible disease (even if it means more false alarms). Other times, you want to avoid scaring healthy people. The data should let them choose.
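
As a rough illustration of that choice, here is a small sketch comparing two hypothetical operating points for the same model at the same assumed 1% prevalence; the sensitivity/specificity pairs are made up for illustration, not taken from any real device:

```python
def error_rates(sensitivity: float, specificity: float, prevalence: float):
    """Return (false discovery rate, false omission rate) at a given prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return 1 - tp / (tp + fp), 1 - tn / (tn + fn)

prevalence = 0.01  # assumed: 1 in 100 patients has the finding

# Two hypothetical settings for the same underlying model.
settings = {
    "catch-everything setting":   (0.99, 0.85),  # favors sensitivity
    "fewer-false-alarms setting": (0.85, 0.99),  # favors specificity
}

for label, (sens, spec) in settings.items():
    fdr, f_or = error_rates(sens, spec, prevalence)
    print(f"{label}: false alarms {fdr:.0%} of flags, missed cases {f_or:.2%} of 'all clear' calls")
```

Neither setting is "right"; the trade-off depends on how dangerous a missed case is versus how costly a false alarm is, which is why the authors want the underlying numbers disclosed so each hospital can decide for itself.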

The Bottom Line

Just because a medical AI says it is "99% accurate" doesn't mean it will work well in your local hospital. If the disease is rare, the AI's alarms will likely be wrong more often than they are right.

The takeaway: Don't trust the marketing slogan. Look at the math. If the disease is rare, expect a lot of false alarms, and make sure your hospital is ready to handle them without panicking.
