Integrating Group and Individual Fairness Auditing in Clinical AI: A Post-Hoc, Model-Agnostic Approach

This paper introduces EquiLense, a practical, post-hoc, and model-agnostic auditing tool that bridges the gap between group and individual fairness assessments in clinical AI by utilizing a novel metric called Mean Predicted Probability Difference (MPPD) to identify systematic prediction inconsistencies across demographic groups.

Original authors: Xu, J., Hwang, Y. M., Kondareddy, S., Dormoy, I., Jing, S. L., Pillai, M., Curtin, C. M., Hernandez-Boussard, T.

Published 2026-04-30

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a very smart, automated assistant that helps doctors predict how a patient might do after surgery. This assistant is great at its job overall, but there's a nagging worry: Is it treating everyone fairly?

Sometimes, these assistants might be unfair in two different ways:

  1. Group Unfairness: It consistently gives worse predictions for one entire group of people (like a specific race or gender) compared to another.
  2. Individual Unfairness: It treats two patients who are medically identical (same age, same health issues, same surgery) differently just because they belong to different groups.

The problem is that most tools used to check for fairness only look at one of these angles. They might check if Group A gets worse scores than Group B, but they miss the fact that two specific, identical patients are being treated differently. Or they check if identical patients are treated the same, but miss the bigger picture of systemic bias against a whole group.

Enter "EquiLense": The Fairness Glasses

The authors of this paper created a new tool called EquiLense. Think of it as a pair of "fairness glasses" that a doctor or developer can put on after the AI model is already built and working. You don't have to rebuild the engine; you just look through the glasses to see what's really happening.

EquiLense does three main things to give a complete picture:

  1. The Group Check: It looks at the big picture to see if certain demographic groups are getting systematically worse predictions than others.
  2. The Individual Check: It finds pairs of patients who are medically twins (same age, same health history) and checks if the AI gives them the same prediction. If it gives one a "high risk" score and the other a "low risk" score just because of their race or insurance, that's a red flag.
  3. The "Mean Predicted Probability Difference" (MPPD): This is the paper's secret sauce. It's a new way of measuring the gap between those "medical twins."
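To make the Group Check concrete, here is a minimal sketch in Python. It assumes the group check boils down to comparing the average predicted risk of each group against a reference group. The summary above does not say which group-level metric EquiLense actually uses, so the function names (`group_mean_predictions`, `group_gaps`) and the toy numbers are purely illustrative.

```python
from collections import defaultdict


def group_mean_predictions(records):
    """Average predicted probability per demographic group.

    `records` is a list of (group_label, predicted_probability) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, prob in records:
        totals[group] += prob
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


def group_gaps(records, reference):
    """Difference between each group's mean prediction and the reference group's."""
    means = group_mean_predictions(records)
    return {
        group: mean - means[reference]
        for group, mean in means.items()
        if group != reference
    }


# Toy predictions (hypothetical, not from the paper).
toy_records = [("A", 0.20), ("A", 0.30), ("B", 0.40), ("B", 0.50)]
gaps = group_gaps(toy_records, reference="A")
print({group: round(gap, 3) for group, gap in gaps.items()})
```

A positive gap here only says that one group receives higher scores on average. It does not say whether the patients in the two groups were medically comparable, which is the question the Individual Check and MPPD are meant to answer.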

Here is a simple analogy for MPPD:
Imagine you are a judge sentencing two people who committed the exact same crime with the exact same history.

  • Fairness: Both get 5 years.
  • Unfairness: One gets 5 years, and the other gets 10 years just because they are from a different neighborhood.

MPPD is like a ruler that measures exactly how much extra time the second person got compared to the first, on average, across the whole courtroom. It quantifies the "unfair gap" between people who should be treated the same.
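In code, the "ruler" idea might look like the sketch below. Two assumptions are baked in, and neither is confirmed by the summary above: "medical twins" are found by exact matching on a few clinical fields, and the metric is the signed average of (comparison group prediction minus reference group prediction) over all matched pairs. The paper's actual matching procedure and formula may differ. The field names and numbers are hypothetical.

```python
from collections import defaultdict
from itertools import product


def mean_predicted_probability_difference(patients, reference, comparison, match_keys):
    """Average prediction gap between matched patients from two groups.

    Assumption: two patients are "twins" when every field in `match_keys`
    is exactly equal. Returns None when no matched pairs exist.
    """
    # Bucket predictions by clinical profile, split by group.
    buckets = defaultdict(lambda: {reference: [], comparison: []})
    for patient in patients:
        if patient["group"] in (reference, comparison):
            profile = tuple(patient[key] for key in match_keys)
            buckets[profile][patient["group"]].append(patient["prob"])

    # Every reference/comparison pair that shares a profile contributes one difference.
    diffs = [
        comp_prob - ref_prob
        for bucket in buckets.values()
        for ref_prob, comp_prob in product(bucket[reference], bucket[comparison])
    ]
    return sum(diffs) / len(diffs) if diffs else None


# Toy cohort (hypothetical values, not from the paper).
patients = [
    {"group": "X", "age": 60, "diabetes": 1, "prob": 0.30},
    {"group": "Y", "age": 60, "diabetes": 1, "prob": 0.40},
    {"group": "X", "age": 45, "diabetes": 0, "prob": 0.10},
    {"group": "Y", "age": 45, "diabetes": 0, "prob": 0.20},
    {"group": "Y", "age": 70, "diabetes": 1, "prob": 0.90},  # no twin in group X, ignored
]
gap = mean_predicted_probability_difference(
    patients, reference="X", comparison="Y", match_keys=("age", "diabetes")
)
print(round(gap, 3))
```

The last patient has a very high score but no twin in the other group, so it does not affect the result. That is the point of the individual-level view: only comparisons between like patients count.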

What Did They Find?

The team tested EquiLense on real hospital data involving over 59,000 surgical patients. They looked at models predicting two things: delirium (confusion after surgery) and readmission (coming back to the hospital within 30 days).

  • The Surprise: The AI models were actually quite good at predicting outcomes overall (they were accurate). However, when they put on the EquiLense glasses, they found that the models were still treating "medical twins" differently based on race.
  • The Specific Example: Asian patients who were medically matched to White patients received systematically different predictions than their White counterparts. The "gap" in their scores was measurable and significant.
  • The Fix Test: They tried a simple experiment: they had the model ignore race and insurance type when making its predictions. When they did this, the "unfair gap" (the MPPD score) shrank significantly. This suggests that removing those specific inputs from the model's "brain" made it treat similar patients more equally, without making the model worse at its job.
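The Fix Test can be mimicked with a toy before-and-after comparison. The sketch below is not the authors' experiment: it uses two hand-written logistic scoring functions with made-up weights, one that sees a group indicator and one that does not, and measures the average gap between twins under each. In a real audit the model would be retrained without the sensitive inputs rather than edited by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights chosen for illustration only.
def score_with_group(age, comorbidities, in_group_b):
    """Toy risk model that uses a group indicator as an input."""
    return sigmoid(-4.0 + 0.04 * age + 0.5 * comorbidities + 0.6 * in_group_b)


def score_without_group(age, comorbidities):
    """Same toy model with the group indicator removed."""
    return sigmoid(-4.0 + 0.04 * age + 0.5 * comorbidities)


# Each clinical profile stands for a pair of "twins", one in group A and one in group B.
twin_profiles = [(50, 0), (65, 2), (80, 3)]


def mean_twin_gap(score_a, score_b):
    """Average (group B score - group A score) over identical clinical profiles."""
    diffs = [score_b(age, c) - score_a(age, c) for age, c in twin_profiles]
    return sum(diffs) / len(diffs)


gap_before = mean_twin_gap(
    lambda age, c: score_with_group(age, c, 0),
    lambda age, c: score_with_group(age, c, 1),
)
gap_after = mean_twin_gap(score_without_group, score_without_group)

print(f"gap with group input:    {gap_before:.3f}")
print(f"gap without group input: {gap_after:.3f}")
```

In this toy setup the gap drops to exactly zero because the twins are identical on every remaining input. With real data, matching is usually approximate and other variables can act as stand-ins for race or insurance, so one would expect the gap to shrink rather than vanish.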

Did It Work on Other Problems?

To make sure their new ruler (MPPD) actually worked, they tested it on two famous, non-medical datasets where bias was already known to exist:

  1. COMPAS: A tool used to predict whether defendants will re-offend. (This tool is widely reported to be biased against Black defendants.)
  2. UCI Adult Income: A dataset used to predict whether someone earns over $50k a year. (This dataset is known to reflect historical gender bias.)

The Result: EquiLense's MPPD metric flagged the same groups already known to be treated unfairly (Black defendants in the COMPAS data and women in the income data). This supports the claim that the metric picks up real, previously documented bias.

Why Does This Matter?

The paper argues that we need a tool that doesn't require us to throw away our current AI models and start over (which is expensive and hard). Instead, we need a way to audit them after they are built.

EquiLense is like a quality control inspector for AI in healthcare. It doesn't fix the machine for you, but it gives you a clear, easy-to-understand report card that says: "Hey, your machine is good at math, but it's treating these two identical patients differently just because of their background."

This allows doctors and developers to make informed choices, like deciding whether to remove certain inputs (like race) from the model to make it fairer, without needing to be math wizards or rebuild the entire system from scratch.
