This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are a doctor trying to decide who needs a specific medical test. You have a new, high-tech computer program (an AI) to help you make these decisions. The goal is to be fair to everyone.
But here is the twist: The definition of "fair" that the computer scientists are using might actually hurt the people who need help the most.
This paper, written by researcher Hayden Farquhar, uses a real-world example—HIV testing—to show why blindly following a popular rule called "Demographic Parity" can be dangerous in healthcare.
Here is the story in simple terms, using some analogies to make it clear.
1. The Setting: A Flooded City
Imagine two neighborhoods in a city:
- Neighborhood A (The High-Risk Zone): This area is currently flooded. 60% of the houses are underwater.
- Neighborhood B (The Safe Zone): This area is dry. Only 10% of the houses are flooded.
You have a fleet of rescue boats (the AI model) and a limited amount of fuel. Your job is to send boats to save people.
2. The "Fair" Rule That Goes Wrong
The computer scientists say: "To be fair, we must send the exact same number of rescue boats to Neighborhood A and Neighborhood B."
They call this Demographic Parity. It sounds fair on paper: "Everyone gets the same amount of attention."
But here is the problem:
- If you send 50 boats to the flooded neighborhood, you might save 40 people.
- If you send 50 boats to the dry neighborhood, you might only find 5 people who actually need saving (because most houses are fine).
By forcing the computer to send the same number of boats to both places, you are wasting fuel in the dry neighborhood and leaving people drowning in the flooded one. You are trying to make the numbers look equal, but you are ignoring the reality of the situation.
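If you want to see the arithmetic behind the analogy, here is a tiny Python sketch using the made-up numbers from this section (60% vs. 10% of houses flooded, a fixed fleet of boats). It is purely illustrative and is not based on the paper's data or code:

```python
# Toy arithmetic for the analogy above (invented numbers, not from the paper).
houses_per_neighborhood = 1000
flooded = {"A": 600, "B": 100}   # 60% of A is under water, 10% of B
total_boats = 600                # a limited fleet; each boat reaches one house

def expected_rescues(boats, flooded_houses):
    # If boats are spread over houses at random, rescues scale with the flood rate.
    return boats * flooded_houses / houses_per_neighborhood

# Demographic parity: send the identical number of boats to each neighborhood.
parity_rescues = sum(expected_rescues(total_boats // 2, f) for f in flooded.values())

# Need-based allocation: split the fleet in proportion to flooded houses.
boats_a = round(total_boats * flooded["A"] / (flooded["A"] + flooded["B"]))
need_rescues = (expected_rescues(boats_a, flooded["A"])
                + expected_rescues(total_boats - boats_a, flooded["B"]))

print(f"Expected rescues, equal boats per neighborhood:  {parity_rescues:.0f}")  # ~210
print(f"Expected rescues, boats sent where the flood is: {need_rescues:.0f}")    # ~317
```

The equal split satisfies demographic parity by construction, yet it rescues far fewer people than simply sending boats where the flooding is.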
3. The Real-World Example: HIV Testing
In this study, the "Flood" is the HIV virus.
- Black and Hispanic communities in the US have a much higher rate of HIV (the "flood").
- White and Asian communities have a much lower rate (the "dry" area).
The AI was trained to predict who has been tested for HIV. Naturally, the AI learned that people in high-risk areas are more likely to have been tested (because public health campaigns target them). So, the AI recommended testing more often for Black and Hispanic people.
The "Fairness" Fix:
When researchers tried to make the AI "fair" using Demographic Parity, they forced the AI to recommend testing at the same rate for everyone, regardless of risk.
The Result:
- The AI stopped recommending tests for the high-risk groups (because it had to lower its numbers to match the low-risk groups).
- The AI started recommending tests for low-risk groups (because it had to raise its numbers to match the high-risk groups).
The Cost:
The study found that enforcing this version of "fairness" caused the model to miss 1,610 additional people in the test group who actually needed screening. It was like stopping the rescue boats in the flooded neighborhood just to make sure the dry neighborhood got the same number of boats.
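The 1,610 figure comes from the paper's own experiments; the sketch below only illustrates the mechanism with invented numbers. It takes a toy risk score, post-processes it so both groups are screened at the same rate, and then counts how many people who needed screening are missed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative synthetic data, NOT the paper's dataset: the "high" group has a
# much higher true prevalence than the "low" group.
n = 10_000
group = rng.choice(["high", "low"], size=n)
prevalence = np.where(group == "high", 0.20, 0.04)
needs_test = rng.random(n) < prevalence

# An idealized risk score that ranks everyone who needs a test above everyone
# who does not, so any missed cases below come purely from the parity constraint.
score = 0.7 * needs_test + 0.3 * rng.random(n)

def missed(selected):
    """People who needed screening but were not selected, per group."""
    return {g: int(((~selected) & needs_test & (group == g)).sum())
            for g in ("high", "low")}

# Risk-based policy: one threshold for everyone -> more tests where risk is higher.
risk_based = score > 0.5

# Demographic-parity policy: per-group thresholds tuned so both groups are
# screened at the same overall rate as the risk-based policy.
target_rate = risk_based.mean()
parity = np.zeros(n, dtype=bool)
for g in ("high", "low"):
    mask = group == g
    cutoff = np.quantile(score[mask], 1 - target_rate)
    parity[mask] = score[mask] > cutoff

print("Missed cases, risk-based:", missed(risk_based))
print("Missed cases, parity:    ", missed(parity))
```

Under the parity constraint, the missed cases pile up almost entirely in the high-prevalence group, precisely because its screening rate had to be pushed down to match the low-prevalence group.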
4. The "Race-Blind" Trap
The researchers tried a different trick: they told the computer, "Don't look at race at all. Just ignore it."
They thought this would fix the problem. But it didn't work well.
- Analogy: Imagine you tell a firefighter, "Don't look at the color of the building." But the firefighter can still see that the building is made of wood (a proxy for fire risk) and is located next to a gas station.
- Even without knowing the "race" variable, the AI saw other clues like income, where people lived, and access to healthcare. These clues are linked to race because of historical inequality. So, the AI still ended up treating groups differently, just in a more confusing way.
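A quick way to convince yourself that "race-blind" is not "race-free" is to check whether group membership can be predicted from the remaining features. The sketch below uses synthetic, deliberately exaggerated stand-ins (an income variable and a neighborhood variable), not the paper's actual covariates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: group membership (0/1) shifts the distribution of "proxy"
# features, standing in for the historical correlations described above.
group = rng.integers(0, 2, size=n)
income = rng.normal(loc=40 + 20 * group, scale=10, size=n)
neighborhood_risk = rng.normal(loc=0.6 - 0.3 * group, scale=0.15, size=n)
X = np.column_stack([income, neighborhood_risk])   # note: group is NOT a column

X_tr, X_te, g_tr, g_te = train_test_split(X, group, random_state=0)

# A "blind" model can still reconstruct the group from its proxies.
clf = LogisticRegression(max_iter=1000).fit(X_tr, g_tr)
print("Accuracy predicting group from proxies:", clf.score(X_te, g_te))
```

If a simple classifier can recover the group from the other columns, then dropping the race column cannot, by itself, stop the model from treating the groups differently.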
5. The Better Way: "Equalized Odds"
So, what is the right way to be fair? The paper suggests Equalized Odds.
- The Analogy: Instead of sending the same number of boats, you ensure that the boats are equally good at finding people who are drowning, no matter which neighborhood they are in.
- If 100 people in Neighborhood A are drowning, the boat should find 90 of them.
- If 100 people in Neighborhood B are drowning, the boat should also find 90 of them.
This allows you to send more boats to the flooded neighborhood (because there are more people there) while ensuring the boat is just as accurate in both places. This is fairness based on need, not just equal numbers.
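In code, equalized odds is a statement about error rates within each group, not about how many people each group gets selected. A minimal check might look like the following (made-up labels and predictions; libraries such as fairlearn ship ready-made versions of these metrics):

```python
import numpy as np

def true_positive_rate(y_true, y_pred):
    """Share of people who truly need help that the model actually flags."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    return (y_true & y_pred).sum() / y_true.sum()

def false_positive_rate(y_true, y_pred):
    """Share of people who do NOT need help that the model flags anyway."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    return (~y_true & y_pred).sum() / (~y_true).sum()

def equalized_odds_gaps(y_true, y_pred, groups):
    """Largest between-group gap in TPR and in FPR (0 = perfectly equalized)."""
    groups = np.asarray(groups)
    tprs, fprs = [], []
    for g in np.unique(groups):
        m = groups == g
        tprs.append(true_positive_rate(y_true[m], y_pred[m]))
        fprs.append(false_positive_rate(y_true[m], y_pred[m]))
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Tiny made-up example: group A has more people in need and is flagged at twice
# the rate of group B, yet both gaps below come out to zero.
y_true = np.array([1, 1, 1, 1, 0, 0,  1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0,  1, 0, 0, 0, 0, 0])
groups = np.array(["A"] * 6 + ["B"] * 6)
print(equalized_odds_gaps(y_true, y_pred, groups))  # (0.0, 0.0)
```

In this toy example, demographic parity would call the model unfair (group A is selected at twice the rate of group B), while equalized odds is satisfied because the model finds the same share of truly at-risk people in both groups.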
6. The "Intersectional" Surprise
The study also found a tricky side effect. When they fixed the fairness for Race, they accidentally made it unfair for Gender.
- Analogy: Imagine you fix the water level for the whole city, but in doing so, you accidentally flood the basement of a specific apartment building where women live.
- By focusing only on race, the AI started treating men and women differently in new, unfair ways. To fix this, you have to look at the intersection of both (Race + Gender) at the same time, which is very hard to do.
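Checking the intersection is conceptually simple, even if satisfying it is hard: compute the metric per subgroup instead of per group. The miniature, invented example below shows how an equal true-positive rate by race can hide opposite gaps by gender:

```python
import pandas as pd

# Made-up miniature example (not the paper's data). Everyone here truly needs a
# test, so the true-positive rate is simply the rate at which people are flagged.
df = pd.DataFrame({
    "race":       ["A"] * 8 + ["B"] * 8,
    "gender":     (["F"] * 4 + ["M"] * 4) * 2,
    "needs_test": [1] * 16,
    "flagged":    [1,1,1,1, 0,0,1,1, 0,0,1,1, 1,1,1,1],
})

tpr = lambda g: g["flagged"].sum() / g["needs_test"].sum()

print(df.groupby("race").apply(tpr))               # looks equal: 0.75 and 0.75
print(df.groupby(["race", "gender"]).apply(tpr))   # hides gaps of 1.0 vs. 0.5
```

The race-level numbers match exactly, but splitting by race and gender together reveals that the gap runs in opposite directions for the two races, which is the kind of hidden unfairness the study warns about.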
The Big Takeaway
The main lesson of this paper is: In healthcare, "Fair" does not always mean "Equal Numbers."
- In a bank: If a loan algorithm approves loans for 50% of Group A and 50% of Group B, that can reasonably be called fair.
- In a hospital: If a disease is 5 times more common in Group A, a fair system should recommend treatment 5 times more often for Group A.
If we force the hospital system to treat everyone exactly the same (Demographic Parity), we end up ignoring the people who are actually sick.
The Conclusion:
We need to stop using "Demographic Parity" as a default rule for medical AI. Instead, doctors and communities need to sit down and decide: "What does fairness actually look like for this specific disease?" Usually, it means making sure the AI is accurate for everyone, not that it treats everyone exactly the same.