Fairboard: a quantitative framework for equity assessment of healthcare models

This paper introduces Fairboard, an open-source dashboard for equity assessment, and uses it to show that patient-specific clinical factors and neuroanatomical location, rather than model architecture, are the primary drivers of performance variance and bias across 18 brain tumor segmentation models. The finding underscores the need for formal fairness guarantees in medical AI.

James K. Ruffle, Samia Mohinta, Chris Foulon, Mohamad Zeina, Zicheng Wang, Sebastian Brandner, Harpreet Hyare, Parashkev Nachev

Published 2026-04-14

Imagine you have built a fleet of 18 different robot chefs. Their job is to look at a complex, messy kitchen (a patient's brain scan) and perfectly slice out the burnt, dangerous parts (the tumor) without touching the good ingredients.

For years, the only question people asked was: "How fast and accurately can these robots chop?" If a robot chopped correctly 95% of the time, it was considered a success.

But this paper asks a much more important question: "Does every robot chop equally well for every type of kitchen, or do some robots struggle when the kitchen belongs to a specific kind of person?"

The authors of this paper, led by Dr. James Ruffle, built a new tool called Fairboard to answer this. Here is the breakdown of their findings using simple analogies.

1. The Big Discovery: The "Chef" Matters Less Than the "Kitchen"

The researchers tested 18 different AI models (the robots) on 648 different patients (the kitchens). They found something surprising:

It didn't matter much which robot you used; it mattered whose kitchen it was working in.

Think of it like this: If you give a master chef and a novice chef the same difficult, sticky, weirdly shaped dough, they will both struggle. But if you give them a perfect, pre-shaped cookie, they will both succeed.

The study found that the patient's specific biology (their age, sex, the specific type of tumor, and how much of it had already been removed by a surgeon) explained far more of the variation in whether the AI succeeded or failed than the choice of AI model itself.

  • The Analogy: It's not that one car is better than another; it's that some cars struggle more on muddy roads than on paved ones. The "muddy road" here is the patient's specific medical condition.
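For readers who want to see what "the kitchen matters more than the chef" looks like statistically, here is a minimal sketch, assuming a table with one row per (model, patient) pair and illustrative column names (dice, model_id, age, sex, tumour_type, resection_status). This is not the paper's actual analysis code:

```python
# Minimal sketch: compare how much variance in segmentation quality (Dice)
# is explained by model choice versus patient-level factors.
# All column names and the file name are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dice_scores.csv")  # hypothetical: 18 models x 648 patients

model_only = smf.ols("dice ~ C(model_id)", data=df).fit()
patient_only = smf.ols(
    "dice ~ age + C(sex) + C(tumour_type) + C(resection_status)", data=df
).fit()

print(f"R^2 from model identity alone:  {model_only.rsquared:.3f}")
print(f"R^2 from patient factors alone: {patient_only.rsquared:.3f}")
# The paper's headline finding corresponds to the second number
# being much larger than the first.
```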

2. The Four Ways They Checked for Fairness

The authors didn't just look at the final score. They looked at the problem from four different angles, like inspecting a diamond from every side (a code sketch of these checks follows the list):

  • The Group Check (Univariate): They asked, "Do the robots perform worse for older people than younger people?" or "Do they fail more often on women than men?"
    • Result: Yes, there were gaps. For example, the robots were generally worse at slicing tumors that had already been partially removed by a surgeon (like trying to finish a puzzle with half the pieces missing).
  • The Prediction Check (Multivariate): They asked, "If we look at a patient's entire profile (age, sex, tumor type), can we predict if the robot will mess up?"
    • Result: Yes. The patient's profile was a crystal ball for the robot's mistakes.
  • The Map Check (Spatial): They looked at where in the brain the robots made mistakes.
    • Result: The robots weren't failing randomly. They had "blind spots." For instance, they were consistently worse at seeing tumors in the right side of the brain compared to the left, or in specific deep areas. It's like a security camera that has a clear view of the front door but a blurry view of the back window.
  • The "Vibe" Check (Representational): This was the most complex part. They used a high-tech map (called UMAP) to plot patients based on everything about them (their DNA, their age, the shape of their tumor).
    • Result: They found that patients who were "different" in a complex, combined way (e.g., a young woman with a rare, low-grade tumor) formed a specific cluster where the robots consistently failed. It wasn't just one thing (like age) causing the failure; it was the unique combination of traits that the robots hadn't seen enough of during training.
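Here is the promised sketch of what three of the four checks might look like in code, assuming a per-patient results table with illustrative columns (dice, age, sex, tumour_type, resection_status). The spatial check is omitted because it needs the voxel-wise error masks themselves, and none of this is Fairboard's actual code:

```python
# Minimal sketch of the group, prediction, and representational checks.
# Column names, the failure threshold, and the file name are assumptions.
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import umap  # pip install umap-learn

df = pd.read_csv("per_patient_results.csv")  # hypothetical file

# 1. Group check (univariate): do Dice scores differ between subgroups?
male = df.loc[df.sex == "M", "dice"]
female = df.loc[df.sex == "F", "dice"]
stat, p = mannwhitneyu(male, female)
print(f"Male vs female Dice: U={stat:.0f}, p={p:.4f}")

# 2. Prediction check (multivariate): does the patient's whole profile
#    predict segmentation failure?
X = pd.get_dummies(df[["age", "sex", "tumour_type", "resection_status"]])
y = (df["dice"] < 0.8).astype(int)  # illustrative "failure" threshold
auc = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="roc_auc", cv=5
).mean()
print(f"Failure predictable from profile: AUC = {auc:.2f} (0.5 = chance)")

# 3. "Vibe" check (representational): embed patients with UMAP and look
#    for clusters where failures concentrate.
embedding = umap.UMAP(random_state=0).fit_transform(X)
# Plotting `embedding` coloured by `y` reveals whether failures form a
# distinct cluster, i.e. a combined-trait subgroup the models handle badly.
```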

3. The "Newer is Better" Myth

The team checked if the newest, fanciest robots (AI models from 2023) were fairer than the older ones (from 2018).

  • The Verdict: The newer models were slightly better and fairer, but none were perfect: not one of the 18 models could guarantee it wouldn't fail on a specific type of patient. There is no "magic bullet" AI yet that works well for everyone.

4. The Solution: Fairboard

The authors didn't just point out the problem; they built a tool to fix it. They released Fairboard, which is like a "Fairness Dashboard" for doctors and scientists.

  • No-Code: You don't need to be a computer programmer to use it. It's like a simple app where you upload your data.
  • The Function: It takes your AI model and your patient data, runs it through those four "checks" (Group, Prediction, Map, and Vibe), and gives you a report card (sketched in code after this list).
  • The Goal: It tells you, "Hey, your AI is great at detecting tumors in men, but it's blind to tumors in women," or "Your AI fails when the tumor is in the back of the brain."
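Fairboard itself is no-code, and its real interface may differ; but for the curious, here is a minimal sketch of the kind of "report card" such a dashboard produces. This is not Fairboard's actual API, and the column names are assumptions:

```python
# Minimal sketch of a fairness "report card": for each protected attribute,
# report the best- and worst-scoring subgroup and the gap between them.
# Not Fairboard's actual API; column names are assumptions.
import pandas as pd

def fairness_report(df: pd.DataFrame, score: str, attributes: list[str]) -> None:
    for attr in attributes:
        means = df.groupby(attr)[score].mean().sort_values()
        gap = means.iloc[-1] - means.iloc[0]
        print(
            f"{attr}: best = {means.index[-1]} ({means.iloc[-1]:.2f}), "
            f"worst = {means.index[0]} ({means.iloc[0]:.2f}), gap = {gap:.2f}"
        )

df = pd.read_csv("per_patient_results.csv")  # hypothetical file
fairness_report(
    df, score="dice",
    attributes=["sex", "tumour_type", "tumour_side", "resection_status"],
)
```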

Why This Matters

In the past, if an AI said, "I am 90% accurate," doctors assumed it was safe for everyone. This paper says, "Wait a minute. That 90% might be 99% for some people and 60% for others."
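To see how a single headline figure can hide that split, here is the arithmetic with illustrative proportions (not numbers from the paper):

```python
# Illustrative only: a ~90% overall accuracy can decompose into a
# well-served majority and a poorly served minority.
majority_share, majority_acc = 0.77, 0.99
minority_share, minority_acc = 0.23, 0.60
overall = majority_share * majority_acc + minority_share * minority_acc
print(f"Overall accuracy: {overall:.1%}")  # ~90%, masking the 60% subgroup
```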

If we deploy these tools without checking, we risk building a healthcare system where the AI works great for the "average" patient (often older, male, with common tumors) but fails the most vulnerable or unique patients.

The Takeaway:
The paper is a wake-up call. We can't just build smarter robots; we have to build robots that understand all the different kitchens they might enter. And thanks to Fairboard, we now have a simple way to check if our robots are truly fair before we let them into the operating room.
