Fairboard: a quantitative framework for equity assessment of healthcare models

This paper introduces Fairboard, an open-source dashboard for equity assessment, and uses it to show that patient-specific clinical factors and neuroanatomical location, rather than model architecture, are the primary drivers of performance variance and bias across 18 brain tumor segmentation models. The finding underscores the need for formal fairness guarantees in medical AI.

James K. Ruffle, Samia Mohinta, Chris Foulon, Mohamad Zeina, Zicheng Wang, Sebastian Brandner, Harpreet Hyare, Parashkev Nachev

Published 2026-04-14

Imagine you have built a fleet of 18 different robot chefs. Their job is to look at a complex, messy kitchen (a patient's brain scan) and perfectly slice out the burnt, dangerous parts (the tumor) without touching the good ingredients.

For years, the only question people asked was: "How fast and accurately can these robots chop?" If a robot chopped correctly 95% of the time, it was considered a success.

But this paper asks a much more important question: "Does every robot chop equally well for every type of kitchen, or do some robots struggle when the kitchen belongs to a specific kind of person?"

The authors of this paper, led by Dr. James Ruffle, built a new tool called Fairboard to answer this. Here is the breakdown of their findings using simple analogies.

1. The Big Discovery: The "Chef" Matters Less Than the "Kitchen"

The researchers tested 18 different AI models (the robots) on 648 different patients (the kitchens). They found something surprising:

It didn't matter much which robot you used; it mattered whose kitchen it was working in.

Think of it like this: If you give a master chef and a novice chef the same difficult, sticky, weirdly shaped dough, they will both struggle. But if you give them a perfect, pre-shaped cookie, they will both succeed.

The study found that the patient's specific biology (their age, sex, the specific type of tumor, and how much of it had already been removed by a surgeon) explained far more of the variation in whether the AI succeeded or failed than the choice of AI model itself.

  • The Analogy: It's not that one car is better than another; it's that some cars struggle more on muddy roads than on paved ones. The "muddy road" here is the patient's specific medical condition.
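For readers who want to see what "the kitchen matters more than the chef" looks like statistically, here is a minimal sketch, assuming a table with one row per (model, patient) pair and illustrative column names (dice, model_id, age, sex, tumour_type, resection_status). This is not the paper's actual analysis code:

```python
# Minimal sketch: compare how much variance in segmentation quality (Dice)
# is explained by model choice versus patient-level factors.
# All column names and the file name are illustrative assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dice_scores.csv")  # hypothetical: 18 models x 648 patients

model_only = smf.ols("dice ~ C(model_id)", data=df).fit()
patient_only = smf.ols(
    "dice ~ age + C(sex) + C(tumour_type) + C(resection_status)", data=df
).fit()

print(f"R^2 from model identity alone:  {model_only.rsquared:.3f}")
print(f"R^2 from patient factors alone: {patient_only.rsquared:.3f}")
# The paper's headline finding corresponds to the second number
# being much larger than the first.
```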

2. The Four Ways They Checked for Fairness

The authors didn't just look at the final score. They looked at the problem from four different angles, like inspecting a diamond from every side (a code sketch of these checks follows the list):

  • The Group Check (Univariate): They asked, "Do the robots perform worse for older people than younger people?" or "Do they fail more often on women than men?"
    • Result: Yes, there were gaps. For example, the robots were generally worse at slicing tumors that had already been partially removed by a surgeon (like trying to finish a puzzle with half the pieces missing).
  • The Prediction Check (Multivariate): They asked, "If we look at a patient's entire profile (age, sex, tumor type), can we predict if the robot will mess up?"
    • Result: Yes. The patient's profile was a crystal ball for the robot's mistakes.
  • The Map Check (Spatial): They looked at where in the brain the robots made mistakes.
    • Result: The robots weren't failing randomly. They had "blind spots." For instance, they were consistently worse at seeing tumors in the right side of the brain compared to the left, or in specific deep areas. It's like a security camera that has a clear view of the front door but a blurry view of the back window.
  • The "Vibe" Check (Representational): This was the most complex part. They used a high-tech map (called UMAP) to plot patients based on everything about them (their DNA, their age, the shape of their tumor).
    • Result: They found that patients who were "different" in a complex, combined way (e.g., a young woman with a rare, low-grade tumor) formed a specific cluster where the robots consistently failed. It wasn't just one thing (like age) causing the failure; it was the unique combination of traits that the robots hadn't seen enough of during training.
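Here is the promised sketch of what three of the four checks might look like in code, assuming a per-patient results table with illustrative columns (dice, age, sex, tumour_type, resection_status). The spatial check is omitted because it needs the voxel-wise error masks themselves, and none of this is Fairboard's actual code:

```python
# Minimal sketch of the group, prediction, and representational checks.
# Column names, the failure threshold, and the file name are assumptions.
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import umap  # pip install umap-learn

df = pd.read_csv("per_patient_results.csv")  # hypothetical file

# 1. Group check (univariate): do Dice scores differ between subgroups?
male = df.loc[df.sex == "M", "dice"]
female = df.loc[df.sex == "F", "dice"]
stat, p = mannwhitneyu(male, female)
print(f"Male vs female Dice: U={stat:.0f}, p={p:.4f}")

# 2. Prediction check (multivariate): does the patient's whole profile
#    predict segmentation failure?
X = pd.get_dummies(df[["age", "sex", "tumour_type", "resection_status"]])
y = (df["dice"] < 0.8).astype(int)  # illustrative "failure" threshold
auc = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="roc_auc", cv=5
).mean()
print(f"Failure predictable from profile: AUC = {auc:.2f} (0.5 = chance)")

# 3. "Vibe" check (representational): embed patients with UMAP and look
#    for clusters where failures concentrate.
embedding = umap.UMAP(random_state=0).fit_transform(X)
# Plotting `embedding` coloured by `y` reveals whether failures form a
# distinct cluster, i.e. a combined-trait subgroup the models handle badly.
```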

3. The "Newer is Better" Myth

The team checked if the newest, fanciest robots (AI models from 2023) were fairer than the older ones (from 2018).

  • The Verdict: The newer models were slightly better and fairer, but none were perfect: not one of the 18 models could guarantee it wouldn't fail on a specific type of patient. There is no "magic bullet" AI yet that works well for everyone.

4. The Solution: Fairboard

The authors didn't just point out the problem; they built a tool to fix it. They released Fairboard, which is like a "Fairness Dashboard" for doctors and scientists.

  • No-Code: You don't need to be a computer programmer to use it. It's like a simple app where you upload your data.
  • The Function: It takes your AI model and your patient data, runs it through those four "checks" (Group, Prediction, Map, and Vibe), and gives you a report card (sketched in code after this list).
  • The Goal: It tells you, "Hey, your AI is great at detecting tumors in men, but it's blind to tumors in women," or "Your AI fails when the tumor is in the back of the brain."
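Fairboard itself is no-code, and its real interface may differ; but for the curious, here is a minimal sketch of the kind of "report card" such a dashboard produces. This is not Fairboard's actual API, and the column names are assumptions:

```python
# Minimal sketch of a fairness "report card": for each protected attribute,
# report the best- and worst-scoring subgroup and the gap between them.
# Not Fairboard's actual API; column names are assumptions.
import pandas as pd

def fairness_report(df: pd.DataFrame, score: str, attributes: list[str]) -> None:
    for attr in attributes:
        means = df.groupby(attr)[score].mean().sort_values()
        gap = means.iloc[-1] - means.iloc[0]
        print(
            f"{attr}: best = {means.index[-1]} ({means.iloc[-1]:.2f}), "
            f"worst = {means.index[0]} ({means.iloc[0]:.2f}), gap = {gap:.2f}"
        )

df = pd.read_csv("per_patient_results.csv")  # hypothetical file
fairness_report(
    df, score="dice",
    attributes=["sex", "tumour_type", "tumour_side", "resection_status"],
)
```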

Why This Matters

In the past, if an AI said, "I am 90% accurate," doctors assumed it was safe for everyone. This paper says, "Wait a minute. That 90% might be 99% for some people and 60% for others."
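To see how a single headline figure can hide that split, here is the arithmetic with illustrative proportions (not numbers from the paper):

```python
# Illustrative only: a ~90% overall accuracy can decompose into a
# well-served majority and a poorly served minority.
majority_share, majority_acc = 0.77, 0.99
minority_share, minority_acc = 0.23, 0.60
overall = majority_share * majority_acc + minority_share * minority_acc
print(f"Overall accuracy: {overall:.1%}")  # ~90%, masking the 60% subgroup
```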

If we deploy these tools without checking, we risk building a healthcare system where the AI works great for the "average" patient (often older, male, with common tumors) but fails the most vulnerable or unique patients.

The Takeaway:
The paper is a wake-up call. We can't just build smarter robots; we have to build robots that understand all the different kitchens they might enter. And thanks to Fairboard, we now have a simple way to check if our robots are truly fair before we let them into the operating room.
