FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI

The paper introduces FunnyNodules, a fully parameterized synthetic dataset of lung nodule-like shapes with controllable visual attributes and known decision rules, designed to systematically evaluate and benchmark explainable AI models by verifying whether they learn correct attribute-target relations and align their attention with relevant diagnostic features.

Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz

Published 2026-03-09

Imagine you are trying to teach a robot how to be a doctor. You show it thousands of X-rays of lung nodules (little lumps in the lung) and tell it, "This one is dangerous, this one is safe." The robot learns to get the diagnosis right. But here's the problem: Did it learn the right reasons?

Maybe the robot is just guessing based on the background noise in the image, or maybe it's looking at the wrong part of the lung. In the real world, we often can't tell why the robot made a mistake because we don't have a "teacher's answer key" that explains exactly which visual features (like roundness or sharp edges) led to the diagnosis.

This is where the paper introduces FunnyNodules.

The "Lego" Analogy: Building a Perfect Test

Think of real medical data like a messy, chaotic pile of rocks. Every rock is different, some are hidden, and we don't know exactly why a geologist picked one over another.

FunnyNodules is like a Lego set for medical AI. Instead of messy rocks, the researchers built a factory that creates perfect, synthetic lung nodules out of digital "Lego bricks."

Here is how it works:

  1. The Bricks (Attributes): The factory has specific knobs the researchers can turn to change the nodule's look. They can make it:
    • More or less round.
    • Have spiky edges or smooth edges.
    • Be big or small.
    • Be dark or bright.
    • Have a texture inside or be empty.
  2. The Rulebook (The Logic): The researchers write a strict rulebook. For example: "If the nodule is spiky AND has a texture inside, it is dangerous (Target 5). If it is round and smooth, it is safe (Target 1)."
  3. The Perfect Answer Key: Because the computer built the image using these rules, it knows exactly why every single image is labeled the way it is. It has a perfect "ground truth."
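The three steps above can be sketched in a few lines of Python. The attribute names, value ranges, and the decision rule below are illustrative stand-ins, not the paper's exact parameterization; the point is only that the label is computed deterministically from the attributes, so the "answer key" is known by construction.

```python
import random

# Illustrative attribute set on a 0..1 scale (not the paper's exact list).
ATTRIBUTES = ["roundness", "spikiness", "size", "brightness", "texture"]

def sample_attributes(rng):
    """Draw each visual attribute independently on a 0..1 scale."""
    return {name: rng.random() for name in ATTRIBUTES}

def rule_based_target(attrs):
    """Toy rulebook: spiky AND textured -> dangerous (5),
    round AND smooth -> safe (1), everything else -> intermediate (3)."""
    if attrs["spikiness"] > 0.5 and attrs["texture"] > 0.5:
        return 5
    if attrs["roundness"] > 0.5 and attrs["spikiness"] <= 0.5:
        return 1
    return 3

# Because the label is a pure function of the attributes, every generated
# image comes with a perfect explanation of WHY it got its label.
rng = random.Random(0)
sample = sample_attributes(rng)
label = rule_based_target(sample)
```

A renderer would then turn the attribute dictionary into an actual image; the key design choice is that the label never depends on anything the generator did not explicitly control.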

Why is this "Funny"?

The name "FunnyNodules" comes from the idea that these aren't real, scary tumors. They are abstract, cartoon-like shapes. They are "funny" because they are silly, made-up drawings, but they are incredibly useful for testing.

What Can We Do With This Lego Set?

The paper shows three main ways scientists can use this to test AI:

1. The "What If?" Game (Testing Reasoning)

In the real world, you can't easily change just one thing about a patient's lung. But with FunnyNodules, you can.

  • The Test: You show the AI a nodule. Then, you say, "Okay, keep everything the same, but make it spikier."
  • The Goal: Does the AI change its mind? If the rulebook says "spiky = dangerous," the AI should say, "Oh, now it's dangerous!"
  • The Result: If the AI ignores the spikes and keeps saying it's safe, we know the AI is "cheating" or looking at the wrong clues. It's like a student who memorized the answer key but didn't learn the math.
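This counterfactual game can be expressed as a small check. The `toy_rule` and `cheating_model` below are invented for illustration (the paper's rules and models differ); the pattern is the interesting part: edit one attribute, hold the rest fixed, and ask whether the model's prediction flips exactly when the ground-truth rule says it should.

```python
def toy_rule(attrs):
    """Ground-truth rulebook: spiky nodules are dangerous."""
    return "dangerous" if attrs["spikiness"] > 0.5 else "safe"

def cheating_model(attrs):
    """A shortcut-learning model that ignores spikiness entirely
    and only looks at brightness (the 'wrong clue')."""
    return "dangerous" if attrs["brightness"] > 0.5 else "safe"

def counterfactual_check(model, rule, attrs, attribute, new_value):
    """Edit one attribute, keep everything else fixed, and test whether
    the model's prediction flips exactly when the rule says it should."""
    edited = {**attrs, attribute: new_value}
    should_flip = rule(edited) != rule(attrs)
    did_flip = model(edited) != model(attrs)
    return should_flip == did_flip

base = {"spikiness": 0.1, "brightness": 0.2}
# Making the nodule spikier should change the diagnosis, but the
# cheating model never notices -> the check fails, exposing the shortcut.
passed = counterfactual_check(cheating_model, toy_rule, base, "spikiness", 0.9)
```

Running many such single-attribute edits over the whole dataset gives a systematic picture of which attributes the model actually reacts to.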

2. The "Trust Score" (Checking if the AI is Honest)

Sometimes an AI gets the right answer for the wrong reason.

  • The Analogy: Imagine a student taking a math test. They get the answer "42" correct. But when you ask them how they got it, they say, "I guessed."
  • The Test: The researchers measure two things:
    1. How good is the AI at spotting the features (e.g., "Is it spiky?")?
    2. How good is the AI at making the final diagnosis?
  • The Result: If the AI is great at spotting features but terrible at diagnosing, it's confused. If it's great at diagnosing but terrible at spotting features, it's just guessing. The paper creates a "Trust Index" to tell you if you can trust the AI's logic.
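One way to turn those two measurements into a single number is sketched below. The paper defines its own index; this harmonic-mean version is just an illustrative stand-in that captures the key property: the score is high only when the model is good at both spotting the features and making the diagnosis, and collapses toward zero if either one fails.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the ground truth."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def trust_index(attr_acc, diag_acc):
    """Illustrative trust index (harmonic mean of the two accuracies).
    High only when the model both spots the features AND diagnoses well;
    a model that guesses diagnoses without seeing features scores low."""
    if attr_acc + diag_acc == 0:
        return 0.0
    return 2 * attr_acc * diag_acc / (attr_acc + diag_acc)
```

For example, a model with 90% diagnosis accuracy but 20% attribute accuracy would score far below one that is 70% at both, flagging it as "right for the wrong reasons."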

3. The "Flashlight" Test (Checking Attention)

Explainable AI (XAI) often tries to draw a "heat map" on an image to show where the AI is looking.

  • The Problem: In real life, we don't know if the AI is looking at the right spot.
  • The Solution: With FunnyNodules, the researchers know exactly where the "spikes" are because they built them there.
  • The Test: They shine a "flashlight" (the AI's attention map) on the image. Does the light shine on the spikes?
  • The Result: In their tests, they found that even advanced AI models often looked at the whole blob instead of zooming in on the specific spikes that mattered. This helps developers fix the AI's "vision."
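Because the generator knows exactly which pixels belong to the spikes, the flashlight test reduces to comparing two masks. A minimal sketch, assuming the attention map and the ground-truth spike mask are same-sized 2D grids, is an intersection-over-union (IoU) score (one common overlap metric; the paper may use others):

```python
def attention_iou(attention, spike_mask, threshold=0.5):
    """Binarize the attention map at `threshold` and compute
    intersection-over-union with the known spike locations.
    `attention`: 2D grid of floats in 0..1; `spike_mask`: 2D grid of bools."""
    inter = union = 0
    for att_row, mask_row in zip(attention, spike_mask):
        for a, m in zip(att_row, mask_row):
            hot = a >= threshold
            inter += hot and m
            union += hot or m
    # Empty union means neither map highlights anything: treat as full overlap.
    return inter / union if union else 1.0
```

An IoU near 1 means the flashlight lands squarely on the spikes; a model that attends to the whole blob spreads its attention over many non-spike pixels and its IoU drops accordingly.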

The Big Picture

The authors aren't saying, "Stop using real patient data." Real data is still necessary for the final check.

Instead, they are saying: "Before you test your AI on real, messy patients, test it on our perfect Lego set."

It's like a flight simulator. Pilots don't learn to fly on a real plane with a storm outside; they learn in a simulator where they can crash a thousand times without hurting anyone. FunnyNodules is the flight simulator for medical AI. It lets researchers crash their models, figure out why they crashed, and fix the logic before they ever touch a real patient's data.

In short: it's a customizable, perfect test bench that helps us build AI that doesn't just guess the right answer, but actually understands why it's right.