FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI

The paper introduces FunnyNodules, a fully parameterized synthetic dataset of lung nodule-like shapes with controllable visual attributes and known decision rules, designed to systematically evaluate and benchmark explainable AI models by verifying whether they learn correct attribute-target relations and align their attention with relevant diagnostic features.

Luisa Gallée, Yiheng Xiong, Meinrad Beer, Michael Götz

Published 2026-03-09

Imagine you are trying to teach a robot how to be a doctor. You show it thousands of X-rays of lung nodules (little lumps in the lung) and tell it, "This one is dangerous, this one is safe." The robot learns to get the diagnosis right. But here's the problem: Did it learn the right reasons?

Maybe the robot is just guessing based on the background noise in the image, or maybe it's looking at the wrong part of the lung. In the real world, we often can't tell why the robot made a mistake because we don't have a "teacher's answer key" that explains exactly which visual features (like roundness or sharp edges) led to the diagnosis.

This is where the paper introduces FunnyNodules.

The "Lego" Analogy: Building a Perfect Test

Think of real medical data like a messy, chaotic pile of rocks. Every rock is different, some are hidden, and we don't know exactly why a geologist picked one over another.

FunnyNodules is like a Lego set for medical AI. Instead of messy rocks, the researchers built a factory that creates perfect, synthetic lung nodules out of digital "Lego bricks."

Here is how it works:

  1. The Bricks (Attributes): The factory has specific knobs the researchers can turn to change the nodule's look. They can make it:
    • More or less round.
    • Have spiky edges or smooth edges.
    • Be big or small.
    • Be dark or bright.
    • Have a texture inside or be empty.
  2. The Rulebook (The Logic): The researchers write a strict rulebook. For example: "If the nodule is spiky AND has a texture inside, it is dangerous (Target 5). If it is round and smooth, it is safe (Target 1)."
  3. The Perfect Answer Key: Because the computer built the image using these rules, it knows exactly why every single image is labeled the way it is. It has a perfect "ground truth."
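The three steps above can be sketched in a few lines of Python. The attribute names, value ranges, and the decision rule below are illustrative stand-ins, not the paper's exact parameterization; the point is only that the label is computed deterministically from the attributes, so the "answer key" is known by construction.

```python
import random

# Illustrative attribute set on a 0..1 scale (not the paper's exact list).
ATTRIBUTES = ["roundness", "spikiness", "size", "brightness", "texture"]

def sample_attributes(rng):
    """Draw each visual attribute independently on a 0..1 scale."""
    return {name: rng.random() for name in ATTRIBUTES}

def rule_based_target(attrs):
    """Toy rulebook: spiky AND textured -> dangerous (5),
    round AND smooth -> safe (1), everything else -> intermediate (3)."""
    if attrs["spikiness"] > 0.5 and attrs["texture"] > 0.5:
        return 5
    if attrs["roundness"] > 0.5 and attrs["spikiness"] <= 0.5:
        return 1
    return 3

# Because the label is a pure function of the attributes, every generated
# image comes with a perfect explanation of WHY it got its label.
rng = random.Random(0)
sample = sample_attributes(rng)
label = rule_based_target(sample)
```

A renderer would then turn the attribute dictionary into an actual image; the key design choice is that the label never depends on anything the generator did not explicitly control.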

Why is this "Funny"?

The name "FunnyNodules" comes from the idea that these aren't real, scary tumors. They are abstract, cartoon-like shapes. They are "funny" because they are silly, made-up drawings, but they are incredibly useful for testing.

What Can We Do With This Lego Set?

The paper shows three main ways scientists can use this to test AI:

1. The "What If?" Game (Testing Reasoning)

In the real world, you can't easily change just one thing about a patient's lung. But with FunnyNodules, you can.

  • The Test: You show the AI a nodule. Then, you say, "Okay, keep everything the same, but make it spikier."
  • The Goal: Does the AI change its mind? If the rulebook says "spiky = dangerous," the AI should say, "Oh, now it's dangerous!"
  • The Result: If the AI ignores the spikes and keeps saying it's safe, we know the AI is "cheating" or looking at the wrong clues. It's like a student who memorized the answer key but didn't learn the math.
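This counterfactual game can be expressed as a small check. The `toy_rule` and `cheating_model` below are invented for illustration (the paper's rules and models differ); the pattern is the interesting part: edit one attribute, hold the rest fixed, and ask whether the model's prediction flips exactly when the ground-truth rule says it should.

```python
def toy_rule(attrs):
    """Ground-truth rulebook: spiky nodules are dangerous."""
    return "dangerous" if attrs["spikiness"] > 0.5 else "safe"

def cheating_model(attrs):
    """A shortcut-learning model that ignores spikiness entirely
    and only looks at brightness (the 'wrong clue')."""
    return "dangerous" if attrs["brightness"] > 0.5 else "safe"

def counterfactual_check(model, rule, attrs, attribute, new_value):
    """Edit one attribute, keep everything else fixed, and test whether
    the model's prediction flips exactly when the rule says it should."""
    edited = {**attrs, attribute: new_value}
    should_flip = rule(edited) != rule(attrs)
    did_flip = model(edited) != model(attrs)
    return should_flip == did_flip

base = {"spikiness": 0.1, "brightness": 0.2}
# Making the nodule spikier should change the diagnosis, but the
# cheating model never notices -> the check fails, exposing the shortcut.
passed = counterfactual_check(cheating_model, toy_rule, base, "spikiness", 0.9)
```

Running many such single-attribute edits over the whole dataset gives a systematic picture of which attributes the model actually reacts to.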

2. The "Trust Score" (Checking if the AI is Honest)

Sometimes an AI gets the right answer for the wrong reason.

  • The Analogy: Imagine a student taking a math test. They get the answer "42" correct. But when you ask them how they got it, they say, "I guessed."
  • The Test: The researchers measure two things:
    1. How good is the AI at spotting the features (e.g., "Is it spiky?")?
    2. How good is the AI at making the final diagnosis?
  • The Result: If the AI is great at spotting features but terrible at diagnosing, it's confused. If it's great at diagnosing but terrible at spotting features, it's just guessing. The paper creates a "Trust Index" to tell you if you can trust the AI's logic.
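One way to turn those two measurements into a single number is sketched below. The paper defines its own index; this harmonic-mean version is just an illustrative stand-in that captures the key property: the score is high only when the model is good at both spotting the features and making the diagnosis, and collapses toward zero if either one fails.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the ground truth."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def trust_index(attr_acc, diag_acc):
    """Illustrative trust index (harmonic mean of the two accuracies).
    High only when the model both spots the features AND diagnoses well;
    a model that guesses diagnoses without seeing features scores low."""
    if attr_acc + diag_acc == 0:
        return 0.0
    return 2 * attr_acc * diag_acc / (attr_acc + diag_acc)
```

For example, a model with 90% diagnosis accuracy but 20% attribute accuracy would score far below one that is 70% at both, flagging it as "right for the wrong reasons."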

3. The "Flashlight" Test (Checking Attention)

Explainable AI (XAI) often tries to draw a "heat map" on an image to show where the AI is looking.

  • The Problem: In real life, we don't know if the AI is looking at the right spot.
  • The Solution: With FunnyNodules, the researchers know exactly where the "spikes" are because they built them there.
  • The Test: They shine a "flashlight" (the AI's attention map) on the image. Does the light shine on the spikes?
  • The Result: In their tests, they found that even advanced AI models often looked at the whole blob instead of zooming in on the specific spikes that mattered. This helps developers fix the AI's "vision."
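Because the generator knows exactly which pixels belong to the spikes, the flashlight test reduces to comparing two masks. A minimal sketch, assuming the attention map and the ground-truth spike mask are same-sized 2D grids, is an intersection-over-union (IoU) score (one common overlap metric; the paper may use others):

```python
def attention_iou(attention, spike_mask, threshold=0.5):
    """Binarize the attention map at `threshold` and compute
    intersection-over-union with the known spike locations.
    `attention`: 2D grid of floats in 0..1; `spike_mask`: 2D grid of bools."""
    inter = union = 0
    for att_row, mask_row in zip(attention, spike_mask):
        for a, m in zip(att_row, mask_row):
            hot = a >= threshold
            inter += hot and m
            union += hot or m
    # Empty union means neither map highlights anything: treat as full overlap.
    return inter / union if union else 1.0
```

An IoU near 1 means the flashlight lands squarely on the spikes; a model that attends to the whole blob spreads its attention over many non-spike pixels and its IoU drops accordingly.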

The Big Picture

The authors aren't saying, "Stop using real patient data." Real data is still necessary for the final check.

Instead, they are saying: "Before you test your AI on real, messy patients, test it on our perfect Lego set."

It's like a flight simulator. Pilots don't learn to fly on a real plane with a storm outside; they learn in a simulator where they can crash a thousand times without hurting anyone. FunnyNodules is the flight simulator for medical AI. It lets researchers crash their models, figure out why they crashed, and fix the logic before they ever touch a real patient's data.

In short: it's a customizable, perfect test bench that helps us build AI that doesn't just guess the right answer, but actually understands why it's right.