A Pragmatic Note on Evaluating Generative Models with Fréchet Inception Distance for Retinal Image Synthesis

This paper argues that Fréchet Inception Distance (FID), the standard metric for general image synthesis, often fails to align with the practical goals of biomedical generative models, specifically in retinal imaging. The true value of synthetic data, it contends, should be judged by its impact on downstream tasks like classification and segmentation rather than by statistical distribution matching alone.

Yuli Wu, Fucheng Liu, Rüveyda Yilmaz, Henning Konermann, Peter Walter, Johannes Stegmaier

Published 2026-02-23

The Big Idea: The "Fake Resume" Problem

Imagine you are a hiring manager assembling a team of chefs. You have a small pile of real resumes (real data). To fill the gaps, you decide to use an AI to generate thousands of fake resumes (synthetic data) to help you train your hiring algorithm.

Now, you need a way to check if these fake resumes are "good."

  • The Old Way (FID): You hire a strict editor to look at the fake resumes. The editor checks if the font looks right, if the grammar is perfect, and if the paper texture feels real. If the fake resume looks exactly like a real one, the editor gives it a high score.
  • The Real Goal: You don't actually care if the font is perfect. You care if the fake resume helps your hiring algorithm learn to spot the best chefs. Does reading these fake resumes actually make your hiring algorithm smarter?

This paper argues that the "Editor" (the metric called FID) is lying to us. Just because a fake resume looks perfect on paper doesn't mean it helps you hire better chefs. In fact, sometimes the "perfect-looking" fake resumes are useless, and the "messy-looking" ones are actually the most helpful.


The Cast of Characters

  1. The Generative Models (The Forgers): These are the AI artists (like StyleGAN and Diffusion models) trying to create fake retinal eye images. Some are great artists; some are just okay.
  2. The Downstream Task (The Hiring Algorithm): This is the real job the AI needs to do: diagnose glaucoma or map layers of the eye. This is the only thing that actually matters.
  3. FID (The Strict Editor): The "Fréchet Inception Distance." It's the industry standard metric. It uses a pre-trained AI (trained on general photos like cats and cars) to compare the "vibe" of real vs. fake images; a lower distance means the fakes look statistically more like the real thing, which FID assumes makes them good.
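For the curious, the Editor's scoring rule fits in a few lines. This is a minimal numpy-only sketch, assuming the "vibe" features have already been extracted by a pretrained Inception network (real tools like `pytorch-fid` handle that step too). FID fits a Gaussian to each feature set and measures the distance between the two Gaussians; lower is "better."

```python
import numpy as np

def fid(real_feats, fake_feats):
    """Fréchet Inception Distance between two (n_images, n_features)
    arrays of Inception embeddings:
        ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2*(S_r S_f)^{1/2})
    where mu/S are the mean and covariance of each feature set."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    s_r = np.cov(real_feats, rowvar=False)
    s_f = np.cov(fake_feats, rowvar=False)
    # Tr((S_r S_f)^{1/2}) equals the sum of square roots of the
    # eigenvalues of S_r @ S_f (clip away tiny numerical noise).
    eig = np.linalg.eigvals(s_r @ s_f)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(s_r) + np.trace(s_f) - 2.0 * tr_sqrt)
```

Note what this formula never sees: labels, lesions, or diagnoses. It only compares feature statistics, which is exactly the blind spot the paper exploits.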

The Experiment: A Taste Test

The researchers set up a massive experiment with two types of eye images:

  • Fundus Photos: Like taking a wide-angle photo of the back of the eye (looking for glaucoma).
  • OCT Scans: Like taking a cross-section slice of the eye (looking at the layers).

They created 24 different versions of fake images using different settings. Some looked very realistic, some looked a bit blurry.

Then, they did two things:

  1. The Editor's Score: They ran the "Strict Editor" (FID) on all 24 versions to see which ones looked most like real eyes.
  2. The Hiring Test: They mixed the fake images with real ones and trained a medical AI to diagnose eye diseases. They checked: Did the fake images actually make the medical AI better?
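The "Hiring Test" protocol is easy to sketch: train the same downstream model twice, once on real data alone and once on real plus synthetic, and report the change on a held-out test set. Below is a toy numpy version where a nearest-centroid classifier stands in for the actual glaucoma/segmentation models; the function names and classifier are illustrative, not the paper's code.

```python
import numpy as np

def centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit a trivial nearest-centroid classifier (a stand-in for the
    real downstream medical model) and return held-out accuracy."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == test_y).mean())

def downstream_utility(real, synth, test):
    """Utility of a synthetic set = test-accuracy gain from adding it
    to the real training data. Positive means the fakes actually help."""
    (rx, ry), (sx, sy), (tx, ty) = real, synth, test
    baseline = centroid_accuracy(rx, ry, tx, ty)
    augmented = centroid_accuracy(np.concatenate([rx, sx]),
                                  np.concatenate([ry, sy]), tx, ty)
    return augmented - baseline
```

The key design choice: the score is a *difference* in downstream performance, so synthetic data that merely looks real but teaches the model nothing scores zero (or worse).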

The Shocking Result

The Editor and the Hiring Manager were speaking different languages.

  • Scenario A (Fundus Photos): The "Strict Editor" said, "This version (SG-10) is the best! It looks perfect!" But when they used that version to train the medical AI, the AI got worse at diagnosing glaucoma. The "perfect" fake images actually confused the medical AI.
  • Scenario B (OCT Scans): The "Strict Editor" said, "This version (DM-7) is the worst!" But when they used it, the medical AI performed okay.
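One way to quantify how badly the Editor and the Hiring Manager disagree is a rank correlation: rank the 24 versions by FID, rank them again by downstream performance, and check whether the orderings line up. Here is a pure-Python Spearman correlation (a simplified version assuming no tied scores; the paper's exact statistics may differ):

```python
def ranks(values):
    """Rank positions (0 = smallest value), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rankings.
    +1 = identical orderings, -1 = exactly reversed."""
    ra, rb = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)
    return cov / var
```

If a low (good) FID reliably predicted a high downstream score, `spearman(fid_scores, accuracies)` would be strongly negative. The paper's point is that this relationship is far weaker than the field assumes.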

The Analogy:
Imagine you are training a dog to find a lost wallet.

  • FID is like a judge who says, "This fake wallet looks exactly like a real one. It has the right color, the right leather texture, and the right smell. Score: 10/10."
  • The Downstream Task is the dog. The dog doesn't care about the leather texture. The dog needs to find the specific smell of the wallet.
  • The Problem: The AI generated a wallet that looked perfect (a great FID score) but had the wrong chemical smell. The dog couldn't find it. The AI generated a wallet that looked a bit weird (a poor FID score) but had the right smell. The dog found it easily.

The paper found that FID is obsessed with the "leather texture" (visual similarity) but ignores the "chemical smell" (medical utility).

Why Does This Happen?

The paper digs into why FID fails in medicine:

  1. Wrong Teacher: FID uses an AI trained on general photos (cats, cars, landscapes). It doesn't understand that in an eye scan, a tiny blurry spot might be the sign of a sight-threatening disease, while a perfect-looking blood vessel might be diagnostically irrelevant.
  2. The "Uncanny Valley" of Data: To score well on FID, a generator tends to smooth out the weird, rare, or difficult cases. But in medicine, those weird cases are exactly what the model needs to learn from! By rewarding data that is "too perfect," FID quietly pushes the most important training examples out of the synthetic set.

The Takeaway: Stop Judging the Book by Its Cover

The authors conclude that if you are making fake medical data to help train AI doctors, stop using FID as your main judge.

  • Don't ask: "Does this fake image look like a real one?"
  • Do ask: "If I feed this fake image to a medical AI, does the AI get better at its job?"

The Golden Rule: The only true test of a fake medical image is whether it helps a real medical AI perform better. If it looks perfect but doesn't help, it's useless. If it looks a bit weird but saves lives, it's a masterpiece.

Summary in One Sentence

Just because a fake medical image looks perfect to a computer doesn't mean it's useful for training a doctor; the only metric that matters is whether it actually helps the AI diagnose patients better.
