A Pragmatic Note on Evaluating Generative Models with Fréchet Inception Distance for Retinal Image Synthesis

This paper argues that Fréchet Inception Distance (FID), the standard metric for general image synthesis, often fails to align with the practical goals of biomedical generative models, specifically in retinal imaging. The true value of synthetic data, it contends, should be judged by its impact on downstream tasks like classification and segmentation rather than by statistical distribution matching alone.

Yuli Wu, Fucheng Liu, Rüveyda Yilmaz, Henning Konermann, Peter Walter, Johannes Stegmaier

Published 2026-02-23

The Big Idea: The "Fake Resume" Problem

Imagine you are a hiring manager assembling a team of chefs. You have a small pile of real resumes (real data). To fill the gaps, you decide to use an AI to generate thousands of fake resumes (synthetic data) to help you train your hiring algorithm.

Now, you need a way to check if these fake resumes are "good."

  • The Old Way (FID): You hire a strict editor to look at the fake resumes. The editor checks if the font looks right, if the grammar is perfect, and if the paper texture feels real. If the fake resume looks exactly like a real one, the editor gives it a high score.
  • The Real Goal: You don't actually care if the font is perfect. You care if the fake resume helps your hiring algorithm learn to spot the best chefs. Does reading these fake resumes actually make your hiring algorithm smarter?

This paper argues that the "Editor" (the metric called FID) is lying to us. Just because a fake resume looks perfect on paper doesn't mean it helps you hire better chefs. In fact, sometimes the "perfect-looking" fake resumes are useless, and the "messy-looking" ones are actually the most helpful.


The Cast of Characters

  1. The Generative Models (The Forgers): These are the AI artists (like StyleGAN and Diffusion models) trying to create fake retinal eye images. Some are great artists; some are just okay.
  2. The Downstream Task (The Hiring Algorithm): This is the real job the AI needs to do: diagnose glaucoma or map layers of the eye. This is the only thing that actually matters.
  3. FID (The Strict Editor): The "Fréchet Inception Distance." It's the industry standard metric. It uses a pre-trained AI (trained on general photos like cats and cars) to compare the "vibe" of real vs. fake images; a lower distance means the fakes look statistically more like the real thing, which FID assumes makes them good.
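For the curious, the Editor's scoring rule fits in a few lines. This is a minimal numpy-only sketch, assuming the "vibe" features have already been extracted by a pretrained Inception network (real tools like `pytorch-fid` handle that step too). FID fits a Gaussian to each feature set and measures the distance between the two Gaussians; lower is "better."

```python
import numpy as np

def fid(real_feats, fake_feats):
    """Fréchet Inception Distance between two (n_images, n_features)
    arrays of Inception embeddings:
        ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2*(S_r S_f)^{1/2})
    where mu/S are the mean and covariance of each feature set."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    s_r = np.cov(real_feats, rowvar=False)
    s_f = np.cov(fake_feats, rowvar=False)
    # Tr((S_r S_f)^{1/2}) equals the sum of square roots of the
    # eigenvalues of S_r @ S_f (clip away tiny numerical noise).
    eig = np.linalg.eigvals(s_r @ s_f)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(s_r) + np.trace(s_f) - 2.0 * tr_sqrt)
```

Note what this formula never sees: labels, lesions, or diagnoses. It only compares feature statistics, which is exactly the blind spot the paper exploits.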

The Experiment: A Taste Test

The researchers set up a massive experiment with two types of eye images:

  • Fundus Photos: Like taking a wide-angle photo of the back of the eye (looking for glaucoma).
  • OCT Scans: Like taking a cross-section slice of the eye (looking at the layers).

They created 24 different versions of fake images using different settings. Some looked very realistic, some looked a bit blurry.

Then, they did two things:

  1. The Editor's Score: They ran the "Strict Editor" (FID) on all 24 versions to see which ones looked most like real eyes.
  2. The Hiring Test: They mixed the fake images with real ones and trained a medical AI to diagnose eye diseases. They checked: Did the fake images actually make the medical AI better?
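The "Hiring Test" protocol is easy to sketch: train the same downstream model twice, once on real data alone and once on real plus synthetic, and report the change on a held-out test set. Below is a toy numpy version where a nearest-centroid classifier stands in for the actual glaucoma/segmentation models; the function names and classifier are illustrative, not the paper's code.

```python
import numpy as np

def centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit a trivial nearest-centroid classifier (a stand-in for the
    real downstream medical model) and return held-out accuracy."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == test_y).mean())

def downstream_utility(real, synth, test):
    """Utility of a synthetic set = test-accuracy gain from adding it
    to the real training data. Positive means the fakes actually help."""
    (rx, ry), (sx, sy), (tx, ty) = real, synth, test
    baseline = centroid_accuracy(rx, ry, tx, ty)
    augmented = centroid_accuracy(np.concatenate([rx, sx]),
                                  np.concatenate([ry, sy]), tx, ty)
    return augmented - baseline
```

The key design choice: the score is a *difference* in downstream performance, so synthetic data that merely looks real but teaches the model nothing scores zero (or worse).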

The Shocking Result

The Editor and the Hiring Manager were speaking different languages.

  • Scenario A (Fundus Photos): The "Strict Editor" said, "This version (SG-10) is the best! It looks perfect!" But when they used that version to train the medical AI, the AI got worse at diagnosing glaucoma. The "perfect" fake images actually confused the medical AI.
  • Scenario B (OCT Scans): The "Strict Editor" said, "This version (DM-7) is the worst!" But when they used it, the medical AI performed okay.
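One way to quantify how badly the Editor and the Hiring Manager disagree is a rank correlation: rank the 24 versions by FID, rank them again by downstream performance, and check whether the orderings line up. Here is a pure-Python Spearman correlation (a simplified version assuming no tied scores; the paper's exact statistics may differ):

```python
def ranks(values):
    """Rank positions (0 = smallest value), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rankings.
    +1 = identical orderings, -1 = exactly reversed."""
    ra, rb = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)
    return cov / var
```

If a low (good) FID reliably predicted a high downstream score, `spearman(fid_scores, accuracies)` would be strongly negative. The paper's point is that this relationship is far weaker than the field assumes.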

The Analogy:
Imagine you are training a dog to find a lost wallet.

  • FID is like a judge who says, "This fake wallet looks exactly like a real one. It has the right color, the right leather texture, and the right smell. Score: 10/10."
  • The Downstream Task is the dog. The dog doesn't care about the leather texture. The dog needs to find the specific smell of the wallet.
  • The Problem: The AI generated a wallet that looked perfect (a great FID score) but had the wrong chemical smell. The dog couldn't find it. The AI generated a wallet that looked a bit weird (a poor FID score) but had the right smell. The dog found it easily.

The paper found that FID is obsessed with the "leather texture" (visual similarity) but ignores the "chemical smell" (medical utility).

Why Does This Happen?

The paper digs into why FID fails in medicine:

  1. Wrong Teacher: FID uses an AI trained on general photos (cats, cars, landscapes). It doesn't understand that in an eye scan, a tiny blurry spot might be the sign of a sight-threatening disease, while a perfect-looking blood vessel might be diagnostically irrelevant.
  2. The "Uncanny Valley" of Data: To score well on FID, a generator tends to smooth out the weird, rare, or difficult cases. But in medicine, those weird cases are exactly what the model needs to learn from! By rewarding data that is "too perfect," FID quietly pushes the most important training examples out of the synthetic set.

The Takeaway: Stop Judging the Book by Its Cover

The authors conclude that if you are making fake medical data to help train AI doctors, stop using FID as your main judge.

  • Don't ask: "Does this fake image look like a real one?"
  • Do ask: "If I feed this fake image to a medical AI, does the AI get better at its job?"

The Golden Rule: The only true test of a fake medical image is whether it helps a real medical AI perform better. If it looks perfect but doesn't help, it's useless. If it looks a bit weird but saves lives, it's a masterpiece.

Summary in One Sentence

Just because a fake medical image looks perfect to a computer doesn't mean it's useful for training a doctor; the only metric that matters is whether it actually helps the AI diagnose patients better.
