Anatomical Accuracy of Generative AI for Congenital Heart Disease Illustrations: Gemini NanoBanana Versus ChatGPT Models in a Blinded Comparative Study

In a blinded comparative study of congenital heart disease illustrations, human-modified images demonstrated superior anatomical accuracy and educational suitability compared to generative AI models, with Gemini NanoBanana outperforming ChatGPT systems yet still falling significantly short of expert-designed standards.

Alhuzaimi, A., Alkanhal, A., Alruwaili, A. R. S., Alharbi, N. S., Alfares, F., Aldekhyyel, R. N., Binkheder, S., Temsah, A., Aljamaan, F., Shahzad, M., Albriek, A. Z., Alanazi, F. I., Alhindi, D. A., Al-khatib, S. M., Darweesh, A. A., Altamimi, I., Jamal, A., Saad, K., Alhasan, K., Al-Eyadhy, A., Malki, K. H., Temsah, M.-H.

Published 2026-02-23

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a class of students how to fix a very complicated, custom-made watch. You need a picture of the watch's gears to show them where the springs go.

In the past, you would hire a master watchmaker (an expert) to draw the picture perfectly. But now, you have a new, super-fast robot artist (Generative AI) that can draw a picture of a watch in seconds for free.

The big question: Can you trust the robot's drawing to teach your students, or will it look pretty but get the gears wrong?

This study is exactly that, but instead of watches, the "gears" are the human heart, specifically hearts born with defects (Congenital Heart Disease). The researchers asked: Can AI draw accurate medical pictures of these complex hearts?

Here is the breakdown of what they found, using simple analogies:

1. The Contestants

The researchers set up a "blind taste test" (like a mystery food challenge). They gathered 20 doctors (some heart experts, some general doctors) and showed them pictures of 20 different heart conditions. The doctors didn't know who drew which picture. The pictures came from three sources:

  • The Human Expert: An image reviewed and modified by a physician to ensure anatomical accuracy (the "Gold Standard").
  • Gemini NanoBanana: A Google AI model.
  • ChatGPT (version 5 and its image-generation model): OpenAI models.

2. The Results: The "Pretty but Wrong" Problem

The Human Expert (The Gold Standard):
Think of this as a master architect's blueprint. It was the most accurate. About 48% of the time, the doctors said, "Yes, this is exactly right." It was the only one they felt comfortable using in a classroom without changing anything.

Gemini NanoBanana (The "Good Student"):
This AI was the runner-up. It was better than the others but still made mistakes.

  • Accuracy: Only about 23% of its drawings were correct.
  • The Look: Interestingly, the doctors rated these pictures as the most beautiful and visually attractive. It's like a student who draws a stunning, colorful picture of a car, but the wheels are on the roof and the engine is in the trunk. It looks cool, but it doesn't work.
  • Verdict: You could use it if you fix the mistakes first.

ChatGPT (The "Confident Hallucinator"):
This was the biggest disappointment.

  • Accuracy: Only about 3% of the drawings were correct.
  • The Problem: In 86% of cases, the doctors rated the picture as "fabricated" (made up). The AI drew hearts with extra chambers, missing valves, or blood vessels running in the wrong direction.
  • The Danger: Because the pictures looked so realistic and confident, a student might believe them and learn the wrong anatomy. It's like a GPS that confidently tells you to drive into a lake because it "thinks" that's the fastest route.

3. The "Label" Trouble

The researchers also checked the text labels inside the pictures (e.g., "Aorta," "Left Ventricle").

  • Human: Labels were correct.
  • Gemini: Labels were okay, but sometimes mixed up.
  • ChatGPT: The labels were mostly nonsense. The AI would point to a valve and give it a completely unrelated label, or put the label for the "Aorta" on the "Pulmonary Artery." It's like a museum guide who points to a painting and says, "This is a famous sculpture."

4. The Expert vs. The Generalist

The study found something interesting about who was judging the pictures:

  • Heart Specialists: They were the strictest critics. They spotted the tiny errors immediately.
  • General Doctors: They were a bit more lenient. Because the AI pictures looked "pretty" and "professional," the general doctors were more likely to think, "Oh, that looks good enough," even if the anatomy was wrong.
  • The Lesson: If you aren't an expert, you might be fooled by a pretty picture that is actually wrong.

5. The Bottom Line (The Takeaway)

The study concludes that AI is a great "draftsman," but a terrible "final artist" for medical education.

  • Don't use AI images directly in class. If you show a student a ChatGPT heart, you are teaching them lies.
  • Use AI as a starting point. You can ask AI to "draw a heart," get a rough sketch, and then have a real doctor fix the errors.
  • Gemini is better than ChatGPT for this specific job, but neither is ready to replace a human medical illustrator yet.

In short: AI is like a very fast, very confident intern who loves to draw. They will hand you a picture in seconds that looks amazing. But if you don't have a senior doctor check their work, they will accidentally teach your students that the heart has a third ear or that blood flows backward. Always have a human expert review the AI's work before showing it to anyone.
