LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology

This paper introduces LMOD+, a large-scale multimodal ophthalmology benchmark dataset and evaluation framework featuring 32,633 annotated instances across 12 conditions and 5 imaging modalities, designed to advance and systematically assess the capabilities of multimodal large language models in vision-threatening disease diagnosis, staging, and bias detection.

Zhenyue Qin, Yang Liu, Yu Yin, Jinyu Ding, Haoran Zhang, Anran Li, Dylan Campbell, Xuansheng Wu, Ke Zou, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen

Published 2026-03-10

Imagine the world of eye care (ophthalmology) as a vast, complex library. For years, the librarians (doctors) have been overwhelmed. There are too many books (patients), not enough staff, and many people are waiting too long to get their eyes checked, leading to preventable blindness.

Enter Artificial Intelligence (AI). Specifically, a new generation of "super-readers" called Multimodal Large Language Models (MLLMs). Think of these not just as text-chatting robots, but as brilliant interns who can read a patient's history and look at a picture of their eye simultaneously to make a diagnosis.

However, there's a problem: How do we test if these interns are actually good?

The Problem: The Wrong Test

Previously, researchers tried to test these AI interns using old, simple quizzes that were designed for earlier, far less capable computer systems.

  • The Old Way: Show a picture of an eye, and the computer just had to say "Yes" or "No" (e.g., "Is there glaucoma?"). It was like a multiple-choice test where you just circle a letter.
  • The New Reality: Modern AI is like a human doctor. It needs to look at the picture, explain why it thinks there's a disease, describe the anatomy, and even guess the patient's age or gender based on the eye. The old "Yes/No" tests were too simple and didn't measure the AI's true intelligence or its ability to explain its reasoning.

The Solution: LMOD+ (The Ultimate Eye-Exam Simulator)

The authors of this paper created LMOD+, which is essentially a massive, high-tech "simulator" or "training ground" for these AI doctors.

Here is what makes LMOD+ special, using some everyday analogies:

1. A Massive, Diverse Library (The Dataset)
Instead of just a few photos, LMOD+ contains over 32,000 eye images.

  • The Variety: It's not just one type of photo. It includes:
    • Color Fundus Photos (CFP): Like a standard photo of the back of the eye (the most common type).
    • OCT Scans: Like a high-tech "slice" of the eye, showing layers like a loaf of bread.
    • Surgical Scenes: Videos and photos of actual eye surgeries.
    • Lens Photos: Close-ups of the eye's front lens (for cataracts).
  • The Annotations: Every image is tagged with detailed notes: "Here is the optic nerve," "Here is a tumor," "This patient is 60 years old." It's like having a textbook where every diagram is labeled by a team of expert doctors.
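To make "every image is tagged with detailed notes" concrete, here is a minimal sketch of how one annotated instance in a dataset like this could be represented. The field names are illustrative stand-ins, not the actual LMOD+ schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EyeImageRecord:
    """One annotated instance; field names are illustrative, not the real LMOD+ schema."""
    image_path: str
    modality: str                  # e.g. "CFP", "OCT", "surgical", "lens"
    condition: str                 # e.g. "glaucoma", "diabetic retinopathy", "none"
    stage: Optional[int]           # severity stage, if the condition is staged
    anatomy_boxes: dict            # structure name -> bounding box (x, y, w, h)
    age: Optional[int]             # patient demographics, where available
    sex: Optional[str]

# A single record combining anatomy, diagnosis, staging, and demographics
record = EyeImageRecord(
    image_path="images/cfp_0001.png",
    modality="CFP",
    condition="diabetic retinopathy",
    stage=2,
    anatomy_boxes={"optic_disc": (412, 388, 96, 96)},
    age=60,
    sex="F",
)
print(record.modality, record.condition, record.stage)
```

One record like this can feed all four exam tasks described below: the boxes drive the anatomy question, the condition and stage drive diagnosis and staging, and the age/sex fields drive the demographics probe.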

2. The Four-Part Exam (The Tasks)
LMOD+ doesn't just ask the AI to guess a disease. It puts the AI through a rigorous four-part board exam:

  • Anatomy 101: "Point to the optic nerve and the retina." (Can the AI see the parts?)
  • Diagnosis: "Does this patient have diabetic retinopathy? Explain why." (Can it spot the disease and talk about it?)
  • Staging: "If they have the disease, how bad is it? Is it Stage 1 or Stage 4?" (Can it judge severity?)
  • Demographics: "Based on this eye, guess the patient's age and gender." (This tests if the AI is biased or if it can pick up subtle clues).

3. The "Zero-Shot" Challenge
The most difficult part of this test is that the AI is asked to take the exam without any prior studying on these specific questions. This is called "zero-shot."

  • Analogy: Imagine handing a medical student a brand new, complex eye scan and asking them to diagnose it before they've ever seen a similar case in class. Most AI models failed this test, performing barely better than random guessing.
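Mechanically, zero-shot evaluation is a simple loop: show each image and question to the model with no task-specific training, and compare its answer to the expert label. A minimal sketch, where `ask_model` is a hypothetical stand-in for a real MLLM call (here it guesses at random, which is exactly the baseline the paper says many models barely beat):

```python
import random

random.seed(0)  # make the random baseline reproducible

def ask_model(image_path: str, question: str) -> str:
    """Hypothetical stand-in for an MLLM API call.
    This version guesses uniformly, i.e. the coin-flip baseline."""
    return random.choice(["yes", "no"])

def zero_shot_accuracy(examples) -> float:
    """examples: list of (image_path, question, gold_answer) triples."""
    correct = sum(
        ask_model(img, q).strip().lower() == gold
        for img, q, gold in examples
    )
    return correct / len(examples)

examples = [
    ("cfp_0001.png", "Does this patient have glaucoma?", "no"),
    ("cfp_0002.png", "Does this patient have diabetic retinopathy?", "yes"),
]
print(f"zero-shot accuracy: {zero_shot_accuracy(examples):.0%}")
```

Swapping the random `ask_model` for a real model is all it takes to reproduce this kind of screening-accuracy number; the 58% figure reported later is what the better models achieve against this coin-flip floor.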

The Results: The AI is Smart, But Not a Doctor Yet

The researchers tested 24 different AI models (including famous ones like GPT-4o, Qwen, and InternVL) on this new exam.

  • The Good News: Some models, like Qwen and InternVL, showed promise. They could screen for diseases with about 58% accuracy without any special training. That's better than a coin flip, but not good enough to trust with a human life yet.
  • The Bad News:
    • The "Hallucination" Problem: When the AI didn't know the answer, it sometimes made things up. It might confidently say, "I see a tumor," when there was none, or it might get stuck in a loop repeating the same word forever.
    • The "Medical Knowledge" Gap: Some models that were specifically trained on medical texts (like "LLaVA-Med") actually performed worse than general models. It turns out, just reading medical books isn't enough; you need to learn how to look at medical pictures.
    • The "Staging" Struggle: While the AI could sometimes guess "Yes/No" for a disease, it was terrible at figuring out how bad the disease was. This is crucial because treatment depends on severity.

Why This Matters

Think of LMOD+ as a standardized driving test for AI cars. Before, we were testing self-driving cars on empty parking lots (simple datasets). Now, LMOD+ throws them into a busy city with rain, pedestrians, and complex traffic signs (real-world ophthalmology).

The Takeaway:
Current AI is like a very smart student who has read all the textbooks but has never actually practiced on real patients. It knows the theory but struggles with the messy reality of a real eye exam.

The authors are releasing this "exam" and the "study materials" to the public. Their goal is to help developers build better AI doctors. If we can train these models to pass the LMOD+ exam, we could eventually have AI assistants that help doctors in remote villages diagnose eye diseases instantly, saving millions of people from going blind.

In short: We built a better test to see if AI is ready to be a doctor. The answer is: "Not quite yet, but we now have the tools to teach them how to get there."