Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

This paper introduces EyExIn, a data-efficient framework that enhances retinal Vision Language Models. It combines a dual-stream encoding strategy with a deep expert injection mechanism to bridge perception and reasoning gaps, achieving state-of-the-art precision in ophthalmic diagnosis while preventing hallucinations.

Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li

Published Tue, 10 Ma

Imagine you are trying to teach a brilliant, well-read librarian (the AI) how to diagnose eye diseases by looking at photos of the back of the eye (retina).

The problem is that this librarian has read millions of books about general topics but has never actually looked at a real eye in a medical setting. When you show them a photo, they might guess the disease based on what they've read in books, rather than what they actually see in the picture. This leads to two big mistakes:

  1. The "Blind Spot" (Perception Gap): They miss tiny, subtle clues (like a tiny burst blood vessel) because their eyes aren't trained for medical details.
  2. The "Daydream" (Reasoning Gap): As they think harder, they start ignoring the photo entirely and just make up a diagnosis based on their general knowledge, often inventing diseases that aren't there or missing real ones.

The paper introduces a new system called EyExIn (Eye Expert Injection) to fix this. Here is how it works, using simple analogies:

1. The "Dual-Eye" Strategy (Expert-Aware Dual-Stream)

Instead of giving the librarian just one pair of eyes, EyExIn gives them two pairs of glasses working together:

  • The "General Glasses": These look at the big picture. They see the overall shape of the eye, the color, and the main structures (like the optic disc). This is like looking at a landscape painting and seeing the mountains and rivers.
  • The "Microscope Glasses": These are specialized for doctors. They zoom in on the tiny, dangerous details like micro-aneurysms or tiny leaks that the general glasses would miss.

The Magic Glue (Semantic-Adaptive Gated Fusion):
If you just mash these two views together, the tiny details might get lost in the noise of the big picture. EyExIn uses a smart "traffic cop" (a Gated Fusion module).

  • If the librarian sees a healthy area, the traffic cop says, "Ignore the microscope, just look at the general view."
  • If the librarian sees a suspicious spot, the traffic cop says, "Stop! Zoom in with the microscope glasses right now!"

This ensures the AI focuses its attention exactly where the disease is, filtering out the background noise.
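
For readers who like to see the "traffic cop" in code, here is a minimal sketch of a gated fusion step for a single image patch. It is not the paper's implementation: the function name gated_fusion, the scalar gate, the additive fusion rule, and all the numbers are assumptions chosen only to illustrate the idea that the expert ("microscope") features are let through more strongly when they signal something suspicious.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(general, expert, gate_weights, gate_bias):
    """Fuse one patch's general and expert features (hypothetical helper).

    The gate is a single number in (0, 1) computed from both views.
    Near 0, the expert features are mostly ignored; near 1, they are
    added on top of the general features almost at full strength.
    """
    joint = general + expert  # list concatenation of the two views
    score = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
    gate = sigmoid(score)
    fused = [g + gate * e for g, e in zip(general, expert)]
    return fused, gate


# Illustrative, hand-picked numbers (assumptions, not learned weights).
general_view = [0.5, 0.2]
healthy_expert = [0.0, 0.1]      # expert stream sees almost nothing
suspicious_expert = [2.0, 1.5]   # expert stream sees a strong lesion cue
weights = [0.0, 0.0, 3.0, 3.0]   # gate listens only to the expert features
bias = -4.0                      # gate stays closed by default

_, gate_healthy = gated_fusion(general_view, healthy_expert, weights, bias)
_, gate_suspicious = gated_fusion(general_view, suspicious_expert, weights, bias)
print(f"healthy patch gate:    {gate_healthy:.3f}")
print(f"suspicious patch gate: {gate_suspicious:.3f}")
```

In this toy setup the gate stays close to 0 for the healthy patch and close to 1 for the suspicious one, which mirrors the two traffic-cop cases above. In a real model the gate weights would be learned rather than set by hand.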

2. The "Anchoring" System (Deep Expert Injection)

Even with the right glasses, the librarian might still get distracted. As they start writing their report (the "reasoning" part), they might forget what they saw in the photo and start guessing based on their memory of textbooks.

EyExIn solves this with "Vision Anchors."
Imagine the librarian is writing a story on a long scroll. Every few paragraphs, a heavy, unbreakable anchor is dropped onto the scroll, pinning the current sentence to the original photo.

  • How it works: The system takes the "Microscope" view and injects it directly into the middle of the AI's thinking process. It's like a persistent reminder that says, "Hey, don't forget! There is a leak right here in the image. Base your next sentence on THIS, not on what you think you know."
  • This prevents the AI from "hallucinating" (making things up) because the visual evidence stays "anchored" to its thoughts as it reasons.
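
The anchoring idea can also be sketched as a toy simulation. Everything here is an assumption made for illustration: toy_layer is a stand-in that simply halves the signal to mimic visual evidence fading with depth, and run_stack, inject_every, and scale are hypothetical names, not the paper's architecture. The point is only to show why re-injecting the expert features partway through the stack keeps them from fading away.

```python
def toy_layer(hidden):
    """Stand-in for one reasoning layer: every component shrinks by half,
    mimicking how visual evidence can fade as the model 'thinks' deeper."""
    return [0.5 * h for h in hidden]


def run_stack(hidden, anchor, num_layers, inject_every=None, scale=1.0):
    """Pass a hidden state through num_layers toy layers.

    If inject_every is set, the expert 'anchor' vector is added back
    into the hidden state after every inject_every-th layer.
    """
    for layer_idx in range(1, num_layers + 1):
        hidden = toy_layer(hidden)
        if inject_every and layer_idx % inject_every == 0:
            hidden = [h + scale * a for h, a in zip(hidden, anchor)]
    return hidden


# Hypothetical expert feature vector for one lesion cue.
expert_anchor = [1.0, -2.0]

# Both runs start from the same visual evidence.
without_anchors = run_stack(list(expert_anchor), expert_anchor, num_layers=8)
with_anchors = run_stack(list(expert_anchor), expert_anchor, num_layers=8,
                         inject_every=2)

print("no anchors:  ", without_anchors)
print("with anchors:", with_anchors)
```

Without anchors, the evidence shrinks to well under 1% of its starting size after eight layers. With an anchor dropped every two layers, it stays at roughly its original strength all the way to the end, which is the "pinned scroll" behavior described above.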

3. The Results: A Trustworthy Doctor

The researchers tested EyExIn against massive, famous AI models (like the "GPT-5" or "Gemini" of the future).

  • The Old Way: The big AIs were confident but often wrong. They would miss tiny diseases or invent fake ones because they were relying too much on their "book smarts" and not enough on the actual photo.
  • The EyExIn Way: Even though it is a smaller model, it acts like a seasoned specialist. It catches the tiny details, ignores the noise, and refuses to guess unless the photo proves it.

In a nutshell:
EyExIn is like taking a smart but inexperienced intern and giving them specialized medical glasses and a permanent tether to the patient's actual eye scan. This stops them from daydreaming and forces them to make diagnoses based strictly on what is actually there, making them a much safer and more reliable tool for doctors.