Imagine you are training a brilliant but inexperienced medical student to become a radiologist. Your goal isn't just to teach them to read an X-ray or MRI, but to teach them to act as a Quality Control Inspector. They need to look at an image and say, "This is blurry," "There's a metal artifact blocking the view," or "This is perfect for diagnosis."
The problem? Teaching this skill is incredibly expensive. You need real doctors (experts) to write long, detailed reports on thousands of images to train the student. But doctors are busy, and paying them to review every single image is impossible. Also, if you just show the student random images, they might keep making the same specific mistakes over and over because they aren't being taught to fix their weaknesses.
MedQ-Engine is a clever, automated system designed to solve this. Think of it as a smart, self-improving training camp that runs in a loop. Here is how it works, broken down into three simple phases:
Phase 1: The "Failure Detective" (Evaluating)
First, the system tests the AI student on a practice exam. Instead of just looking at the final score, it acts like a detective. It looks at where the student failed.
- The Analogy: Imagine a teacher grading a math test. Instead of just saying "You got a C," the teacher notices, "Oh, this student gets every geometry problem wrong but aces the algebra."
- What MedQ-Engine does: It groups these mistakes into "Failure Prototypes." It creates a mental map of the specific types of bad images the AI hates (e.g., "MRI scans with metal implants" or "blurry endoscopy photos"). A sketch of how such prototypes might be built follows below.
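The summary doesn't spell out how these prototypes are constructed, but a common recipe is to embed each failed image with a vision encoder and cluster the embeddings, so each cluster center stands for one recurring failure mode. Here is a minimal sketch in that spirit; the embedding source, the cluster count, and k-means itself are all assumptions, not the paper's confirmed method.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_failure_prototypes(failed_embeddings: np.ndarray,
                             n_prototypes: int = 8) -> np.ndarray:
    """Cluster embeddings of the images the student AI got wrong.

    Each cluster center acts as one "failure prototype" -- a vector
    representing a recurring weakness (e.g., "MRI scans with metal implants").
    """
    kmeans = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0)
    kmeans.fit(failed_embeddings)
    return kmeans.cluster_centers_  # shape: (n_prototypes, embedding_dim)

# Example: 200 failed images, each embedded as a 512-d vector by some
# (hypothetical) frozen vision encoder.
failures = np.random.randn(200, 512)
prototypes = build_failure_prototypes(failures)
```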
Phase 2: The "Smart Scout" (Exploring)
Now that the system knows exactly what the student is bad at, it goes hunting for more practice material. It has a massive warehouse of 1 million unlabeled medical images.
- The Analogy: Instead of randomly grabbing books from a library, the teacher uses the "Failure Prototypes" as a search key, pulling out only the books that cover the types of problems the student struggles with.
- The Human Touch (The Cost Saver): This is where it gets smart about money. The system asks a super-smart AI (like GPT-4o) to draft the answers first.
- If the student AI is confident and agrees with the super-AI, no human is needed.
- If the student is confused or disagrees with the super-AI, then a human doctor is called in to check.
- The Result: Humans only have to review about 18% of the images. The rest are handled by the AI team, saving massive amounts of time and money. (Both the prototype-guided search and this routing logic are sketched in code below.)
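The bullets above boil down to two mechanisms: use the failure prototypes to search the unlabeled pool for similar images, and only escalate to a human when the student is unsure or disagrees with the strong model's draft. The sketch below shows both under plain assumptions — cosine similarity for retrieval and a simple confidence threshold for the gate; the paper's actual scoring and thresholds are not given in this summary.

```python
import numpy as np

def mine_hard_examples(prototypes: np.ndarray,
                       pool_embeddings: np.ndarray,
                       k: int = 500) -> np.ndarray:
    """Return indices of the k unlabeled images most similar to any failure
    prototype -- i.e., exactly the material the student struggles with."""
    # Normalize so dot products equal cosine similarity.
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    x = pool_embeddings / np.linalg.norm(pool_embeddings, axis=1, keepdims=True)
    similarity = x @ p.T               # (pool_size, n_prototypes)
    score = similarity.max(axis=1)     # similarity to the closest weakness
    return np.argsort(-score)[:k]      # top-k candidates for labeling

def route_for_labeling(student_answer: str, student_confidence: float,
                       teacher_answer: str,
                       conf_threshold: float = 0.85) -> str:
    """Decide who finalizes a label. If the student is confident AND agrees
    with the strong model's draft, accept it automatically; otherwise send
    the image to a human doctor (about 18% of cases, per the paper)."""
    if student_confidence >= conf_threshold and student_answer == teacher_answer:
        return "auto-accept"
    return "human-review"
```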
Phase 3: The "Coach" (Evolving)
The system takes the new, high-quality, human-verified data and gives the student a "crash course" (fine-tuning). The student learns specifically how to fix the mistakes they were making.
- The Loop: Then, the whole process starts again. The student takes a new test, the system finds the new weaknesses, and the cycle repeats. The student gets better and better with every round. (The full loop is sketched below.)
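Putting the three phases together, the whole engine is just a loop. The sketch below only shows the control flow: `evaluate`, `embed`, `label_with_gate`, and `fine_tune` are hypothetical stand-ins for the paper's actual components, supplied by the caller, while the two helpers from the earlier sketches slot in as Phases 1 and 2.

```python
def medq_engine_loop(student, unlabeled_pool, benchmark, *,
                     evaluate, embed, label_with_gate, fine_tune,
                     rounds: int = 5):
    """One self-improving cycle per round: evaluate -> cluster failures ->
    mine & label hard examples -> fine-tune -> repeat. All callables are
    hypothetical stand-ins passed in by the caller."""
    for _ in range(rounds):
        failures = evaluate(student, benchmark)                      # Phase 1: find the mistakes
        prototypes = build_failure_prototypes(embed(failures))       # Phase 1: map the weaknesses
        idx = mine_hard_examples(prototypes, embed(unlabeled_pool))  # Phase 2: scout the pool
        new_data = [label_with_gate(unlabeled_pool[i]) for i in idx] # Phase 2: AI drafts, humans gate
        student = fine_tune(student, new_data)                       # Phase 3: targeted crash course
    return student
```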
Why is this a big deal?
The paper shows that using this "MedQ-Engine" is a game-changer:
- Small Model, Big Brain: They took a relatively small AI model (8 billion parameters) and, using this method, made it smarter than GPT-4o (a massive, top-tier model) at this specific medical task.
- Human-Level Performance: The trained model now performs within 4.34% of actual human doctors on this task.
- Efficiency: They achieved this with only 10,000 annotated images. If they had just picked images randomly, they would have needed 40,000+ images to get the same result. That's 4x more efficient.
In summary: MedQ-Engine is like a personal trainer for AI. It doesn't just make the AI run more laps; it identifies exactly which muscles are weak, designs a specific workout for those muscles, and only calls in the expensive human coach when the AI really gets stuck. The result is a medical AI that is incredibly sharp, cost-effective, and ready to help doctors ensure their images are fit for diagnosis.