LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation

The paper introduces Fact-Flow, a novel framework that enhances the factual accuracy of MLLM-based medical report generation by decoupling visual fact identification from text generation and utilizing an LLM-bootstrapped pipeline to create labeled training data without manual annotation.

Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie, Yifan Huang, Lei Wu, Wei Han, Haishuai Wang, Jiajun Bu

Published 2026-03-03

Imagine you are trying to teach a brilliant but slightly scatterbrained artist (an AI) how to write a medical report based on an X-ray or an eye scan.

The Problem: The "Hallucinating Artist"
Currently, if you show this artist a picture of a broken bone and ask, "What do you see?", they might confidently describe the fracture in a beautiful story, but they might also accidentally invent a second injury that isn't there, or forget to mention a tiny crack that is actually critical. In the medical world, making things up (hallucinating) or missing details is dangerous.

The old way of training these artists was to just show them a picture and the final report, hoping they would learn the connection. But they often get the facts wrong because they are trying to "see" the image and "write" the story at the exact same time, which is too much mental juggling.

The Solution: Fact-Flow (The "Two-Step Detective")
The authors of this paper, "Fact-Flow," propose a smarter way to train the AI. They break the job into two distinct steps, like a detective team working together:

  1. Step 1: The "Fact Finder" (The Labeler)
    Before writing a single word of the report, a specialized AI (the Fact Finder) looks at the image and simply checks off a list of things it sees.

    • Analogy: Imagine a security guard at a museum. They don't write a novel about the painting; they just tick a box: "Yes, there is a vase," "Yes, there is a crack," "No, there is no fire."
    • This step forces the AI to be honest about what is actually there before it tries to be creative.
  2. Step 2: The "Storyteller" (The Report Writer)
    The main AI (the Storyteller) then takes the image and the checklist from Step 1. It is told: "Here is the picture, and here is the list of confirmed facts. Now, write a professional medical report based only on these facts."

    • Analogy: This is like giving a writer a strict outline. They can't invent new characters or surprise plot twists because they have to stick to the facts provided in the outline.
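The two steps above can be sketched in code. This is an illustrative stub, not the paper's actual implementation: `fact_finder` stands in for the specialized checklist model (in practice an MLLM classification head), and the checklist is simply injected into the writer's prompt to constrain generation. All names here (`FINDING_LIST`, `STUB_ANSWERS`, `build_report_prompt`) are hypothetical.

```python
# Hypothetical sketch of Fact-Flow's two-step inference.
# Step 1 (Fact Finder): check each candidate finding against the image.
# Step 2 (Storyteller): write a report constrained to the confirmed facts.

FINDING_LIST = ["fracture", "hairline crack", "soft-tissue swelling"]

# Stubbed answers standing in for a real visual classifier's output.
STUB_ANSWERS = {"fracture": True, "hairline crack": True, "soft-tissue swelling": False}

def fact_finder(image):
    """Step 1: tick a yes/no box for every finding on the checklist."""
    return {finding: STUB_ANSWERS[finding] for finding in FINDING_LIST}

def build_report_prompt(findings):
    """Step 2: inject the confirmed checklist into the report writer's prompt,
    so the generator can only elaborate on verified facts."""
    present = [f for f, seen in findings.items() if seen]
    absent = [f for f, seen in findings.items() if not seen]
    return (
        "Write a professional medical report using ONLY these confirmed facts.\n"
        f"Present: {', '.join(present) or 'none'}\n"
        f"Absent: {', '.join(absent) or 'none'}"
    )

findings = fact_finder(image=None)  # image omitted in this stub
prompt = build_report_prompt(findings)
print(prompt)
```

The key design point is the decoupling: the writer never sees the raw "decide what exists" problem, only a checklist it must respect.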

The Magic Trick: How do we get the checklist?
You might ask, "Who writes these checklists? Do doctors have to spend hours labeling every single X-ray?" That would be too expensive and slow.

The authors used a clever "bootstrapping" trick:

  • They took existing medical reports (which are just text) and asked a super-smart Large Language Model (LLM) to read them and extract the key facts automatically.
  • Analogy: It's like asking a librarian to read a thousand books and automatically create a master index of all the topics mentioned, without needing a human to read every page and write a tag. This created a massive training dataset for free.
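The bootstrapping idea can be sketched as well. In the paper an LLM reads each existing report and extracts structured finding labels; here a toy keyword matcher stands in for the LLM call so the flow is runnable. The finding list and report text are made up for illustration.

```python
# Hypothetical sketch of LLM-bootstrapped labeling: turn existing report
# text into checklist labels automatically, with no human annotation.
# A naive keyword matcher stands in for the real LLM extraction prompt.

FINDING_LIST = ["fracture", "hairline crack", "effusion"]

def extract_findings(report_text):
    """Stand-in for an LLM prompt such as:
    'Which of these findings does the report mention as present?'"""
    text = report_text.lower()
    return {finding: finding in text for finding in FINDING_LIST}

report = "There is a hairline crack in the tibia. No effusion is seen."
labels = extract_findings(report)
print(labels)
```

Note the deliberate failure: the matcher flags "effusion" as present even though the report says "No effusion is seen." Handling negation and phrasing like this is exactly why an LLM, rather than keyword matching, is used to build the labels.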

The Results
When they tested this "Two-Step Detective" system on real medical data (like chest X-rays for tuberculosis and eye scans for retinal issues):

  • Fewer Lies: The AI stopped making up fake diseases.
  • Better Memory: It stopped forgetting important details.
  • Still Good Writing: The reports were still easy to read and sounded professional, just like before.

In a Nutshell
Think of Fact-Flow as putting a "fact-checker" in the room before the "writer" starts typing. By separating the job of finding the truth from the job of telling the story, the AI becomes much more reliable, making it safer to use in real hospitals.