PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

PRIMA is a novel multi-modal framework for medical diagnosis. It integrates risk-disease correlations via RAG-refined text encoding, and uses a dual-encoder pre-training strategy with specialized loss functions to align visual features with clinical metadata, achieving state-of-the-art performance and robustness without requiring massive datasets or extensive computational resources.

Yiqing Wang, Chunming He, Ming-Chen Lu, Mercy Pawar, Leslie Niziol, Maria Woodward, Sina Farsiu

Published 2026-02-27

Imagine you are trying to solve a complex medical mystery, like identifying a specific type of skin cancer from a photo.

The Old Way (The "Single-View" Detective)
Traditionally, AI doctors have been like detectives who only look at the crime scene photo. They might see a dark spot on a picture and guess, "That looks like a mole." But they don't know the patient's history. They don't know if the patient spends all day in the sun, has a family history of cancer, or if the spot is growing fast. Because they lack this context, they often get it wrong. It's like trying to guess the ending of a movie by only looking at one frame.

The New Way: PRIMA (The "Super-Consultant" Team)
The authors of this paper created a new system called PRIMA. Think of PRIMA not as a single detective, but as a super-team of specialists working together to solve the case.

Here is how PRIMA works, broken down into three simple steps:

1. The Knowledge Librarian (Stage 1)

Before the team even looks at a patient, they need to study.

  • The Problem: Standard AI models are like general students; they know a lot of things but aren't experts in specific medical details.
  • The PRIMA Solution: The team hires a "Knowledge Librarian" (a retrieval technique called RAG, short for Retrieval-Augmented Generation). This librarian reads thousands of medical textbooks and research papers. It doesn't just memorize facts; it creates a cheat sheet that connects specific risk factors (like "sun exposure" or "age") to specific diseases. A toy code sketch follows this list.
  • The Analogy: Imagine a detective who, before leaving the station, reads every case file from the last 50 years. Now, when they see a clue, they instantly remember, "Ah, this specific clue usually appears in cases involving X, not Y."
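To make the librarian concrete, here is a toy Python sketch of that retrieval step. Everything in it is illustrative: the corpus snippets, the `embed` stand-in, and the `retrieve` helper are assumptions standing in for a real embedding model and a real medical literature index, not the paper's actual pipeline.

```python
# A minimal sketch of the "Knowledge Librarian": retrieve literature
# snippets relevant to a patient's risk factors, then prepend them to
# the clinical text before it reaches the text encoder.
import numpy as np

# Toy "medical literature" linking risk factors to diseases.
CORPUS = [
    "Chronic sun exposure is a major risk factor for melanoma.",
    "Fair skin and frequent sunburns increase basal cell carcinoma risk.",
    "Family history of melanoma roughly doubles an individual's risk.",
]

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: bag-of-words hashed into a fixed-size vector.
    A real system would call a sentence-embedding model instead."""
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus snippets most similar to the query."""
    q = embed(query)
    scores = [float(q @ embed(doc)) for doc in CORPUS]
    top = np.argsort(scores)[::-1][:k]
    return [CORPUS[i] for i in top]

# Patient metadata becomes the query; the retrieved facts become the
# "cheat sheet" attached to it.
meta = "68-year-old outdoor worker, heavy sun exposure, new dark lesion"
refined_text = " ".join(retrieve(meta)) + " [SEP] " + meta
print(refined_text)
```

The design point: retrieval happens before training, so the text encoder sees risk factors already paired with likely diseases instead of raw metadata alone.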

2. The Double-Check System (Stage 2)

Now the team examines the patient with two specialists: one looks at the Photo (the skin lesion), and the other reads the Patient's File (age, history, symptoms).

  • The Problem: Photos and text speak different languages. A photo shows "irregular edges," while the text says "asymmetrical." The AI needs to understand that these two things mean the same thing.
  • The PRIMA Solution: They use a special training game with four different rules (Loss Functions) to force the photo specialist and the text specialist to agree with each other; a code sketch of these rules follows the analogy below.
    1. Consistency Check: If we take two photos of the same patient, the AI must agree they look similar.
    2. Big Picture Match: The AI must match the general "vibe" of the photo with the general "vibe" of the text.
    3. Detail Match: The AI must link specific parts of the photo (like a red spot) to specific words in the text (like "inflammation").
    4. The "Soft" Guess: Sometimes, the text isn't 100% clear. The AI uses "soft labels" to say, "This looks 70% like Disease A and 30% like Disease B," rather than forcing a wrong 100% guess.
  • The Analogy: It's like a translator and a photographer working together. The photographer says, "I see a red, jagged shape." The translator says, "That matches the description of 'irregular border' in the medical file." They keep practicing until they are perfectly in sync.
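For readers who think in code, here is a minimal PyTorch sketch of what four such objectives could look like. The function names, tensor shapes, temperatures, and exact formulations are my assumptions; the paper's actual losses may differ in detail.

```python
import torch
import torch.nn.functional as F

def consistency_loss(z1, z2):
    """Rule 1: two views of the same patient's image should embed alike."""
    return 1.0 - F.cosine_similarity(z1, z2, dim=-1).mean()

def global_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Rule 2: CLIP-style contrastive match between whole-image and
    whole-text embeddings across a batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def local_alignment_loss(patch_emb, token_emb):
    """Rule 3: each image patch should find at least one text token it
    matches well (e.g. a red region <-> the word 'inflammation')."""
    sim = torch.einsum("bpd,btd->bpt",
                       F.normalize(patch_emb, dim=-1),
                       F.normalize(token_emb, dim=-1))
    return 1.0 - sim.max(dim=-1).values.mean()

def soft_label_loss(logits, soft_targets):
    """Rule 4: learn from probabilistic labels (70% A / 30% B) via KL
    divergence instead of a hard one-hot cross-entropy."""
    return F.kl_div(F.log_softmax(logits, dim=-1),
                    soft_targets, reduction="batchmean")

# Toy usage with random tensors: batch of 4, 128-dim embeddings,
# 16 patches, 12 tokens, 3 candidate diseases.
B, D = 4, 128
total = (consistency_loss(torch.randn(B, D), torch.randn(B, D))
         + global_alignment_loss(torch.randn(B, D), torch.randn(B, D))
         + local_alignment_loss(torch.randn(B, 16, D), torch.randn(B, 12, D))
         + soft_label_loss(torch.randn(B, 3),
                           torch.tensor([[0.7, 0.3, 0.0]] * B)))
print(total)
```

In a real training run, the four terms would be weighted and summed into a single objective that the two encoders minimize together.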

3. The Final Judge (Stage 3)

Once the photo and the text are perfectly synced, they hand their notes to the Final Judge (a powerful Large Language Model called Qwen-3).

  • The Problem: Sometimes AI gets confused and makes up facts (hallucinations).
  • The PRIMA Solution: The Judge is given strict rules. It can only choose from a pre-approved list of diseases (like a multiple-choice test; see the sketch after this list). It takes the combined wisdom of the photo and the text and picks the best answer.
  • The Analogy: Imagine a brilliant professor (the Judge) who listens to the detective's report and the medical file. Instead of writing a long, confusing essay, the professor just circles the correct answer on a multiple-choice sheet, ensuring the diagnosis is precise and safe.
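As a rough illustration of that multiple-choice guardrail, here is a Python sketch. The prompt wording, the `llm_generate` stub, and the option list are all hypothetical; the point is only that the model's reply is rejected unless it maps onto the pre-approved disease list.

```python
# Hypothetical multiple-choice guardrail; `llm_generate` is a stub
# standing in for a real LLM call (e.g. to Qwen-3), not the paper's API.
DISEASES = ["melanoma", "basal cell carcinoma", "benign nevus"]

def llm_generate(prompt: str) -> str:
    """Stub: a real system would query the LLM here."""
    return "A"

def diagnose(image_findings: str, metadata: str) -> str:
    options = "\n".join(f"{chr(65 + i)}. {d}" for i, d in enumerate(DISEASES))
    prompt = (
        "Based on the findings below, answer with a single letter only.\n"
        f"Image findings: {image_findings}\n"
        f"Patient metadata: {metadata}\n"
        f"Options:\n{options}\n"
        "Answer:"
    )
    reply = llm_generate(prompt).strip()
    valid = {chr(65 + i) for i in range(len(DISEASES))}
    # Anything outside A..C is rejected, so a free-text hallucination
    # can never become a diagnosis.
    if not reply or reply[0].upper() not in valid:
        raise ValueError(f"Answer outside the approved list: {reply!r}")
    return DISEASES[ord(reply[0].upper()) - 65]

print(diagnose("irregular dark lesion, asymmetric border",
               "68-year-old, heavy sun exposure"))
```

This is the same trick as a multiple-choice exam: the grader never has to parse an essay, so there is nothing for it to make up.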

Why is this a big deal?

  • It's Smarter: By combining the picture with the patient's story, PRIMA makes fewer mistakes than systems that only look at pictures.
  • It's Efficient: You don't need millions of patient photos to train it. Because the "Knowledge Librarian" already read the textbooks, the system learns faster and needs less data.
  • It's Robust: Even if the data is messy or the disease is rare, PRIMA uses its "cheat sheet" of medical knowledge to make a smart guess.

In short: PRIMA is like upgrading a medical AI from a student who only looks at pictures, to a seasoned specialist who reads the patient's entire history, consults the latest medical books, and then makes a diagnosis with high confidence.
