Imagine you are trying to teach a computer to "read" medical scans (like CTs and MRIs) just by looking at them and reading the doctor's notes attached to them. Teaching a model to match each image with its written report is called Language-Image Pre-training.
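Mechanically, the usual recipe for this is contrastive: the model embeds each scan and each report, then learns to give matching pairs a higher similarity score than mismatched ones. Here is a minimal sketch of that idea (in the style of CLIP); the function and tensor names are illustrative, not the paper's actual code.

```python
# Minimal sketch of contrastive language-image pre-training (CLIP-style).
# Everything here is illustrative; it is not HLIP's actual training code.
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Pull each scan's embedding toward its own report; push away the rest."""
    # Normalize so similarity is a plain dot product (cosine similarity).
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity of every image in the batch to every report in the batch.
    logits = image_features @ text_features.t() / temperature

    # The correct pairing is the diagonal: image i belongs with report i.
    targets = torch.arange(len(logits), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> report
    loss_t2i = F.cross_entropy(logits.t(), targets)  # report -> image
    return (loss_i2t + loss_t2i) / 2
```

The appeal of this recipe is that the doctor's notes themselves act as the labels: no one has to annotate anything by hand.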
For a long time, doing this for 3D scans (which are like thick stacks of bread slices) has been a nightmare. Here is why, and how this paper, HLIP, solves it.
The Problem: The "Curator" Bottleneck
Think of a 3D medical study like a huge library of books about a single patient.
- The Old Way: To teach the computer, researchers had to hire a librarian (a radiologist) to go through thousands of these libraries, pick out just one perfect page from one book, and throw the rest away. They did this for every single patient.
- The Result: This was slow, expensive, and limited how much data the computer could learn from. It was like trying to learn a language by reading only one sentence from a dictionary.
- The 3D Problem: Even if you gave the computer the whole library, standard computer brains (AI models) were designed for flat 2D pictures. A full 3D study breaks into tens of thousands of image patches, and because attention compares every patch with every other patch, the cost grows quadratically; fed a whole library, the models got overwhelmed, ran out of memory, and couldn't understand the story.
The Solution: HLIP (The Smart Librarian)
The authors, Chenhui Zhao and his team from the University of Michigan, decided to stop throwing data away. Instead, they taught the computer to read the entire library (the uncurated study) directly.
To do this, they invented a new way for the computer to pay attention, called Hierarchical Attention.
The "Russian Doll" Analogy
Imagine a 3D medical study is a set of Russian nesting dolls:
- The Study (The Big Doll): This is the whole patient file. It contains many different "books" (scans) like T1, T2, FLAIR, etc.
- The Scan (The Middle Doll): Inside the study, there are specific books. Each book has many pages.
- The Slice (The Tiny Doll): Inside each book, there are individual pages (slices) stacked on top of each other.
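To make the scale problem concrete, here is what one uncurated study looks like as raw numbers. The shapes below are toy values for illustration, not taken from the paper.

```python
# Toy sketch of how a 3D study nests: study -> scans -> slices -> patches.
# All shapes here are illustrative assumptions, not the paper's preprocessing.
import torch

num_scans, num_slices = 6, 32   # e.g. T1, T2, FLAIR, ... with 32 slices each
height, width = 224, 224

# One study = a stack of scans; each scan = a stack of 2D slices.
study = torch.randn(num_scans, num_slices, height, width)

# A ViT-style model cuts each slice into 16x16 patches ("tokens"):
patches_per_slice = (height // 16) * (width // 16)   # 196 per slice
total_tokens = num_scans * num_slices * patches_per_slice
print(total_tokens)   # 37,632 tokens for a single study
```

Attending over all 37,632 tokens at once means comparing every token with every other one, which is exactly where older models ran out of memory.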
Old AI models tried to look at every single page of every single book in the library all at once. Their brains exploded.
HLIP uses a smart strategy:
- It looks at a few pages together to understand the Scan.
- It looks at a few scans together to understand the Study.
- It only zooms out to look at the whole library when it really needs to connect the dots.
This is like reading a book: you don't try to memorize every letter on every page simultaneously. You read a sentence, then a paragraph, then a chapter, and finally the whole story. HLIP does this for medical scans, making it fast and efficient.
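In code, the trick is to reshape one set of tokens so that attention runs within a single level of the hierarchy at a time. Below is a minimal PyTorch sketch of that idea; the class name, dimensions, and the choice of which layers run at which level are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of hierarchical attention over (study, scan, slice) levels.
# Names, shapes, and level placement are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalAttention(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, level):
        # x: (batch, scans, slices, tokens_per_slice, dim)
        B, S, L, T, D = x.shape
        if level == "slice":      # attend only within each slice (cheapest)
            x = x.reshape(B * S * L, T, D)
        elif level == "scan":     # attend across all slices of one scan
            x = x.reshape(B * S, L * T, D)
        else:                     # "study": attend across the whole study
            x = x.reshape(B, S * L * T, D)
        out, _ = self.attn(x, x, x)
        return out.reshape(B, S, L, T, D)

# Tiny toy study: 2 scans, 4 slices each, 16 tokens per slice.
x = torch.randn(1, 2, 4, 16, 384)
block = HierarchicalAttention()
x = block(x, level="slice")   # most layers: cheap, local attention
x = block(x, level="study")   # a few layers: one global pass over everything
```

Because most layers would attend only within a slice or a scan, the quadratic cost of attention stays small; only the occasional study-level layer pays the full price of looking at everything at once.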
The Results: Superpowers for the AI
They trained this new "Smart Librarian" (HLIP) on a massive amount of real-world data:
- 220,000 Brain MRI studies (3.13 million scans).
- 240,000 Head CT studies (1.44 million scans).
They didn't ask a single radiologist to pick out "good" pages. They just fed the computer the raw, messy, real-world data.
What happened?
- Brain MRI: The AI's accuracy at diagnosing brain diseases (like strokes and tumors) improved by 10.5%, without it ever being explicitly told what to look for.
- Head CT: It became significantly better at spotting brain bleeds and fractures compared to previous top-tier models.
- Generalization: Even when tested on chest CTs (which it wasn't specifically trained on), it performed better than models trained on much smaller, "curated" datasets.
Why This Matters
Think of the old way as trying to learn to drive by only practicing on a closed, perfect track with a coach holding your hand.
HLIP is like letting the student drive on real highways, in the rain, in traffic, with no coach in the passenger seat.
Because HLIP can handle the "messy" real world, it can learn from millions of patient records instead of just thousands. This means:
- Scalability: We can now train AI on the massive amounts of data hospitals already have, without needing expensive human help to clean it up.
- Better Diagnosis: The AI learns from the full picture, not just a tiny slice, making it more accurate at spotting diseases.
- Future Ready: This opens the door for AI that can understand complex, multi-part medical stories, just like a human doctor does.
In short, HLIP is the key that unlocks the potential of the world's biggest medical databases, teaching AI to read the whole story, not just the highlights.