Leveraging Foundation Models for Content-Based Image Retrieval in Radiology

This paper demonstrates that off-the-shelf vision foundation models, particularly the weakly supervised BiomedCLIP, can serve as highly effective, general-purpose feature extractors for content-based image retrieval in radiology, achieving performance comparable to specialized systems across 1.6 million images and 161 pathologies without requiring additional training.

Stefan Denner, David Zimmerer, Dimitrios Bounias, Markus Bujotzek, Shuhan Xiao, Raphael Stock, Lisa Kausch, Philipp Schader, Tobias Penzkofer, Paul F. Jäger, Klaus Maier-Hein

Published 2026-03-04

Imagine you are a doctor walking into a massive, chaotic library. This library doesn't have books; it has 1.6 million medical images (X-rays, MRIs, CT scans, and ultrasounds) from patients all over the world.

Your goal? You see a patient with a strange spot on their lung. You want to find other images in this giant library that look exactly like that spot so you can see how other doctors handled similar cases. This is called Content-Based Image Retrieval (CBIR). It's like asking a librarian, "Show me everything that looks like this," rather than "Show me everything labeled 'Lung Cancer'."
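Under the hood, CBIR usually works by turning every image into a feature vector (an "embedding") and then finding the library images whose vectors are closest to the query's. A minimal sketch of that nearest-neighbor step, using random vectors as stand-ins for the features a foundation model would produce (the library size and vector dimension here are illustrative, not from the paper):

```python
import numpy as np

# Toy "library": pretend each image has already been summarized as a
# feature vector by some foundation model. Random vectors stand in here.
rng = np.random.default_rng(0)
library = rng.normal(size=(1000, 512))  # 1000 images, 512-dim features
library /= np.linalg.norm(library, axis=1, keepdims=True)  # unit-normalize

def retrieve(query_vec, library, k=5):
    """Return indices of the k library images most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = library @ q                  # cosine similarity to every image
    return np.argsort(scores)[::-1][:k]  # best matches first

query = rng.normal(size=512)
top5 = retrieve(query, library, k=5)
```

The key point of the paper is that the hard part, producing embeddings where "similar-looking" images land close together, can come from an off-the-shelf model; the search itself is this simple.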

The Problem: The Old Librarians Were Too Specialized

In the past, the "librarians" (computer systems) built for this job were like specialized experts.

  • One librarian knew everything about broken bones but knew nothing about heart issues.
  • Another knew everything about the liver but couldn't tell a healthy lung from a sick one.

To use them, you had to hire a different librarian for every single disease. This was slow, expensive, and required a lot of training data for each specific job. If a patient had a rare disease no one had seen before, the librarian would just shrug and say, "I don't know."

The New Solution: The "Super-Reader" Foundation Models

The researchers in this paper asked a different question: What if we used a "Super-Reader" who has already read almost every book in the world?

These "Super-Readers" are called Foundation Models. They are massive AI systems trained on billions of images and text descriptions (like scientific articles). They haven't been taught specifically to find lung spots; they've just learned to understand everything about pictures—shapes, textures, organs, and diseases—just by looking at them.

The researchers tested these Super-Readers to see if they could act as a universal librarian without needing any extra training.

The Experiment: The Great Library Test

The team built a massive test library with 1.6 million images covering 161 different diseases and 24 body parts. They asked various AI models to find matching images.

Here is what they found, using some simple analogies:

1. The "Text-Book" Winners (BiomedCLIP)
The best performers were models like BiomedCLIP. Imagine these models as students who didn't just look at pictures; they read the captions and medical reports alongside the pictures.

  • Why they won: Because they learned the connection between the image and the words describing it, they understood the "meaning" of the picture, not just the pixels.
  • The Score: They got about 59% of the top matches correct immediately. That's impressive for a system that wasn't specifically trained for this one task!

2. The "Specialist" (The Old Way)
They also trained a "Specialist" model from scratch on their specific library.

  • The Result: The Specialist was slightly better (about 65% accuracy).
  • The Catch: To get this extra 6-percentage-point boost, the Specialist needed thousands of hours of expensive training and massive amounts of labeled data. The Foundation Models (the Super-Readers) got roughly 90% of the way there just by being "off-the-shelf" (ready to use right out of the box).
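Scores like "59% of top matches correct" are typically reported as precision@1: the fraction of queries whose single best match shares the query's label. A toy sketch of that metric (the labels and values below are made up for illustration, not from the paper's data):

```python
def precision_at_1(retrieved_labels, query_labels):
    """Fraction of queries whose top retrieved image has the right label."""
    hits = [r[0] == q for r, q in zip(retrieved_labels, query_labels)]
    return sum(hits) / len(hits)

# Toy example: 4 queries; each inner list starts with the label of the
# single best-matching library image the system returned.
retrieved = [["pneumonia"], ["fracture"], ["nodule"], ["pneumonia"]]
queries   = ["pneumonia", "effusion", "nodule", "pneumonia"]
print(precision_at_1(retrieved, queries))  # → 0.75
```

On this view, "59% vs. 65%" means the Specialist's top hit was correct for 6 more queries out of every 100, which is the gap the expensive training buys.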

3. The "Eye" vs. The "Brain" (Anatomy vs. Disease)
The researchers noticed something funny:

  • Anatomy (Body Parts): The AI was great at finding "This is a knee" or "This is a liver." It's like recognizing the shape of a house.
  • Pathology (Diseases): The AI struggled more with "This is a broken bone" or "This is a tumor." It's like trying to spot a tiny crack in the wall of that house.
  • Why? Diseases are often subtle and look very similar to healthy tissue. The AI sometimes got distracted by the big, obvious body parts and missed the small, tricky disease details.

4. The "Blurry Photo" Problem (X-Rays)
The AI worked best on ultrasounds and CT scans (which are like 3D slices or clear photos) but struggled the most with X-rays.

  • Analogy: X-rays are like looking at a 3D object squashed into a 2D shadow. It's hard to tell what's in front and what's in the back. The AI got confused by the shadows, making it harder to find exact matches.

The "Magic Number" of Images

The team also asked: How many pictures do we need in the library for the AI to get really good?
They found that once you have about 1,000 examples of a specific disease in the library, adding more doesn't help much. The AI hits a "ceiling." To get better after that, you don't need more data; you need a smarter AI.

The Big Takeaway

This paper is a game-changer because it suggests we don't need to build a new, custom AI for every single disease anymore.

  • Before: We needed a different key for every door.
  • Now: We have a Master Key (the Foundation Model) that opens almost all the doors well enough to be useful, without needing to be custom-made for each one.

While these Master Keys aren't perfect yet (they still miss some tricky diseases), they are powerful, versatile, and ready to use immediately. This means hospitals can start using smart image search tools much faster, helping doctors find similar cases and make better decisions, even for rare diseases they haven't seen before.

In short: The researchers proved that the "Super-Readers" of the AI world are ready to become the universal librarians of radiology, saving us time and money while improving patient care.