Leveraging Foundation Models for Content-Based Image Retrieval in Radiology

This paper demonstrates that off-the-shelf vision foundation models, particularly the weakly supervised BiomedCLIP, can serve as highly effective, general-purpose feature extractors for content-based image retrieval in radiology, achieving performance comparable to specialized systems across 1.6 million images and 161 pathologies without requiring additional training.

Stefan Denner, David Zimmerer, Dimitrios Bounias, Markus Bujotzek, Shuhan Xiao, Raphael Stock, Lisa Kausch, Philipp Schader, Tobias Penzkofer, Paul F. Jäger, Klaus Maier-Hein

Published 2026-03-04

Imagine you are a doctor walking into a massive, chaotic library. This library doesn't have books; it has 1.6 million medical images (X-rays, MRIs, CT scans, and ultrasounds) from patients all over the world.

Your goal? You see a patient with a strange spot on their lung. You want to find other images in this giant library that look exactly like that spot so you can see how other doctors handled similar cases. This is called Content-Based Image Retrieval (CBIR). It's like asking a librarian, "Show me everything that looks like this," rather than "Show me everything labeled 'Lung Cancer'."
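Under the hood, CBIR usually works by turning every image into a feature vector (an "embedding") and then finding the library images whose vectors are closest to the query's. A minimal sketch of that nearest-neighbor step, using random vectors as stand-ins for the features a foundation model would produce (the library size and vector dimension here are illustrative, not from the paper):

```python
import numpy as np

# Toy "library": pretend each image has already been summarized as a
# feature vector by some foundation model. Random vectors stand in here.
rng = np.random.default_rng(0)
library = rng.normal(size=(1000, 512))  # 1000 images, 512-dim features
library /= np.linalg.norm(library, axis=1, keepdims=True)  # unit-normalize

def retrieve(query_vec, library, k=5):
    """Return indices of the k library images most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = library @ q                  # cosine similarity to every image
    return np.argsort(scores)[::-1][:k]  # best matches first

query = rng.normal(size=512)
top5 = retrieve(query, library, k=5)
```

The key point of the paper is that the hard part, producing embeddings where "similar-looking" images land close together, can come from an off-the-shelf model; the search itself is this simple.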

The Problem: The Old Librarians Were Too Specialized

In the past, the "librarians" (computer systems) built for this job were like specialized experts.

  • One librarian knew everything about broken bones but knew nothing about heart issues.
  • Another knew everything about the liver but couldn't tell a healthy lung from a sick one.

To use them, you had to hire a different librarian for every single disease. This was slow, expensive, and required a lot of training data for each specific job. If a patient had a rare disease no one had seen before, the librarian would just shrug and say, "I don't know."

The New Solution: The "Super-Reader" Foundation Models

The researchers in this paper asked a different question: What if we used a "Super-Reader" who has already read almost every book in the world?

These "Super-Readers" are called Foundation Models. They are massive AI systems trained on billions of images and text descriptions (like scientific articles). They haven't been taught specifically to find lung spots; they've just learned to understand everything about pictures—shapes, textures, organs, and diseases—just by looking at them.

The researchers tested these Super-Readers to see if they could act as a universal librarian without needing any extra training.

The Experiment: The Great Library Test

The team built a massive test library with 1.6 million images covering 161 different diseases and 24 body parts. They asked various AI models to find matching images.

Here is what they found, using some simple analogies:

1. The "Text-Book" Winners (BiomedCLIP)
The best performers were models like BiomedCLIP. Imagine these models as students who didn't just look at pictures; they read the captions and medical reports alongside the pictures.

  • Why they won: Because they learned the connection between the image and the words describing it, they understood the "meaning" of the picture, not just the pixels.
  • The Score: They got about 59% of the top matches correct immediately. That's impressive for a system that wasn't specifically trained for this one task!

2. The "Specialist" (The Old Way)
They also trained a "Specialist" model from scratch on their specific library.

  • The Result: The Specialist was slightly better (about 65% accuracy).
  • The Catch: To get this extra 6-percentage-point boost, the Specialist needed thousands of hours of expensive training and massive amounts of labeled data. The Foundation Models (the Super-Readers) got roughly 90% of the way there just by being "off-the-shelf" (ready to use right out of the box).
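Scores like "59% of top matches correct" are typically reported as precision@1: the fraction of queries whose single best match shares the query's label. A toy sketch of that metric (the labels and values below are made up for illustration, not from the paper's data):

```python
def precision_at_1(retrieved_labels, query_labels):
    """Fraction of queries whose top retrieved image has the right label."""
    hits = [r[0] == q for r, q in zip(retrieved_labels, query_labels)]
    return sum(hits) / len(hits)

# Toy example: 4 queries; each inner list starts with the label of the
# single best-matching library image the system returned.
retrieved = [["pneumonia"], ["fracture"], ["nodule"], ["pneumonia"]]
queries   = ["pneumonia", "effusion", "nodule", "pneumonia"]
print(precision_at_1(retrieved, queries))  # → 0.75
```

On this view, "59% vs. 65%" means the Specialist's top hit was correct for 6 more queries out of every 100, which is the gap the expensive training buys.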

3. The "Eye" vs. The "Brain" (Anatomy vs. Disease)
The researchers noticed something funny:

  • Anatomy (Body Parts): The AI was great at finding "This is a knee" or "This is a liver." It's like recognizing the shape of a house.
  • Pathology (Diseases): The AI struggled more with "This is a broken bone" or "This is a tumor." It's like trying to spot a tiny crack in the wall of that house.
  • Why? Diseases are often subtle and look very similar to healthy tissue. The AI sometimes got distracted by the big, obvious body parts and missed the small, tricky disease details.

4. The "Blurry Photo" Problem (X-Rays)
The AI worked best on ultrasounds and CT scans (which are like 3D slices or clear photos) but struggled the most with X-rays.

  • Analogy: X-rays are like looking at a 3D object squashed into a 2D shadow. It's hard to tell what's in front and what's in the back. The AI got confused by the shadows, making it harder to find exact matches.

The "Magic Number" of Images

The team also asked: How many pictures do we need in the library for the AI to get really good?
They found that once you have about 1,000 examples of a specific disease in the library, adding more doesn't help much. The AI hits a "ceiling." To get better after that, you don't need more data; you need a smarter AI.

The Big Takeaway

This paper is a game-changer because it suggests we don't need to build a new, custom AI for every single disease anymore.

  • Before: We needed a different key for every door.
  • Now: We have a Master Key (the Foundation Model) that opens almost all the doors well enough to be useful, without needing to be custom-made for each one.

While these Master Keys aren't perfect yet (they still miss some tricky diseases), they are powerful, versatile, and ready to use immediately. This means hospitals can start using smart image search tools much faster, helping doctors find similar cases and make better decisions, even for rare diseases they haven't seen before.

In short: The researchers proved that the "Super-Readers" of the AI world are ready to become the universal librarians of radiology, saving us time and money while improving patient care.