Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

This study demonstrates that for cell-level histopathological image analysis under extreme spatial constraints, task-specific architectures trained on sufficient data outperform foundation models in both accuracy and efficiency, while offering comparable robustness to blur perturbations.

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi, Takaaki Tachibana, Ryota Ito, Mitsugu Fujita, Kimihiro Yamashita, Yoshihiro Kakeji

Published 2026-03-05

The Big Picture: Looking at a Grain of Sand

Imagine you are trying to identify different types of ants. Usually, scientists look at a whole ant colony or a large group of ants to figure out what species they are. They have plenty of space to see the ants' legs, wings, and how they move together.

But in this study, the researchers had a much harder job. They were forced to examine one single ant at a time, and the image they had to work with was incredibly tiny: like trying to identify an ant's species from a photograph no bigger than a grain of sand.

In the world of medical imaging, this is called "cell-level analysis" on "small patches" (40x40 pixels). It's like trying to recognize a famous actor from a postage-stamp-sized crop of just their nose.

The Question: Do "Super-Brains" Help?

In the world of Artificial Intelligence (AI), there are "Foundation Models." Think of these as super-brains that have read millions of books and looked at millions of standard-sized photos (like cats, dogs, and cars) to learn how to recognize things. They are huge, expensive, and very powerful.

The researchers asked: "If we give these super-brains a tiny, grain-of-sand image, will they still be the best at identifying the cell?"

They compared these giant, pre-trained super-brains against smaller, custom-built "task-specific" models (like a specialized tool made just for this one job).

The Experiment: Training the Models

The researchers gathered a massive amount of data: 185,000 tiny cell images from colorectal cancer patients. They then trained different types of AI models on this data, starting with very small amounts and gradually increasing the size, like a student studying for a test.

They tested three main things:

  1. How much data do they need? (Does the super-brain need less study time than the custom tool?)
  2. How fast are they? (Can they make a diagnosis quickly?)
  3. Are they tough? (What happens if the image is blurry, like looking through a dirty window?)
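The data-scaling part of the experiment can be sketched as a simple loop: train each model on progressively larger subsets and record accuracy. This is a minimal illustration, not the authors' code; the `train_and_eval` helper is hypothetical and just simulates plausible learning curves that match the reported trend (foundation models start strong but plateau, the custom model keeps climbing).

```python
import math

def train_and_eval(model_name, n_per_class):
    """Hypothetical placeholder: pretend to train `model_name` on
    `n_per_class` samples per class and return held-out accuracy.
    The curves are simulated, not real results."""
    if model_name == "foundation":
        # Strong start thanks to pretraining, but an early plateau.
        return min(0.78, 0.60 + 0.05 * math.log2(n_per_class / 250))
    else:
        # Task-specific model: slow start, keeps improving with more data.
        return min(0.92, 0.40 + 0.09 * math.log2(n_per_class / 250))

# Mimic the study's design: grow the training set up to 16,000 per class.
for n in (250, 1000, 4000, 16000):
    fm = train_and_eval("foundation", n)
    custom = train_and_eval("custom_vit", n)
    print(f"{n:>6} samples/class: foundation={fm:.2f}, custom={custom:.2f}")
```

With these simulated curves, the foundation model leads at small sample sizes and the custom model overtakes it by 16,000 samples per class, which is the crossover pattern the study describes.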

The Surprising Results

1. The "Super-Brain" Hits a Wall

At first, when the researchers had very little data (like a student who only studied for 1 hour), the Foundation Models (Super-Brains) were the winners. Because they had already learned so much from other photos, they could guess the answer even with very little new information.

However, as the researchers gave the models more and more data (up to 16,000 samples per class), the Super-Brains stopped improving. They hit a ceiling. They couldn't learn any more from the tiny grain-of-sand images because their training was based on big, clear photos.

2. The "Custom Tool" Wins with Practice

The Custom-Built Models (specifically a type called CustomViT, which is a Vision Transformer designed for small patches) started slow. They needed a lot of data to get good. But once they had enough data, they kept getting better and better.

Eventually, the CustomViT crushed the Super-Brains. It achieved a much higher accuracy (92% vs 78%) and did it much faster.

  • Analogy: Imagine the Super-Brain is a famous chef who knows how to cook a 10-course banquet but struggles to make a perfect single cracker. The CustomViT is a specialized cracker-maker who, after practicing a lot, makes the perfect cracker every time, faster and cheaper than the famous chef.

3. The "Blur" Test

The researchers also tested what happens if the images are blurry (like a camera out of focus).

  • Result: Both the Super-Brains and the Custom Tools got confused when the images were very blurry.
  • Key Takeaway: Having a "Super-Brain" didn't make the AI more immune to blur. If the picture is too fuzzy, even the smartest AI can't see the details. The Super-Brain wasn't "tougher"; it just started with a higher score on clear pictures.
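To make the blur test concrete, here is a minimal sketch of the kind of perturbation involved, assuming a standard Gaussian blur (the paper's exact blur settings are not given here, and the "patch" below is random noise rather than real cell data). The point it demonstrates: blurring irreversibly smooths away fine detail, so no model, however large, can recover it.

```python
import math
import random

def gaussian_kernel_1d(sigma, radius):
    """Discrete 1-D Gaussian kernel, normalized to sum to 1."""
    vals = [math.exp(-(x * x) / (2 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    s = sum(vals)
    return [v / s for v in vals]

def blur_patch(patch, sigma):
    """Separable Gaussian blur of a 2-D grayscale patch (list of lists),
    clamping at the edges -- the 'out of focus' perturbation."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel_1d(sigma, radius)
    h, w = len(patch), len(patch[0])
    clamp = lambda i, hi: max(0, min(hi, i))
    # Horizontal pass, then vertical pass.
    tmp = [[sum(k[j + radius] * patch[y][clamp(x + j, w - 1)]
                for j in range(-radius, radius + 1)) for x in range(w)]
           for y in range(h)]
    return [[sum(k[j + radius] * tmp[clamp(y + j, h - 1)][x]
                 for j in range(-radius, radius + 1)) for x in range(w)]
            for y in range(h)]

def variance(p):
    """Pixel variance: a crude proxy for how much fine detail remains."""
    flat = [v for row in p for v in row]
    m = sum(flat) / len(flat)
    return sum((v - m) ** 2 for v in flat) / len(flat)

# A random 40x40 "cell patch"; increasing sigma mimics a camera losing focus.
random.seed(0)
patch = [[random.random() for _ in range(40)] for _ in range(40)]
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: remaining detail (variance) = "
          f"{variance(blur_patch(patch, sigma)):.4f}")
```

The variance shrinks as sigma grows: the high-frequency detail that distinguishes one cell type from another is simply gone from the input, which is why the study found that pretraining did not buy extra robustness.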

The Final Verdict

When you have a tiny, low-resolution image (like a single cell):

  • Don't rely on the "Super-Brain" (Foundation Models) if you have enough data. They are too big, too slow, and they can't adapt well to such small details. They are like using a sledgehammer to crack a nut.
  • Do build a "Custom Tool" (Task-Specific Model). If you have a decent amount of data, a model built specifically for tiny images is faster, cheaper, and more accurate.

In short: For looking at tiny cells, a specialized, custom-built AI is the best doctor. The giant, pre-trained AI models are great for big pictures, but they get lost when the view gets too small.