Revisiting the Role of Foundation Models in Cell-Level Histopathological Image Analysis under Small-Patch Constraints -- Effects of Training Data Scale and Blur Perturbations on CNNs and Vision Transformers

This study demonstrates that for cell-level histopathological image analysis under extreme spatial constraints, task-specific architectures trained on sufficient data outperform foundation models in both accuracy and efficiency, while offering comparable robustness to blur perturbations.

Hiroki Kagiyama, Toru Nagasaka, Yukari Adachi, Takaaki Tachibana, Ryota Ito, Mitsugu Fujita, Kimihiro Yamashita, Yoshihiro Kakeji

Published 2026-03-05

The Big Picture: Looking at a Grain of Sand

Imagine you are trying to identify different types of ants. Usually, scientists look at a whole ant colony or a large group of ants to figure out what species they are. They have plenty of space to see the ants' legs, wings, and how they move together.

But in this study, the researchers had a much harder job. They were forced to examine one single ant at a time, and the image they had to work with was incredibly tiny: like trying to identify an ant's species from a photograph no bigger than a grain of sand.

In the world of medical imaging, this is called "cell-level analysis" on "small patches" (40x40 pixels). It's like trying to recognize a famous actor from a postage-stamp-sized crop of just their nose.

The Question: Do "Super-Brains" Help?

In the world of Artificial Intelligence (AI), there are "Foundation Models." Think of these as super-brains that have read millions of books and looked at millions of standard-sized photos (like cats, dogs, and cars) to learn how to recognize things. They are huge, expensive, and very powerful.

The researchers asked: "If we give these super-brains a tiny, grain-of-sand image, will they still be the best at identifying the cell?"

They compared these giant, pre-trained super-brains against smaller, custom-built "task-specific" models (like a specialized tool made just for this one job).

The Experiment: Training the Models

The researchers gathered a massive amount of data: 185,000 tiny cell images from colorectal cancer patients. They then trained different types of AI models on this data, starting with very small amounts and gradually increasing the size, like a student studying for a test.

They tested three main things:

  1. How much data do they need? (Does the super-brain need less study time than the custom tool?)
  2. How fast are they? (Can they make a diagnosis quickly?)
  3. Are they tough? (What happens if the image is blurry, like looking through a dirty window?)
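The data-scaling part of the experiment can be sketched as a simple loop: train each model on progressively larger subsets and record accuracy. This is a minimal illustration, not the authors' code; the `train_and_eval` helper is hypothetical and just simulates plausible learning curves that match the reported trend (foundation models start strong but plateau, the custom model keeps climbing).

```python
import math

def train_and_eval(model_name, n_per_class):
    """Hypothetical placeholder: pretend to train `model_name` on
    `n_per_class` samples per class and return held-out accuracy.
    The curves are simulated, not real results."""
    if model_name == "foundation":
        # Strong start thanks to pretraining, but an early plateau.
        return min(0.78, 0.60 + 0.05 * math.log2(n_per_class / 250))
    else:
        # Task-specific model: slow start, keeps improving with more data.
        return min(0.92, 0.40 + 0.09 * math.log2(n_per_class / 250))

# Mimic the study's design: grow the training set up to 16,000 per class.
for n in (250, 1000, 4000, 16000):
    fm = train_and_eval("foundation", n)
    custom = train_and_eval("custom_vit", n)
    print(f"{n:>6} samples/class: foundation={fm:.2f}, custom={custom:.2f}")
```

With these simulated curves, the foundation model leads at small sample sizes and the custom model overtakes it by 16,000 samples per class, which is the crossover pattern the study describes.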

The Surprising Results

1. The "Super-Brain" Hits a Wall

At first, when the researchers had very little data (like a student who only studied for 1 hour), the Foundation Models (Super-Brains) were the winners. Because they had already learned so much from other photos, they could guess the answer even with very little new information.

However, as the researchers gave the models more and more data (up to 16,000 samples per class), the Super-Brains stopped improving. They hit a ceiling. They couldn't learn any more from the tiny grain-of-sand images because their training was based on big, clear photos.

2. The "Custom Tool" Wins with Practice

The Custom-Built Models (specifically a type called CustomViT, which is a Vision Transformer designed for small patches) started slow. They needed a lot of data to get good. But once they had enough data, they kept getting better and better.

Eventually, the CustomViT crushed the Super-Brains. It achieved a much higher accuracy (92% vs 78%) and did it much faster.

  • Analogy: Imagine the Super-Brain is a famous chef who knows how to cook a 10-course banquet but struggles to make a perfect single cracker. The CustomViT is a specialized cracker-maker who, after practicing a lot, makes the perfect cracker every time, faster and cheaper than the famous chef.

3. The "Blur" Test

The researchers also tested what happens if the images are blurry (like a camera out of focus).

  • Result: Both the Super-Brains and the Custom Tools got confused when the images were very blurry.
  • Key Takeaway: Having a "Super-Brain" didn't make the AI more immune to blur. If the picture is too fuzzy, even the smartest AI can't see the details. The Super-Brain wasn't "tougher"; it just started with a higher score on clear pictures.
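To make the blur test concrete, here is a minimal sketch of the kind of perturbation involved, assuming a standard Gaussian blur (the paper's exact blur settings are not given here, and the "patch" below is random noise rather than real cell data). The point it demonstrates: blurring irreversibly smooths away fine detail, so no model, however large, can recover it.

```python
import math
import random

def gaussian_kernel_1d(sigma, radius):
    """Discrete 1-D Gaussian kernel, normalized to sum to 1."""
    vals = [math.exp(-(x * x) / (2 * sigma * sigma))
            for x in range(-radius, radius + 1)]
    s = sum(vals)
    return [v / s for v in vals]

def blur_patch(patch, sigma):
    """Separable Gaussian blur of a 2-D grayscale patch (list of lists),
    clamping at the edges -- the 'out of focus' perturbation."""
    radius = max(1, int(3 * sigma))
    k = gaussian_kernel_1d(sigma, radius)
    h, w = len(patch), len(patch[0])
    clamp = lambda i, hi: max(0, min(hi, i))
    # Horizontal pass, then vertical pass.
    tmp = [[sum(k[j + radius] * patch[y][clamp(x + j, w - 1)]
                for j in range(-radius, radius + 1)) for x in range(w)]
           for y in range(h)]
    return [[sum(k[j + radius] * tmp[clamp(y + j, h - 1)][x]
                 for j in range(-radius, radius + 1)) for x in range(w)]
            for y in range(h)]

def variance(p):
    """Pixel variance: a crude proxy for how much fine detail remains."""
    flat = [v for row in p for v in row]
    m = sum(flat) / len(flat)
    return sum((v - m) ** 2 for v in flat) / len(flat)

# A random 40x40 "cell patch"; increasing sigma mimics a camera losing focus.
random.seed(0)
patch = [[random.random() for _ in range(40)] for _ in range(40)]
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: remaining detail (variance) = "
          f"{variance(blur_patch(patch, sigma)):.4f}")
```

The variance shrinks as sigma grows: the high-frequency detail that distinguishes one cell type from another is simply gone from the input, which is why the study found that pretraining did not buy extra robustness.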

The Final Verdict

When you have a tiny, low-resolution image (like a single cell):

  • Don't rely on the "Super-Brain" (Foundation Models) if you have enough data. They are too big, too slow, and they can't adapt well to such small details. They are like using a sledgehammer to crack a nut.
  • Do build a "Custom Tool" (Task-Specific Model). If you have a decent amount of data, a model built specifically for tiny images is faster, cheaper, and more accurate.

In short: For looking at tiny cells, a specialized, custom-built AI is the best doctor. The giant, pre-trained AI models are great for big pictures, but they get lost when the view gets too small.