Imagine you are a scientist looking at a massive, complex map under a microscope. This map is filled with tiny cities (cells), roads (tissues), and buildings (organelles). Your job is to do two things:
- Pixel Classification: Color-code the map. "This area is a road, that area is a building, and this patch is a park."
- Object Classification: Identify the specific buildings. "That is a school, that is a hospital, and that is a factory."
For a long time, scientists have done this by hand-picking specific rules (like "roads are gray and straight") and feeding them into a simple computer program. It works, but it's slow and often misses the nuances.
Recently, a new generation of "Super-Intelligent AI" models (called Vision Foundation Models or VFMs) has arrived. These are like massive, pre-trained brains that have seen millions of images from the internet. They are amazing at understanding shapes and objects. But the big question was: Can these giant, general-purpose brains help us with these tiny, specific microscope maps without needing to be retrained from scratch?
This paper is the ultimate "road test" to find out.
The Cast of Characters
To solve the problem, the researchers tried two different strategies, using a lineup of different AI models:
1. The "Smart Assistant" (The Models)
- The Generalists: Models like SAM and DINO. These are like a Swiss Army Knife or a general encyclopedia. They know a little bit about everything.
- The Specialists: Models like µSAM and PathoSAM. These are like specialized mechanics who work on only one kind of machine. They were trained specifically on microscopy and biology images.
2. The "Learning Strategies" (How we use the AI)
- Strategy A: The Quick Sketch (Random Forest)
- The Analogy: Imagine you have a giant, smart encyclopedia (the VFM). You ask it to describe a few spots on your map. You then take those descriptions and feed them to a very fast, simple calculator (a Random Forest).
- The Benefit: It's incredibly fast. You can draw a few lines on the screen, and the calculator instantly learns the pattern. It's like teaching a child by showing them three examples.
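The "Quick Sketch" workflow can be sketched in a few lines. This is a hedged illustration, not the paper's exact pipeline: it assumes the VFM's per-pixel embeddings have already been extracted (here they are faked with random numbers so the snippet runs standalone), and the feature dimension and class count are made up.

```python
# Strategy A sketch: train a fast Random Forest on frozen VFM features.
# The random "features" below stand in for real embeddings from a
# foundation model's image encoder (e.g. SAM); dimensions are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Pretend each annotated pixel has a 256-dim VFM embedding (one row per pixel).
n_labeled, feat_dim = 300, 256
features = rng.normal(size=(n_labeled, feat_dim))
labels = rng.integers(0, 3, size=n_labeled)  # 3 classes, e.g. road/building/park

# The "fast calculator": learns the pattern from a handful of scribbles.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)

# Classify every pixel of a new image (here: 1000 fake pixels).
new_pixels = rng.normal(size=(1000, feat_dim))
pred = clf.predict(new_pixels)  # one class label per pixel
```

Because the Random Forest is cheap to retrain, every new scribble the scientist draws can trigger a full refit in well under a second, which is what makes the interactive loop possible.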
- Strategy B: The Deep Dive (Attentive Probing - DeAP/ObAP)
- The Analogy: Instead of just asking for a description, you hook a tiny, specialized neural network (a "probe") directly into the giant AI's brain. You let this probe "look" at the map through the AI's eyes and learn to make decisions.
- The Benefit: It's much smarter and more accurate, but it takes longer to "think" and train. It's like hiring a senior expert to study the map for an hour before giving you an answer.
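To make the "probe" idea concrete, here is a minimal NumPy forward pass of attentive pooling: a small learnable query attends over the frozen VFM's patch tokens and pools them into a single vector for classification. All names, shapes, and the single-query design are illustrative assumptions, not the paper's exact DeAP/ObAP architecture.

```python
# Strategy B sketch: an attentive probe over frozen foundation-model tokens.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim, n_classes = 196, 256, 4  # e.g. a 14x14 patch grid (illustrative)

tokens = rng.normal(size=(n_tokens, dim))  # frozen VFM output (never trained)
query = rng.normal(size=(dim,))            # learnable query of the probe
W_cls = rng.normal(size=(dim, n_classes))  # learnable classifier head

# Attention: the query scores every token; softmax turns scores into weights.
scores = tokens @ query / np.sqrt(dim)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Pool: weighted average of tokens -> one image-level vector, then classify.
pooled = weights @ tokens        # shape (dim,)
logits = pooled @ W_cls          # one score per class
```

Only `query` and `W_cls` would be trained; the giant model underneath stays frozen, which is why the probe can learn from so few labeled examples.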
The Race: What Happened?
The researchers tested these combinations on five different types of microscopic maps (from cancer tissue to tiny flatworms). Here is what they found:
1. The "Quick Sketch" (Random Forest) is the King of Speed
- If you need to work interactively (drawing on the screen and seeing results instantly), this is the winner.
- The Surprise: Even the "Generalist" models (like SAM) worked better than the old, hand-crafted rules. But the Specialist models (µSAM) were the absolute champions here. They were the perfect fit for the job.
- Result: You get high accuracy with very little effort.
2. The "Deep Dive" (Attentive Probing) is the King of Quality
- If you have time to wait a bit for the computer to train, this method is unbeatable.
- The Surprise: The Generalist model SAM2 (the newer, video-capable version) crushed the competition here, even beating the specialized models. It seems that for deep learning, being a "generalist" with a huge brain is better than being a narrow specialist.
- The Magic: This method was so good that it could learn to classify cells perfectly using only 100 annotated examples. To put that in perspective, a traditional deep learning model might need 100,000 examples to do the same job. It's like learning to recognize a cat after seeing it once, instead of seeing it a thousand times.
3. The "DINO" Model
- This model (DINOv3) was a bit of a disappointment. It's like a brilliant philosopher who knows everything about art but gets confused when looking at a microscope slide. It didn't perform as well as the others.
The Big Takeaway
The paper gives us a clear roadmap for the future of microscope analysis:
- For the "I need it now" scientist: Use a Specialist Model (µSAM) combined with a Quick Sketch (Random Forest). It's fast, interactive, and surprisingly smart.
- For the "I need the best possible result" scientist: Use the Generalist Model (SAM2) combined with the Deep Dive (Attentive Probing). It requires more computing power, but it can learn from tiny amounts of data and produce results that are better than even the most expensive, fully-trained AI systems.
In short: We no longer need to build a new, massive AI from scratch for every new microscope experiment. We can just borrow a "Super-Brain" (Foundation Model), give it a tiny nudge (a few annotations), and it can solve the puzzle for us. This turns a months-long project into a few hours of work.