Imagine you are trying to identify a specific animal in a blurry, distant photo. How you look at that photo changes everything. Do you squint and look at the whole picture as one big blob? Or do you zoom in, piece by piece, to see the texture of the fur, the shape of the ear, and the curve of the tail?
This is exactly what this research paper is about, but instead of animals, they are looking at medical images (like X-rays, CT scans, and MRIs) using a type of AI called a Vision Transformer (ViT).
Here is the breakdown of their discovery in simple terms:
The Problem: The "Zoom Level" Dilemma
In the world of AI, a Vision Transformer works by chopping an image into small squares called "patches." Think of these patches like tiles in a mosaic.
- Large Patches: Imagine each tile is a 14x14-pixel square. A tile that big captures the general shape of the object, but the tiny details get averaged away.
- Small Patches: Imagine every single pixel is its own 1x1 tile. You see every tiny detail, but now there are tens of thousands of tiles to process.
For a long time, most researchers just picked a "standard" tile size (usually 14x14 pixels) and didn't ask: "What if we used smaller tiles? Would that help the AI see the disease better?"
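To make the tile idea concrete, here is a minimal NumPy sketch (not from the paper) that chops a square image into non-overlapping tiles and counts how many "tokens" the transformer would have to process at each patch size. The 224x224 input size is an assumption for illustration; the key point is that the token count grows as (image side / patch size) squared.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a square (H, W) image into non-overlapping patch_size x patch_size
    tiles, returning an array of shape (num_patches, patch_size * patch_size)."""
    h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size)
             .transpose(0, 2, 1, 3)                 # group rows/cols of tiles
             .reshape(-1, patch_size * patch_size)  # flatten each tile
    )

# Hypothetical 224x224 input: token count explodes as patches shrink.
image = np.zeros((224, 224), dtype=np.float32)
for p in (28, 14, 7, 1):
    print(f"patch size {p:2d} -> {patchify(image, p).shape[0]:5d} tokens")
# patch size 28 ->    64 tokens
# patch size 14 ->   256 tokens
# patch size  7 ->  1024 tokens
# patch size  1 -> 50176 tokens
```

Going from 28-pixel tiles to 1-pixel tiles turns 64 tokens into 50,176 — the same picture, just sliced much finer.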
The Experiment: A Medical Detective Story
The researchers decided to play detective. They took 12 different medical datasets (some were flat 2D images like X-rays, others were 3D volumes like CT scans) and tested the AI with different "zoom levels" (patch sizes).
They tested patch sizes ranging from 28 (very zoomed out, seeing the whole image as one big chunk) down to 1 (zoomed in so far you see every single pixel).
The Analogy:
Think of the AI as a student taking a test.
- Patch Size 28: The student is given a blurry photo of a tumor and asked, "Is this cancer?" They guess based on the general shape.
- Patch Size 1: The student is given a high-resolution microscope view. They can see the individual cells. They can say, "Yes, this is cancer because I see these specific cell structures."
The Big Discovery: "The Smaller, The Better"
The results were surprising but clear: The AI got significantly better at diagnosing diseases when it looked at smaller patches.
- For 2D Images (like X-rays): Using tiny patches improved accuracy by up to 12.8%.
- For 3D Images (like CT scans): The improvement was massive, up to 23.8%.
Why?
Medical diseases often hide in tiny details. A large patch might miss a small fracture in a bone or a tiny nodule in a lung because it's too "blurry" at that scale. By using smaller patches, the AI can focus on the fine-grained details that actually matter for a diagnosis.
The Catch: The "Fuel" Cost
There is a trade-off.
- Large Patches: The AI is lazy. It processes the image quickly and uses very little computer power (fuel).
- Small Patches: The AI is a hard worker. It has to look at thousands of tiny tiles instead of a few big ones. This requires much more computer power.
The Analogy:
Imagine driving a car.
- Large Patches are like driving on a highway at 60 mph. It's fast and uses little gas.
- Small Patches are like driving through a dense city, stopping at every single intersection to check the traffic lights. You get there more accurately, but you burn way more gas and take longer.
The researchers found that for 3D scans, halving the patch size made the computation roughly 64 times more expensive: half the patch side means 8 times as many tokens in three dimensions, and the transformer's self-attention cost grows with the square of the token count. That's a huge price to pay!
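The 64x figure falls straight out of two facts: token count scales per spatial axis, and self-attention is quadratic in the number of tokens. A tiny sketch (my own arithmetic, not the paper's code):

```python
def attention_cost_ratio(dims: int, p_large: int, p_small: int) -> float:
    """Relative self-attention cost when shrinking the patch side from
    p_large to p_small in an image with `dims` spatial dimensions."""
    token_ratio = (p_large / p_small) ** dims  # tokens multiply along every axis
    return token_ratio ** 2                    # attention is O(tokens^2)

print(attention_cost_ratio(dims=2, p_large=2, p_small=1))  # 2D X-ray: 16.0
print(attention_cost_ratio(dims=3, p_large=2, p_small=1))  # 3D CT:   64.0
```

So halving the patch size is 16x more attention work for a 2D image, but 64x for a 3D volume, which is why the trade-off bites so much harder on CT and MRI scans.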
The "Super-Team" Solution
To get the best of both worlds, the researchers tried a trick called Ensembling.
Imagine you have three doctors:
- Doctor A looks at the image with medium zoom.
- Doctor B looks with high zoom.
- Doctor C looks with extreme zoom.
Instead of picking just one doctor, they asked all three to give their opinion and averaged the results. This "Super-Team" approach often gave the highest accuracy of all, combining the speed of the big patches with the detail of the small ones.
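The "average their opinions" step can be sketched in a few lines. This assumes each model outputs class probabilities (e.g., after a softmax); the patch sizes and numbers below are hypothetical, and the paper's exact ensembling recipe may differ in detail.

```python
import numpy as np

def ensemble_predict(prob_maps: list[np.ndarray]) -> np.ndarray:
    """Average class-probability outputs from several models, then pick
    the most likely class for each sample."""
    avg = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return avg.argmax(axis=-1)

# Hypothetical softmax outputs from three "doctors" (different patch sizes)
# for two samples over three classes.
doctor_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])  # medium zoom
doctor_b = np.array([[0.5, 0.4, 0.1], [0.1, 0.7, 0.2]])  # high zoom
doctor_c = np.array([[0.4, 0.5, 0.1], [0.2, 0.6, 0.2]])  # extreme zoom
print(ensemble_predict([doctor_a, doctor_b, doctor_c]))  # -> [0 1]
```

Note the "second opinion" effect: on the first sample, doctor C alone would have picked class 1, but the averaged vote correctly settles on class 0.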
The Bottom Line
This paper tells us two important things for the future of medical AI:
- Don't settle for the standard settings. If you want an AI to diagnose diseases accurately, you should try "zooming in" (using smaller patches) to catch the tiny details.
- It's a balancing act. You have to decide if you have enough computer power to handle the "zoomed-in" view. If you do, the AI will be a much better doctor.
The researchers also made their code public, so other scientists can try this "zoom-in" strategy on their own medical projects without needing a supercomputer (they managed to do it all on a single, standard graphics card).