Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models

The Big Problem: Finding a Needle in a 3D Haystack

Imagine you are a doctor looking at a 3D MRI scan of a brain. Your job is to find a tiny tumor (the "needle") hidden inside the healthy brain tissue (the "haystack").

Usually, to teach a computer to do this, you need to show it thousands of examples of brains with tumors and thousands of healthy brains. This is like hiring a tutor to teach a student for years. But in medicine, getting thousands of labeled examples is expensive, slow, and often impossible because patient data is private.

Zero-Shot Anomaly Detection is a fancy way of saying: "Can we teach a computer to find the needle without showing it any examples of needles first?"

The Old Way vs. The New Way

The Old Way (2D Slices):
Most previous methods treated the 3D brain like a stack of 2D paper slices (like a loaf of bread). They looked at one slice at a time.

The Flaw: If you look at a single slice of a loaf, you might miss a hole that goes through the whole loaf. You lose the "3D shape" of the problem. Also, existing "smart" AI models (Foundation Models) are great at looking at 2D photos but don't know how to handle 3D volumes.

The New Way (CoDeGraph3D):
This paper introduces a method called CoDeGraph3D. It's a "training-free" system, meaning it doesn't need to be taught with medical data. It just uses a pre-trained "smart eye" (a 2D AI model) and a clever trick to see the whole 3D picture.

How It Works: The "Cube" Analogy

Here is the step-by-step process using a simple metaphor:

1. The "Smart Eye" (The 2D Foundation Model)

Imagine you have a super-intelligent robot that has seen millions of photos of cats, dogs, and cars. It knows what "normal" looks like perfectly. However, it has never seen a 3D brain.

The Trick: Instead of trying to teach the robot 3D, we just show it the brain from three different angles at once: Top-down (Axial), Front-facing (Coronal), and Side-view (Sagittal).
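To make the "three angles" idea concrete, here is a minimal NumPy sketch of how one 3D volume can be read as three stacks of 2D slices. The toy array, the helper name `slices_along`, and the axis order (sagittal, coronal, axial) are assumptions for illustration; real scans store their orientation in the file header, and the paper's own loading code is not shown in this summary.

```python
import numpy as np

# Hypothetical toy volume standing in for an MRI scan.
# Assumed axis order: (x, y, z) = (sagittal, coronal, axial).
volume = np.arange(4 * 5 * 6, dtype=np.float32).reshape(4, 5, 6)

def slices_along(vol, axis):
    """Return the stack of 2D slices taken perpendicular to `axis`.

    The chosen axis is moved to the front, so result[i] is the i-th slice.
    """
    return np.moveaxis(vol, axis, 0)

sagittal = slices_along(volume, 0)  # 4 slices, each of shape (5, 6)
coronal = slices_along(volume, 1)   # 5 slices, each of shape (4, 6)
axial = slices_along(volume, 2)     # 6 slices, each of shape (4, 5)

print(sagittal.shape, coronal.shape, axial.shape)
```

Every voxel appears once in each stack, so the same spot in the brain is seen three times, once per viewing direction.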

2. The "Lego Brick" Strategy (Tokenization)

The brain is huge. If we try to look at every single voxel (the 3D version of a pixel), the computer's brain (memory) will explode.

The Solution: Instead of looking at every voxel, the system chops the brain into small, invisible 3D cubes (like Lego bricks).
It looks at the same spot on the brain from all three angles (Top, Front, Side) and combines those three views into one single "super-token."
Why? This creates a compact, 3D representation that keeps the spatial context (knowing where things are in 3D space) without needing a supercomputer.

3. The "Party Guest" Test (Batch-Based Detection)

Now, imagine you have a room full of 180 different brain scans (a "batch"). The system asks a simple question: "Who looks like everyone else?"

The Normal Guests: Healthy brain parts look very similar to healthy brain parts in other people. If you pick a healthy patch from Brain A, you will find almost identical matches in Brain B, C, and D. They are the "popular kids" at the party.
The Anomalous Guest: A tumor is weird. If you pick a patch with a tumor, it won't match anything in the other healthy brains. It's the "weirdo" at the party who doesn't fit in.

The system calculates a "strangeness score" for every single 3D cube. If a cube has no friends (no similar matches in the other scans), it gets flagged as an anomaly.

4. The "Compression" Trick

To make this math fast enough to run on a normal graphics card (GPU), the system uses a mathematical trick called Random Projection.

Analogy: Imagine you have a very detailed, high-resolution map of a city. It's too big to carry. You take a photo of the map, but you squish it down to a smaller size. Surprisingly, the distances between the main landmarks (the "geometry") stay roughly the same, even though the map is smaller. This lets the computer do the "Party Guest" test super fast without losing the important details.
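The "squished map" claim can be checked numerically. The sketch below projects random feature vectors with a fixed Gaussian matrix and compares pairwise distances before and after. The sizes (`n`, `D`, `k`) and the random data are made up for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 200, 768, 128  # hypothetical: 200 vectors, 768 dims squeezed to 128

X = rng.normal(size=(n, D))               # stand-in for high-dimensional features
R = rng.normal(size=(D, k)) / np.sqrt(k)  # fixed Gaussian random projection matrix
Y = X @ R                                 # projected features, shape (n, k)

def pairwise_dists(A):
    """Euclidean distance between every pair of rows of A."""
    sq = (A * A).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.sqrt(np.maximum(d2, 0.0))

# Compare each pair's distance after projection with its distance before.
iu = np.triu_indices(n, k=1)              # index each unordered pair once
ratio = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]

print("mean ratio:", ratio.mean())
print("min / max ratio:", ratio.min(), ratio.max())
```

A ratio of 1 means a distance survived the projection unchanged. The ratios cluster around 1 and the spread narrows as `k` grows, which is why nearest-neighbor comparisons still make sense in the smaller space.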

Why Is This a Big Deal?

No Training Required: You don't need to feed it thousands of tumor scans. You just plug in the pre-trained AI and the new brain scans. It works immediately.
It Sees in 3D: Unlike older methods that get confused by looking at slices, this method understands the brain as a solid 3D object.
It's Fast and Cheap: It runs on standard computer hardware, making it accessible for hospitals that don't have supercomputers.
It Works: The tests showed it finds tumors better than other "zero-shot" methods and is almost as good as methods that were trained on thousands of examples.

The Catch (Limitations)

The "Lego brick" approach is great, but the bricks are a bit big. If a tumor is tiny (smaller than one of our invisible cubes), the system might miss it or blur it out because the healthy tissue around it "dilutes" the signal. It's like trying to find a single dark grain in a bucket of sand by looking at the bucket one scoop at a time; averaged over a whole scoop, the single grain disappears.
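A back-of-the-envelope calculation shows how strong this dilution can be. The sketch assumes a deliberately simplified model: each voxel contributes a plain number (1 for lesion, 0 for healthy tissue) and a cube's response is their average. The real method averages learned features rather than raw voxels, and the cube side `p = 14` is a hypothetical value, so treat the result only as an intuition.

```python
def diluted_signal(lesion_voxels, p, lesion_value=1.0, healthy_value=0.0):
    """Average response of a p*p*p cube that contains `lesion_voxels` lesion voxels."""
    cube_voxels = p ** 3
    if not 0 <= lesion_voxels <= cube_voxels:
        raise ValueError("lesion must fit inside the cube")
    healthy_voxels = cube_voxels - lesion_voxels
    total = lesion_voxels * lesion_value + healthy_voxels * healthy_value
    return total / cube_voxels

# A 3x3x3-voxel lesion inside a 14x14x14 cube (hypothetical sizes).
small = diluted_signal(3 ** 3, p=14)
# A lesion that fills the whole cube keeps its full signal.
full = diluted_signal(14 ** 3, p=14)

print(f"small lesion keeps {small:.2%} of its signal, full cube keeps {full:.0%}")
```

Under these assumptions the small lesion retains about 1% of its original contrast after averaging, which is easy to lose among normal variation.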

Summary

This paper is like inventing a new way to find a needle in a haystack without ever having seen a needle before. Instead of looking at the haystack slice-by-slice (which is confusing), they chop the haystack into 3D cubes, look at them from three angles, and ask, "Which cube looks weird compared to all the other haystacks?"

It's a simple, robust, and "training-free" way to help doctors spot brain abnormalities faster and more accurately.

1. Problem Statement

Zero-Shot Anomaly Detection (ZSAD) aims to identify abnormalities in medical images without requiring task-specific training data or supervision. While ZSAD has advanced significantly for 2D images, extending it to 3D volumetric medical imaging (specifically Brain MRI) remains a critical challenge due to two main factors:

Lack of 3D Foundation Models: There are no general-purpose 3D foundation models, comparable to 2D models such as DINOv2 or CLIP, that are pre-trained on large-scale volumetric medical data.
Limitations of Existing Approaches:
- Slice-wise methods: Applying 2D models slice-by-slice fails to capture the full 3D volumetric context, leading to incoherent anomaly maps.
- Text-based ZSAD: Vision-language models (e.g., CLIP) struggle with the significant domain gap between natural images and medical scans, and crafting robust clinical text prompts is difficult.
- Batch-based methods: While effective for 2D, naive extensions to 3D generate an excessive number of tokens, making mutual similarity computations computationally intractable and memory-prohibitive.
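The token explosion behind the last point can be illustrated with simple counting. The sketch below compares keeping every 2D patch token from every slice along all three axes with keeping one token per 3D cube. The volume side (224), patch size (14), and the function name `token_counts` are assumptions chosen for round numbers; only the batch size of 180 appears in the source.

```python
def token_counts(side, p, batch):
    """Token counts for a cubic volume of `side` voxels with 2D patches of size p."""
    grid = side // p                       # patches per slice edge
    slicewise_per_axis = side * grid * grid  # one p-by-p patch token per slice position
    slicewise_per_volume = 3 * slicewise_per_axis
    cube_per_volume = grid ** 3            # one token per p*p*p cube
    return {
        "slicewise_per_volume": slicewise_per_volume,
        "cube_per_volume": cube_per_volume,
        "slicewise_batch": slicewise_per_volume * batch,
        "cube_batch": cube_per_volume * batch,
    }

counts = token_counts(side=224, p=14, batch=180)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

With these assumed sizes the naive approach carries 3p = 42 times more tokens than the cube representation, and because batch-based scoring compares tokens with each other, that factor is roughly squared in the similarity computation.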

2. Methodology: CoDeGraph3D

The authors propose CoDeGraph3D, a fully training-free framework that adapts 2D foundation models to 3D MRI volumes using a multi-axis volumetric tokenization strategy. The pipeline consists of the following key stages:

A. Multi-Axis 3D-Patch Tokenization

Instead of processing slices individually, the method constructs localized 3D tokens by aggregating features from three anatomical planes (Axial, Coronal, Sagittal).

Axis-wise Extraction: A frozen 2D foundation model (e.g., DINOv2) processes slices along each of the three axes.
Patch-Aligned Pooling: To restore 3D cubic coherence and reduce token count, features from $p$ consecutive slices are grouped and averaged, where $p$ is the side length of the model's 2D patches. Each resulting 3D patch token represents a $p \times p \times p$ voxel region, effectively downsampling the volume while preserving spatial context.
Random Projection: To further reduce dimensionality and memory load, a fixed Gaussian random matrix projects the $D$-dimensional features into a lower-dimensional space ($k \ll D$, e.g., $k=128$). By the Johnson-Lindenstrauss lemma, this approximately preserves the pairwise distances that anomaly scoring relies on.
Multi-View Fusion: The projected features from all three axes are concatenated at each spatial location to form a unified, semantically rich 3D token.
Background Suppression: A binary brain mask filters out zero-valued background voxels to prevent artificial redundancy in batch statistics.
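The five stages above can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `patch_features` is a hypothetical linear stand-in for the frozen 2D foundation model (a real pipeline would call DINOv2 or similar), all sizes are toy values, volume dimensions are assumed divisible by `p`, and a cube is kept whenever it contains at least one brain voxel, a masking rule the summary does not specify.

```python
import numpy as np

def patch_features(slice_2d, p, W_embed):
    """Stand-in for a frozen 2D model: (H, W) slice -> (H/p, W/p, D) patch features."""
    H, W = slice_2d.shape
    patches = slice_2d.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(H // p, W // p, p * p) @ W_embed

def axis_tokens(volume, axis, p, W_embed, R):
    """Extract features along one axis, pool p consecutive slices, then project."""
    vol = np.moveaxis(volume, axis, 0)                        # slices first
    feats = np.stack([patch_features(s, p, W_embed) for s in vol])
    S = feats.shape[0]
    pooled = feats.reshape(S // p, p, *feats.shape[1:]).mean(axis=1)
    projected = pooled @ R                                    # D -> k dimensions
    return np.moveaxis(projected, 0, axis)                    # back to (x, y, z, k)

def tokenize(volume, mask, p, W_embed, R):
    """Fuse the three views per cube and drop cubes outside the brain mask."""
    fused = np.concatenate(
        [axis_tokens(volume, a, p, W_embed, R) for a in range(3)], axis=-1
    )                                                         # (gx, gy, gz, 3k)
    gx, gy, gz = (s // p for s in volume.shape)
    # Assumed rule: keep a cube if any of its voxels lies inside the brain mask.
    cube_mask = mask.reshape(gx, p, gy, p, gz, p).any(axis=(1, 3, 5))
    return fused[cube_mask], cube_mask

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
p, D, k = 4, 32, 8
volume = rng.random((16, 16, 16)).astype(np.float32)
mask = np.zeros(volume.shape, dtype=bool)
mask[4:12, 4:12, 4:12] = True                                 # fake "brain" region
W_embed = rng.normal(size=(p * p, D))                         # stand-in model weights
R = rng.normal(size=(D, k)) / np.sqrt(k)                      # fixed Gaussian projection

tokens, cube_mask = tokenize(volume, mask, p, W_embed, R)
print(tokens.shape, cube_mask.shape)
```

In this toy run the 16-voxel cube becomes a 4 x 4 x 4 grid of candidate tokens, of which only the cubes overlapping the mask survive, each carrying `3 * k` values (one `k`-dimensional block per viewing axis).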

B. Batch-Based Anomaly Scoring

Once volumes are converted into collections of 3D tokens, the framework applies batch-based anomaly detection (specifically CoDeGraph):

Doppelgänger Assumption: Normal tissue patterns recur across different patients, whereas anomalies are rare and distinctive.
Mutual Similarity Vector (MSV): For each token, the algorithm calculates the distance to its nearest neighbors in other volumes within the batch.
Scoring: Tokens with high nearest-neighbor distances (outliers) are flagged as anomalies. The method handles "consistent anomalies" (recurring pathologies across the batch) by selectively excluding suspicious tokens from the similarity graph to maintain the validity of rarity-based scoring.
Output: The process generates a voxel-level anomaly map, which is resized to the original resolution.
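The rarity-based scoring described above can be sketched as follows. This is a simplified illustration rather than CoDeGraph itself: it builds each token's mutual similarity vector from nearest-neighbor distances to every other volume and averages the smallest entries, but it omits the graph-based handling of consistent anomalies. The `keep_frac` parameter, the function names, and the synthetic tokens are assumptions made for the example.

```python
import numpy as np

def min_dists(A, B):
    """For each row of A, the Euclidean distance to its nearest row in B."""
    d2 = (A * A).sum(axis=1)[:, None] + (B * B).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.sqrt(np.maximum(d2.min(axis=1), 0.0))

def batch_scores(token_sets, keep_frac=0.3):
    """Score every token of every volume by how poorly it matches other volumes."""
    scores = []
    for i, A in enumerate(token_sets):
        # Mutual similarity vectors: one nearest-neighbor distance per other volume.
        msv = np.stack(
            [min_dists(A, B) for j, B in enumerate(token_sets) if j != i], axis=1
        )
        msv.sort(axis=1)
        # Average only the closest matches (hypothetical fraction `keep_frac`).
        m = max(1, int(round(keep_frac * msv.shape[1])))
        scores.append(msv[:, :m].mean(axis=1))
    return scores

# Toy batch: 6 volumes sharing 50 recurring "healthy" patterns plus small noise.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(50, 16))
token_sets = [prototypes + 0.05 * rng.normal(size=prototypes.shape) for _ in range(6)]
token_sets[0][0] += 5.0   # plant one rare, distinctive token in volume 0

scores = batch_scores(token_sets)
print("most anomalous token in volume 0:", int(scores[0].argmax()))
print("its score vs. the next highest:", scores[0][0], np.delete(scores[0], 0).max())
```

Healthy tokens find close matches in the other volumes and receive low scores, while the planted token has no counterpart anywhere and stands out. Scores computed per cube would then be placed back on the token grid and resized to voxel resolution.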

3. Key Contributions

First Training-Free 3D ZSAD Framework: Introduces the first practical batch-based ZSAD framework for 3D brain MRI, successfully extending training-free principles from 2D to volumetric data.
Novel Tokenization Pipeline: Proposes a multi-axis aggregation and random projection strategy that preserves cubic spatial context while drastically reducing token counts, making mutual similarity computations tractable on standard GPUs.
Superior Performance: Demonstrates that this approach outperforms existing CLIP-based ZSAD baselines and, in specific segmentation metrics, matches or exceeds supervised reconstruction-based methods without any fine-tuning.

4. Experimental Results

The framework was evaluated on IXI (healthy) and BraTS-2025 METS (tumor) datasets, covering both T1 and T2-weighted MRI scans.

Quantitative Performance (T2-weighted):
- Patient-level AUROC: CoDeGraph3D achieved 96.9%, significantly outperforming zero-shot CLIP baselines (e.g., WinCLIP at 23.2%, AnomalyCLIP at 36.4%).
- Voxel-level Segmentation (Dice): Achieved 41.3%, compared to <15% for zero-shot CLIP baselines.
- Comparison to Supervised: Supervised methods (trained on BraTS) scored higher overall, but CoDeGraph3D reached segmentation accuracy comparable to supervised CLIP models without any domain-specific training. It also outperformed unsupervised reconstruction models (DAE) in segmentation accuracy.
Efficiency: Processing 180 volumes took ~714 seconds in total (about 4 seconds per volume) on a single NVIDIA RTX 4070 Ti Super, using <10GB VRAM.
Generalization: The method generalized well to other anomaly types, achieving high Dice scores on Glioma (61.0%) and Stroke (31.6%) datasets.
Ablation Studies:
- Random Projection: Performance stabilized at projection dimension $k \ge 50$, indicating that aggressive dimensionality reduction is viable.
- Multi-View: Aggregating views (Axial + Coronal + Sagittal) yielded the best results, though dual-view combinations were nearly as effective.
- Batch Size: The method remains robust even with smaller batch sizes ($B=15$), making it suitable for memory-constrained environments.
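For readers unfamiliar with the voxel-level Dice metric reported in this section, the sketch below shows the standard definition, $2|P \cap T| / (|P| + |T|)$, on two binary masks. The toy masks are invented for illustration; how the paper thresholds its anomaly maps into binary predictions is not described in this summary, so the example starts from masks that are already binary.

```python
import numpy as np

def dice(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Toy example: two 4x4x4 cubes inside an 8x8x8 volume, shifted by one voxel.
pred = np.zeros((8, 8, 8), dtype=bool)
target = np.zeros((8, 8, 8), dtype=bool)
pred[2:6, 2:6, 2:6] = True
target[3:7, 3:7, 3:7] = True

print(f"Dice = {dice(pred, target):.4f}")
```

Here the masks each contain 64 voxels and share 27, giving a Dice of 54/128, roughly 0.42. A one-voxel shift already costs more than half the score at this size, which helps explain why small lesions are hard to segment with high Dice.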

5. Significance and Limitations

Significance:
This work establishes a viable, domain-agnostic path for volumetric anomaly detection. By leveraging frozen 2D foundation models and clever tokenization, it eliminates the need for expensive, domain-specific training data or prompt engineering. It bridges the gap between 2D ZSAD success and the 3D medical imaging reality, offering a simple, robust, and computationally feasible solution.

Limitations:

Localization Granularity: The cubic tokenization strategy (aggregating features over fixed regions) inherently limits the detection of very small, sparse, or low-contrast lesions, as the anomaly signal may be diluted by surrounding healthy tissue during averaging.
Scalability: While tokenization reduces complexity, the underlying cross-sample similarity calculation still scales quadratically with the number of samples and tokens, which could limit applicability to extremely high-resolution volumes or massive test cohorts.
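The quadratic growth mentioned under Scalability can be made concrete by counting distance evaluations. The sketch assumes every token is compared against every token of every other volume, with `n` tokens per volume; the value `n = 4096` and the function name are hypothetical, and practical implementations may prune or approximate this search.

```python
def cross_volume_comparisons(B, n):
    """Distance evaluations when each of B*n tokens is compared with all
    tokens of the other B-1 volumes (n tokens per volume)."""
    return B * (B - 1) * n * n

n = 4096  # hypothetical tokens per volume after cubic tokenization
small_batch = cross_volume_comparisons(15, n)
large_batch = cross_volume_comparisons(180, n)

print(f"B=15:  {small_batch:,} comparisons")
print(f"B=180: {large_batch:,} comparisons")
print(f"growth factor: {large_batch / small_batch:.1f}x for 12x more volumes")
print(f"doubling tokens per volume: {cross_volume_comparisons(15, 2 * n) / small_batch:.0f}x")
```

Going from 15 to 180 volumes multiplies the work by about 153 rather than 12, and doubling the tokens per volume (for example, by halving the cube side in one dimension) multiplies it by 4.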

Conclusion:
CoDeGraph3D represents a significant step forward in medical AI, showing that training-free, batch-based anomaly detection is not only possible but also effective for 3D brain MRI, and offering a practical alternative to data-hungry supervised models.