Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

This paper proposes a memory-efficient unsupervised anomaly detection framework that leverages a 2D autoregressive CNN to explicitly model spatial dependencies in DINOv3 patch embeddings, achieving competitive performance on medical imaging benchmarks while significantly reducing inference time and memory overhead compared to existing prototype-based methods.

Ertunc Erdil, Nico Schulthess, Guney Tombak, Ender Konukoglu

Published 2026-03-04

Imagine you are a security guard at a museum. Your job is to spot fake paintings or damaged artifacts among thousands of real, perfect ones.

The Old Way: The "Memory Bank" Guard

Most current security guards (AI models) work like this:

  1. They spend months memorizing every single detail of every perfect painting in the museum. They create a giant, heavy "memory bank" containing millions of photos of normal art.
  2. When a new painting arrives, the guard pulls out their giant memory bank and compares the new painting to every single photo they memorized, one by one, to see if it looks different.
  3. The Problem: This is incredibly slow. It takes a lot of energy (computer memory) to carry that giant memory bank, and the comparison process is like searching for a needle in a haystack every time a new painting arrives. Also, they often treat each tiny piece of the painting (a patch) as if it has no relationship to its neighbors, which is unnatural.

The New Way: The "Autoregressive" Guard (This Paper)

The authors of this paper, Ertunc Erdil and his team, proposed a smarter, faster way to be a security guard. Instead of memorizing a giant library of photos, they teach the guard to understand how the painting is put together.

Here is how their new method works, using a simple analogy:

1. The "Sentence" of the Image

Imagine a painting isn't just a collection of random dots, but a sentence.

  • In a sentence, the word "The" usually comes before "cat," and "cat" usually comes before "sat." You can't just guess the next word without looking at the previous ones.
  • Similarly, in a medical image (like an MRI of a brain), the texture of the left side of the brain usually tells you what to expect on the right side. They are connected.

The new AI model looks at the image as a sentence. It reads the image from top-left to bottom-right (like reading a book).
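For the code-curious, that raster-order "reading" can be sketched in a few lines. The grid size and embedding dimension below are made up for illustration, not taken from the paper:

```python
import numpy as np

# Toy sketch: flatten a grid of patch embeddings into raster (reading)
# order, top-left to bottom-right, like reading a book.
H, W, D = 4, 4, 8                       # 4x4 patch grid, 8-dim embeddings
patches = np.random.randn(H, W, D)

sequence = patches.reshape(H * W, D)    # row-major reshape = raster order
# sequence[0] is the top-left patch; sequence[-1] is the bottom-right one
```

NumPy's default row-major order does exactly this book-style scan, so no explicit loop is needed.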

2. The "Next-Word" Prediction Game

Instead of memorizing the whole image, the model plays a game: "Given everything I've seen so far, what should the next tiny piece of the image look like?"

  • Step 1: It looks at the first few patches.
  • Step 2: It predicts what the next patch should look like.
  • Step 3: It compares its prediction to the actual patch. If they match, everything is "normal."
  • Step 4: If the actual patch is totally different from what it predicted (e.g., it predicted "healthy brain tissue" but the image shows a "tumor"), the model screams, "ANOMALY!"

3. The "Dilated" Telescope

The authors noticed a problem: Sometimes, the model gets too lazy. It only looks at the patch immediately next to the current one to make a guess. This is like reading a book but only looking at the letter right next to the one you are reading. If there is a weird typo three words away, you might miss it.

To fix this, they added "Dilated Convolutions."

  • Analogy: Imagine the model has a telescope. Instead of just looking at the immediate neighbor, the telescope lets it "skip" a few steps and look at neighbors further away.
  • This helps the model understand the big picture context. It realizes, "Hey, this patch of tissue doesn't fit with the pattern of the whole organ, even if the immediate neighbor looks okay."
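A quick back-of-the-envelope calculation shows why the telescope trick works. The kernel size and dilation rates below are illustrative, not the paper's actual architecture:

```python
def receptive_field(kernel_size, dilations):
    """How many input positions a stack of 1-D convolutions can see."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer adds (k - 1) * dilation
    return rf

print(receptive_field(3, [1, 1, 1]))  # three plain convs see 7 positions
print(receptive_field(3, [1, 2, 4]))  # dilated convs see 15 positions
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth instead of linearly, so the model can take in far-away context without adding many layers.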

Why is this a Big Deal?

  1. Super Fast (The "One-Pass" Magic):

    • Old Way: To check one image, the guard had to search through a massive library (slow!).
    • New Way: The guard just reads the image once, from start to finish, making predictions as it goes. It's like reading a book in one sitting. It takes a fraction of the time and uses very little computer memory.
  2. No Giant Memory Banks:

    • The model doesn't need to store millions of photos of "normal" images. It just needs to store the rules of how to predict the next piece. This makes it tiny and efficient.
  3. Better at Spotting Fakes:

    • Because it understands the relationships between different parts of the image (spatial dependencies), it catches anomalies that other methods miss. It knows that a tumor breaks the "grammar" of the brain's anatomy.

The Results

The team tested this on medical images (brain MRIs, liver CTs, and eye scans).

  • Speed: Their method was often 10 to 50 times faster than the previous best methods.
  • Accuracy: It was just as good (or sometimes better) at finding the anomalies.
  • Efficiency: It ran on standard computer chips without needing massive, expensive supercomputers.

In a Nutshell

Instead of memorizing a giant encyclopedia of what "normal" looks like and then frantically searching through it, this new AI learns the grammar of anatomy. It reads the image like a story, and if the story suddenly makes no sense (an anomaly), it knows immediately. It's faster, lighter, and smarter.