Layer by layer, module by module: Choose both for optimal OOD probing of ViT

Imagine you have a brilliant, world-class chef (the Vision Transformer) who was trained in a massive, high-end kitchen using the finest, most diverse ingredients imaginable (pretraining on ImageNet-21k). This chef can cook almost anything.

Now, you want to hire this chef to work in a specific, smaller restaurant. Sometimes, the restaurant serves the exact same fancy cuisine the chef knows (In-Distribution). Other times, the restaurant is in a different country with weird, local ingredients, or the power is flickering, or the customers have very strange tastes (Out-of-Distribution or OOD).

This paper is about figuring out where in the chef's cooking process you should taste the food to get the best result, depending on how different the new restaurant is from the chef's original training.

Here is the breakdown of their discovery:

1. The "Final Dish" vs. The "Mid-Cooking" Taste

Usually, when we use AI, we look at the very last step of the process—the final dish served on the plate. We assume that's where all the magic happens.

The Old Belief: "The final layer is always the best."
The New Discovery: If the new restaurant is very different from the chef's training (a big Distribution Shift), the final dish often tastes terrible. The chef gets confused by the weird ingredients and messes up the final seasoning.
The Fix: It turns out, if you taste the food mid-cooking (at the intermediate layers), it's often much more reliable. The chef's mid-process instincts are still sharp, even if the final presentation gets messed up by the weird new environment.

Analogy: Think of a student taking a test.

Final Layer: The final answer written on the paper.
Intermediate Layer: The notes and scratch work done in the middle of the exam.
If the test is exactly what they studied for, the final answer is perfect. But if the test is a surprise with weird questions, the student might panic at the end and write the wrong final answer. However, their understanding of the concepts (the middle notes) might still be solid.

2. The "Kitchen Stations" (Modules)

Inside the chef's kitchen (the Transformer block), there are different stations:

The Attention Station (MHA): Where the chef looks at all the ingredients and decides which ones are important.
The Feed-Forward Station (FFN): Where the chef actually chops, mixes, and cooks the ingredients. This station has three steps: Chopping (FC1), Cooking/Activating (Act), and Plating/Compressing (FC2).

The paper found that not all stations are created equal when things get messy:

The "Plating" Station (FC2): This is the worst place to taste if things are going wrong. It's where the chef tries to squeeze everything into a neat, final package. If the ingredients are weird, this squeezing process destroys the flavor.
The "Cooking" Station (Act): This is the hero. When the ingredients are weird (high distribution shift), tasting the food right after the "cooking" step (the activation) gives you the most accurate flavor profile. It captures the essence of the ingredients before they get messed up by the final packaging.
The "Pre-Prep" Station (LN2): If the new restaurant is actually quite similar to the old one (low shift), then tasting the standard block output (RC2), or the normalized ingredients right before the chopping starts (LN2), works fine.

3. The Big Takeaway: "Layer by Layer, Module by Module"

The authors give us two simple rules of thumb for using these AI chefs:

If the new job is familiar (In-Distribution): Stick to the Final Layer. The chef is an expert, and the final dish is perfect.
If the new job is weird or risky (Out-of-Distribution):
- Don't look at the final dish.
- Don't look at the very beginning.
- Look at the "Cooking" step (the Activation) in the middle layers. This is where the AI is most honest and least confused.

Why does this matter?

In the real world, AI models often face "drift"—the data they see changes over time (e.g., a self-driving car seeing snow when it was trained on sunny days, or a medical AI seeing a new type of virus).

If we blindly trust the "Final Layer," the AI might fail silently. But if we know to check the "Intermediate Cooking Layer," we can build systems that are much more robust and reliable, even when the world changes around them.

In a nutshell: When the world gets weird, don't wait for the final answer. Check the work-in-progress; that's where the truth is hiding.

Here is a detailed technical summary of the paper "Layer by Layer, Module by Module: Choose Both for Optimal OOD Probing of ViT".

1. Problem Statement

Foundation models, particularly Vision Transformers (ViTs), are typically pretrained on massive datasets (e.g., ImageNet-21k) and adapted to downstream tasks. A critical challenge arises when these models face Out-of-Distribution (OOD) data, where the distribution of the downstream task differs significantly from the pretraining data.

The Conundrum: Recent studies (e.g., Skean et al., 2025) suggested that intermediate layers of foundation models often yield better representations than the final layer, attributing this to autoregressive pretraining objectives. However, other works contradict this, and the consensus for standard ViTs (trained via supervised or self-supervised objectives) remains unclear.
The Gap: It is unknown whether the superiority of intermediate layers is a byproduct of the pretraining objective (autoregression) or a consequence of distribution shift. Furthermore, standard probing practices typically extract features only from the final output of a transformer block (Residual Connection 2, or RC2), ignoring the internal states of the block's components (Attention, Feed-Forward, Normalization).

2. Methodology

The authors conducted a comprehensive empirical study using an 86M-parameter ViT pretrained on ImageNet-21k.

Experimental Setup:
- Benchmarks: 11 diverse classification datasets: four in-distribution (ID) sets (Cifar10, Cifar100, Flowers102, Pets) and seven OOD sets (five Cifar10-C corruptions: Contrast, Gaussian Noise, Motion Blur, Snow, Speckle Noise; two DomainNet domains: Clipart, Sketch).
- Probing Protocol: Linear probing (logistic regression with L-BFGS) on frozen representations.
- Granularity: The study analyzed representations at two levels:
  1. Layer Level: Probing the output of every transformer block (RC2).
  2. Module Level: Probing the internal activations of specific components within each block:
    - LN1 (LayerNorm before Attention)
    - MHA (Multi-Head Attention output)
    - RC1 (Residual connection after Attention)
    - LN2 (LayerNorm before Feed-Forward)
    - FC1 (First fully connected layer of FFN)
    - Act (Activation function, e.g., GeLU, applied to the output of FC1)
    - FC2 (Second fully connected layer of FFN)
    - RC2 (Standard block output)
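The eight tap points listed above can be made concrete with a toy pre-norm transformer block that records each intermediate result under the same names. This is a minimal sketch, not the authors' code: it uses the standard library only, a single attention head as a stand-in for multi-head attention, random weights, and no biases or learnable LayerNorm parameters. The sizes `d = 8` and `n_tokens = 4`, and every function name, are assumptions chosen for illustration.

```python
import math
import random


def matmul(x, w):
    """Multiply an (n x a) matrix by an (a x b) matrix, both nested lists."""
    return [[sum(xv * wv for xv, wv in zip(row, col)) for col in zip(*w)] for row in x]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def layer_norm(x, eps=1e-6):
    """Per-token normalization without learnable scale or shift."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def gelu(x):
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in x]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, p):
    """Single-head self-attention, used here in place of multi-head attention."""
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = [softmax([s * scale for s in row]) for row in matmul(q, transpose(k))]
    return matmul(matmul(weights, v), p["wo"])


def block_taps(x, p):
    """Run one pre-norm block and return every tap point, in execution order."""
    taps = {}
    taps["LN1"] = layer_norm(x)
    taps["MHA"] = self_attention(taps["LN1"], p)
    taps["RC1"] = add(x, taps["MHA"])
    taps["LN2"] = layer_norm(taps["RC1"])
    taps["FC1"] = matmul(taps["LN2"], p["w1"])   # expands d -> 4d
    taps["Act"] = gelu(taps["FC1"])              # still 4d wide
    taps["FC2"] = matmul(taps["Act"], p["w2"])   # compresses 4d -> d
    taps["RC2"] = add(taps["RC1"], taps["FC2"])  # standard block output
    return taps


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n_tokens = 8, 4  # toy sizes, chosen for illustration only
params = {name: rand_matrix(d, d, rng) for name in ("wq", "wk", "wv", "wo")}
params["w1"] = rand_matrix(d, 4 * d, rng)
params["w2"] = rand_matrix(4 * d, d, rng)

taps = block_taps(rand_matrix(n_tokens, d, rng), params)
# FC1 and Act are 4*d wide; every other tap is d wide.
widths = {name: len(activation[0]) for name, activation in taps.items()}
```

Layer-level probing corresponds to reading only `taps["RC2"]` from each block, while module-level probing reads any of the eight entries. How the per-token activations are pooled into a single feature vector per image is not stated in this summary, so the sketch stops at the token level.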
Comparative Analysis: The authors compared the performance of frozen models (linear probing only) against finetuned models to isolate the effect of distribution shift.
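The probing protocol in this section fits a linear classifier on frozen features and reports its accuracy. The sketch below shows that idea on synthetic data. It is an illustration under stated assumptions: the "features" are two made-up Gaussian blobs rather than real ViT activations, and the classifier is trained with plain batch gradient descent instead of L-BFGS so that the example needs only the standard library. All function names and hyperparameters (`lr`, `steps`) are hypothetical.

```python
import math
import random


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def logits(w, b, x):
    return [sum(wv * xv for wv, xv in zip(wc, x)) + bc for wc, bc in zip(w, b)]


def fit_linear_probe(feats, labels, n_classes, lr=0.5, steps=300):
    """Multinomial logistic regression trained by batch gradient descent."""
    dim, n = len(feats[0]), len(feats)
    w = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        grad_w = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(feats, labels):
            probs = softmax(logits(w, b, x))
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * grad_b[c] / n
            for j in range(dim):
                w[c][j] -= lr * grad_w[c][j] / n
    return w, b


def accuracy(w, b, feats, labels):
    hits = 0
    for x, y in zip(feats, labels):
        scores = logits(w, b, x)
        hits += int(max(range(len(scores)), key=scores.__getitem__) == y)
    return hits / len(labels)


rng = random.Random(0)


def blob(center, n):
    """Synthetic stand-in for frozen features of one class."""
    return [[rng.gauss(c, 0.3) for c in center] for _ in range(n)]


train_x = blob([-1.0, -1.0], 20) + blob([1.0, 1.0], 20)
train_y = [0] * 20 + [1] * 20
test_x = blob([-1.0, -1.0], 10) + blob([1.0, 1.0], 10)
test_y = [0] * 10 + [1] * 10

w, b = fit_linear_probe(train_x, train_y, n_classes=2)
test_acc = accuracy(w, b, test_x, test_y)
```

Because the backbone stays frozen, only `w` and `b` are learned. Repeating the fit once per layer and module gives the grid of probe accuracies that the findings compare.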

3. Key Findings & Results

A. Distribution Shift is the Primary Driver

ID vs. OOD: In In-Distribution (ID) settings, the final layer consistently yields the best performance. However, as the distribution shift increases (moving from ID to severe OOD), the performance of the final layers degrades significantly.
Intermediate Robustness: In OOD settings, intermediate layers outperform the final layers. The deeper the layer, the more specialized its features become for the pretraining distribution, making them less robust to drift.
Conclusion: The benefit of intermediate layers is not merely a byproduct of autoregressive pretraining but a direct consequence of distribution shift.

B. Module-Level Analysis (The "Module by Module" Insight)

The authors found that probing the standard block output (RC2) is suboptimal for OOD tasks. The optimal module depends on the severity of the shift:

Under Significant Distribution Shift (Strong OOD):
- Best Module: Act (the activation output of the first FC layer, i.e., $\text{FC1} \rightarrow \text{GeLU}$).
- Worst Module: FC2 (the second FC layer).
- Reasoning: FC1 expands the dimensionality ($d \to 4d$), and the subsequent activation helps in feature disentanglement and filtering noise. FC2 compresses the dimension back ($4d \to d$), which may destroy linear separability and discard semantic information useful for the downstream task.
- Performance: Act significantly outperforms RC2 and other modules in high-shift scenarios (e.g., Sketch, Clipart, Speckle Noise).
Under Weak/No Distribution Shift (ID):
- Best Module: LN2 (LayerNorm before the FFN) or RC2.
- Reasoning: When the data distribution matches the pretraining, the standard block output (RC2) or the normalized input to the FFN (LN2) retains the most refined semantic information.
Stability:
- LN2 and RC2 provide more stable performance across varying depths compared to the volatile performance of FC2 and Act in the final layers.
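The expand-then-compress reasoning given for the strong-shift case can be written out for a single token. The following is a sketch assuming a standard pre-norm block; the symbols $u$, $z$, $W_1$, $W_2$, $b_1$, $b_2$ are notation introduced here and are not taken from the paper.

```latex
\begin{aligned}
u &= \mathrm{LN2}(x_{\mathrm{RC1}}) \in \mathbb{R}^{d} \\
z_{\mathrm{FC1}} &= W_1 u + b_1 \in \mathbb{R}^{4d}, \qquad W_1 \in \mathbb{R}^{4d \times d} \\
z_{\mathrm{Act}} &= \mathrm{GeLU}(z_{\mathrm{FC1}}) \in \mathbb{R}^{4d} \\
z_{\mathrm{FC2}} &= W_2 z_{\mathrm{Act}} + b_2 \in \mathbb{R}^{d}, \qquad W_2 \in \mathbb{R}^{d \times 4d} \\
x_{\mathrm{RC2}} &= x_{\mathrm{RC1}} + z_{\mathrm{FC2}} \in \mathbb{R}^{d}
\end{aligned}
```

A linear probe on Act therefore sees four times as many features per token as a probe on FC2 or RC2. If the 86M-parameter model is a standard ViT-Base (an assumption; the summary does not name the variant), this means $d = 768$ and $4d = 3072$.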

4. Key Contributions

Reframing the "Intermediate Layer" Phenomenon: The paper establishes that the superiority of intermediate layers in ViTs is driven by distribution shift, not just the pretraining objective (autoregression vs. supervised).
Fine-Grained Module Probing: It introduces a novel analysis that probes specific transformer components rather than just block outputs. It identifies probing the Feed-Forward Network activation (Act) in intermediate layers as the optimal strategy for OOD robustness.
Actionable Guidelines:
- For ID tasks: Probe the final layer (RC2).
- For OOD tasks: Probe the Act (activation) of intermediate layers.
- Safe Default: If the shift is unknown, probing LN2 is a safer alternative to the standard RC2, as it offers a balance between stability and performance.
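The three guidelines above amount to a small lookup from the expected shift to a probing location. A minimal sketch of that lookup follows. The key names (`"id"`, `"ood"`, `"unknown"`) and the function name are hypothetical, and the depth for the safe default is left as `None` because the guidelines do not specify it.

```python
# Transcription of the three guidelines as (module, depth) pairs.
# Key names are made up for this sketch.
GUIDELINES = {
    "id": ("RC2", "final layer"),
    "ood": ("Act", "intermediate layers"),
    "unknown": ("LN2", None),  # depth is not specified by the guidelines
}


def choose_probe(shift):
    """Return the (module, depth) pair suggested for an expected shift level."""
    try:
        return GUIDELINES[shift]
    except KeyError:
        raise ValueError(f"shift must be one of {sorted(GUIDELINES)}") from None


module, depth = choose_probe("ood")
```

In practice the shift level is rarely known in advance, which is why the safe default exists. When a labeled validation set is available, probing several layers and modules and comparing their accuracy is a more direct way to choose.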

5. Significance

Practical Impact: This work provides a concrete, low-cost strategy (linear probing) to improve the reliability of foundation models in real-world scenarios where distribution shifts are inevitable. It suggests that simply changing where and what to probe within a frozen model can yield significant accuracy gains without retraining.
Theoretical Insight: It challenges the assumption that the final layer is always the "most informative." It highlights the role of the Feed-Forward Network's expansion phase (FC1 + Act) in preserving semantic information against noise and distribution drift, offering new directions for understanding transformer internal representations (e.g., via information-theoretic or geometric measures).
Future Directions: The findings suggest that future OOD detection and adaptation methods should focus on intermediate activations and specific module outputs rather than relying solely on the final-layer representation.
