Imagine a Large Language Model (like the ones powering chatbots) as a giant, multi-story library.
- The Ground Floor (Early Layers): This is where the raw materials are stored. It's full of general knowledge, grammar rules, and basic facts about the world. It's very stable and rarely changes.
- The Middle Floors (Middle Layers): This is the "thinking room." It's where the library takes those raw facts and starts connecting them, reasoning, and organizing ideas. It's a busy, flexible, but sturdy place.
- The Penthouse (Final Layers): This is the "output stage." It's where the final answer is written down and handed to you. It's very sensitive and changes quickly to match the specific request you just made.
The Problem: The "Renovation Disaster"
When we want to teach this library to follow instructions better (a process called Supervised Fine-Tuning or SFT), we usually renovate the entire building. We hire workers to tweak the ground floor, the middle floors, and the penthouse all at once.
The paper argues that this is a bad idea. It's like trying to fix a leaky faucet in the penthouse by also tearing up the foundation.
- The Risk: When you renovate the whole building, you risk accidentally destroying the original blueprints (the "pre-trained knowledge"). This is called Catastrophic Forgetting. The library might forget how to speak English properly just because it's trying to learn how to answer math questions.
- The Waste: Most of the workers on the ground floor are just standing around doing nothing useful. They don't need to change.
The Discovery: Where the Magic Happens
The authors of this paper acted like building inspectors. They used special tools to measure exactly what happens to the library's "brain" during this renovation. They looked at three things:
- Information Flow: How much information each layer compresses or passes along as representations move upward?
- Geometry: How are the model's internal representations of ideas shifting in space?
- Weight Changes: How much are the workers actually moving the furniture?
They found a surprising pattern:
- The Ground Floor barely moves. It stays the same.
- The Penthouse goes crazy. It changes drastically to fit the new instructions, but it's also where the "forgetting" happens.
- The Middle Floors (20% to 80% up) are the sweet spot. This is where the model actually learns to follow instructions without forgetting its original knowledge. It's the "Goldilocks zone"—not too rigid, not too chaotic.
The Solution: "Mid-Block Efficient Tuning"
Instead of renovating the whole library, the authors propose a new strategy: Only renovate the Middle Floors.
They call this Mid-Block Efficient Tuning.
- How it works: They freeze the ground floor (keep the base knowledge safe) and freeze the penthouse (keep the output style stable). They only let the workers touch the middle 20% to 80% of the building.
- The Result: It's like hiring a specialized team just for the "thinking room."
- The library learns to follow instructions much faster.
- It makes fewer mistakes (like hallucinations or forgetting facts).
- It uses less money and energy because fewer workers are needed.
The Analogy in Action
Think of it like teaching a seasoned chef (the Base Model) to cook a new specific dish (Instruction Following).
- Old Way (Full Fine-Tuning): You let the chef retrain every habit at once, from knife grip to heat control, just to learn this one dish. Along the way they might drift so far that they forget how to chop onions.
- New Way (Mid-Block Tuning): You tell the chef, "Keep your knife skills and knowledge of heat exactly as they are. Just tweak your plating and recipe adjustments in the middle of the process." The chef learns the new dish perfectly without losing their culinary soul.
Why This Matters
The paper proves that alignment isn't spread evenly throughout the model. It's localized. By finding the specific "middle block" where the magic happens, we can make AI smarter, cheaper to train, and less likely to forget what it already knows.
In short: Don't remodel the whole house to fix the kitchen. Just focus on the kitchen, and leave the foundation alone.