The Big Picture: A New Room in an Old House

Imagine a massive, highly intelligent library (the Base Model) that already knows how to write, code, and reason. This library has a specific way of organizing its books and thoughts, which researchers call its "internal geometry."

Now, imagine you want to teach this library a new skill, like writing in a specific style or following new safety rules. Instead of rebuilding the whole library, you add a small, temporary annex to it. This is LoRA (Low-Rank Adaptation). It's a lightweight "adapter" that sits on top of the original library to tweak its behavior without changing the original books.

The Problem: We know the annex changes what the library says, but we don't really know how it changes the library's internal thinking. Does the annex just rearrange the existing books, or does it build a completely new, invisible wing that the original library's map doesn't show?

The Experiment: The "Delta" Detective

The researchers wanted to see exactly what this annex (the LoRA adapter) was doing inside the library.

The "Before and After" Photo: They took a snapshot of the library's thoughts before adding the annex ( $h_{base}$ ) and another snapshot after adding it ( $h_{adapted}$ ).
The "Difference" ( $h_\Delta$ ): They subtracted the "before" photo from the "after" photo. The result, called the Delta, is the pure "ghost" of the adapter. It shows only what the new annex added, stripping away everything the original library already knew.
The Translator (Sparse Autoencoder): To understand this "ghost," they used a special tool called a Sparse Autoencoder (SAE). Think of an SAE as a translator that tries to describe complex thoughts using a specific dictionary of simple, clear concepts (like "happiness," "math," or "danger").

The Discovery: Two Different Languages

The researchers compared two dictionaries for the translator, one that already existed and one they trained themselves:

Dictionary A: The original library's existing concepts (Pre-trained SAE).
Dictionary B: A new dictionary trained specifically on the "ghost" of the annex (Delta SAE).

Here is what they found:

1. The Translator Failed with the Old Dictionary

When they tried to describe the annex's thoughts using the original library's dictionary, the translation failed.

The Analogy: Imagine trying to describe a new type of alien fruit using only words for apples and oranges. You can't do it. The error was larger than the thing being described, so the translation was worse than saying nothing at all.
The Result: The original dictionary was blind to the new features the adapter created.

2. The New Dictionary Worked Much Better

When they used the new dictionary (trained specifically on the annex), it described the thoughts far more accurately, cutting the error by roughly 46% to 86%.

The Analogy: They realized the annex was speaking its own dialect. Once they learned that specific dialect, far more of it made sense.
The Result: The adapter creates its own unique "feature space" that is geometrically distinct from the original model.

3. The "Ghost" Lives in a Different Room

The researchers measured the angle between the original library's thoughts and the adapter's thoughts.

The Analogy: If the original library's thoughts were pointing North, the adapter's thoughts were pointing most of the way toward West (about 74 degrees apart, where 90 degrees would be fully perpendicular). They are not just slightly different; they run in a largely unrelated direction.
The Result: Across every adapter size tested (changing the "rank" or size of the annex), it always built this separate, distinct room.

Why This Matters (According to the Paper)

The paper highlights a specific "monitoring gap" regarding safety:

The Blind Spot: If you train a safety filter on the original library (the base model) and then attach a safety adapter (LoRA), the safety tools might be looking at the wrong map. They are checking the original library's "North," while the adapter is operating in "West."
The Risk: Because the adapter's internal changes are so different from the base model, standard safety checks might miss dangerous behaviors that the adapter introduces. The adapter is effectively hiding in a room the safety inspectors can't see.

Summary of Key Findings

LoRA isn't just a tweak; it's a new structure. It creates features that the original model's dictionary cannot see.
Size doesn't change the direction. Whether the adapter is small or large, it always builds this separate, distinct "room."
We need new maps. To understand or audit these adapted models, we can't just use the tools built for the original model. We need to build new tools (like the "Delta SAE") that specifically look at what the adapter adds.

In short: The adapter doesn't just rearrange the furniture in the original house; it builds a new, invisible wing that requires its own unique blueprint to understand.

Technical Summary: Feature Geometry of LoRA Adapters

Problem Statement

While Low-Rank Adaptation (LoRA) is the dominant method for fine-tuning Large Language Models (LLMs), the internal representational changes it induces remain poorly understood. Existing mechanistic interpretability tools, specifically Sparse Autoencoders (SAEs), have been successfully applied to base models and RLHF-tuned variants to decompose residual stream activations into sparse, monosemantic features. However, these tools are typically applied to the full adapted model output, conflating base model representations with adapter-specific contributions.

This lack of granularity creates a critical gap: if LoRA adapters operate in representational subspaces that base-model interpretability tools cannot "see," safety audits and alignment analyses of fine-tuned models may be systematically incomplete. Furthermore, the mechanistic reasons why safety fine-tuning can be easily undone by subsequent adaptation remain unexplored at the feature level.

Methodology: The Delta SAE Framework

To isolate the specific contribution of LoRA adapters, the authors introduce a Delta Activation Framework. Instead of analyzing the full adapted activation ( $h_{adapted}$ ), the study focuses on the activation delta:
$h_\Delta = h_{adapted} - h_{base} = \frac{\alpha}{\sqrt{r}} BAx$
This delta represents the exact, mechanistically clean contribution of the adapter, free from the base model's signal.
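The subtraction above can be checked on a single adapted linear layer. The following is a minimal sketch with toy sizes, not the authors' code: the names `W`, `A`, `B`, and `scale` are illustrative, and the scaling follows the formula as written here ( $\alpha/\sqrt{r}$, the rank-stabilized variant; the original LoRA formulation uses $\alpha/r$ ). The closed form holds exactly for one layer; in a multi-layer model the residual-stream difference can also reflect changes introduced by adapted layers upstream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, alpha = 64, 8, 16.0            # toy sizes, not the paper's

W = rng.normal(size=(d_model, d_model))       # frozen base weight
A = rng.normal(size=(rank, d_model)) * 0.01   # LoRA down-projection
B = rng.normal(size=(d_model, rank)) * 0.01   # LoRA up-projection
x = rng.normal(size=d_model)                  # one input vector

scale = alpha / np.sqrt(rank)                 # scaling as written in the formula above

h_base = W @ x
h_adapted = W @ x + scale * (B @ (A @ x))
h_delta = h_adapted - h_base

# Subtracting the base activation leaves only the adapter's contribution.
matches_closed_form = np.allclose(h_delta, scale * (B @ A @ x))

# The delta lies in the column space of B, so it spans at most `rank` directions.
delta_rank = np.linalg.matrix_rank(scale * (B @ A))
print(matches_closed_form, delta_rank)
```

Because the update matrix is a product of a `d_model x rank` and a `rank x d_model` factor, the delta for this layer is confined to a subspace of dimension at most `rank`, which is why rank is the natural variable to vary.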

The experimental pipeline involves:

Model Setup: Using Gemma-2-9B as the base model. Four LoRA adapters were trained with ranks $r \in \{4, 8, 16, 32\}$ on the Alpaca dataset (10,000 samples), with all other hyperparameters fixed to isolate rank as the variable.
Delta Extraction: Forward hooks captured residual stream activations at six target layers (5, 10, 18, 22, 32, 38) for both base and adapted models to compute $h_\Delta$ .
Delta SAE Training: Dedicated SAEs were trained exclusively on the normalized $h_\Delta$ vectors for each (rank, layer) pair. These were compared against pre-trained Gemma Scope SAEs (trained on the base model's residual stream).
Geometric Analysis: Three complementary measures were used to evaluate alignment between the adapter-induced features and the base model features:
- Cosine Similarity: Maximum similarity between delta SAE decoder directions and Gemma Scope feature directions.
- Principal Angle Analysis: Angles between the top-256 dimensional subspaces of the delta SAE and Gemma Scope decoder matrices.
- Centered Kernel Alignment (CKA): Measuring representational similarity between $h_{base}$ and $h_\Delta$ activation sets.
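The three measures listed above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: decoder matrices are taken to have one feature direction per row, the "top-k subspace" is assumed to be the span of the top-k right singular vectors of each decoder matrix, CKA is the linear variant, and all function names and toy sizes are hypothetical.

```python
import numpy as np

def unit_rows(M):
    """Normalize each row to unit length."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def max_cosine_similarity(delta_dec, base_dec):
    """For each delta feature, the highest cosine with any base feature."""
    sims = unit_rows(delta_dec) @ unit_rows(base_dec).T   # (n_delta, n_base)
    return sims.max(axis=1)

def principal_angles_deg(delta_dec, base_dec, k):
    """Principal angles (degrees) between the top-k subspaces of two decoders."""
    _, _, vt_a = np.linalg.svd(delta_dec, full_matrices=False)
    _, _, vt_b = np.linalg.svd(base_dec, full_matrices=False)
    Qa, Qb = vt_a[:k].T, vt_b[:k].T                       # orthonormal bases, (d_model, k)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)    # cosines of the principal angles
    return np.degrees(np.arccos(np.clip(sigma, -1.0, 1.0)))

def linear_cka(X, Y):
    """Linear CKA between two activation sets with matching rows (tokens)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Toy stand-ins for decoder matrices and activations (random, for shape only).
rng = np.random.default_rng(0)
delta_dec = rng.normal(size=(100, 32))   # hypothetical delta SAE decoder
base_dec = rng.normal(size=(200, 32))    # hypothetical base SAE decoder
h_base = rng.normal(size=(500, 32))
h_delta = rng.normal(size=(500, 32))

cos = max_cosine_similarity(delta_dec, base_dec)
angles = principal_angles_deg(delta_dec, base_dec, k=8)
cka = linear_cka(h_base, h_delta)
print(cos.mean(), angles.mean(), cka)
```

The three measures answer different questions: cosine similarity compares individual feature directions, principal angles compare whole subspaces, and CKA compares activation sets without reference to any dictionary.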

Key Results

1. Failure of Base SAEs to Reconstruct Adapter Signals

When Gemma Scope (base model) SAEs were used to reconstruct $h_\Delta$, the relative reconstruction error exceeded 1.0 across all layers and ranks. An error above 1.0 means the reconstruction residual is larger than the adapter signal itself, so the base dictionary does worse than simply outputting zero. The error was most severe in early layers (Layer 5, $\epsilon \approx 2.3$ ) and improved slightly with depth, but remained above 1.0.
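To make the 1.0 threshold concrete, here is a small sketch of a relative reconstruction error. The exact definition used in the paper is not given here, so the squared-norm ratio below is an assumption, and `relative_error` is a hypothetical helper. Under this definition, an all-zero reconstruction scores exactly 1.0, and a reconstruction that injects unrelated signal scores above it.

```python
import numpy as np

def relative_error(h, h_hat):
    """Squared reconstruction residual divided by squared signal norm (assumed definition)."""
    return np.sum((h - h_hat) ** 2) / np.sum(h ** 2)

rng = np.random.default_rng(0)
h_delta = rng.normal(size=(512, 64))                  # toy stand-in for delta activations

zero_recon = np.zeros_like(h_delta)                   # "say nothing" baseline
noisy_recon = 1.2 * rng.normal(size=h_delta.shape)    # output unrelated to the signal

eps_zero = relative_error(h_delta, zero_recon)
eps_noisy = relative_error(h_delta, noisy_recon)
print(eps_zero, eps_noisy)
```

Read this way, a value such as $\epsilon \approx 2.3$ says that the dictionary is adding more unrelated signal than it is removing error.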

2. Superiority of Adapter-Specific SAEs

SAEs trained specifically on $h_\Delta$ significantly outperformed the base SAEs on held-out data. Reconstruction improvements ranged from 46.3% to 86.2%, demonstrating that LoRA adapters learn genuine, generalizable structures that are not captured by the base model's feature dictionary.
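For readers unfamiliar with the tool, the following sketches the forward pass and objective of a basic sparse autoencoder of the kind that could be trained on $h_\Delta$. It is a minimal sketch under assumptions: the summary does not specify the delta SAE architecture, so a plain ReLU encoder with an L1 sparsity penalty is used (Gemma Scope SAEs use a JumpReLU activation instead), and `TinySAE`, its sizes, and `l1_coeff` are hypothetical. No training loop is included.

```python
import numpy as np

class TinySAE:
    """Minimal ReLU sparse autoencoder: forward pass and loss only."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
        self.W_dec = self.W_enc.T.copy()      # decoder rows are feature directions
        self.b_enc = np.zeros(n_features)
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        # Non-negative feature activations; most should be zero after training.
        return np.maximum(0.0, (h - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruction is a weighted sum of decoder directions.
        return f @ self.W_dec + self.b_dec

    def loss(self, h, l1_coeff=1e-3):
        f = self.encode(h)
        recon = np.mean(np.sum((h - self.decode(f)) ** 2, axis=1))
        sparsity = np.mean(np.sum(np.abs(f), axis=1))
        return recon + l1_coeff * sparsity

sae = TinySAE(d_model=64, n_features=256)
h = np.random.default_rng(1).normal(size=(32, 64))   # toy stand-in for normalized deltas
f = sae.encode(h)
h_hat = sae.decode(f)
print(f.shape, h_hat.shape, sae.loss(h))
```

The rows of `W_dec` are the "decoder directions" that the geometric comparisons operate on, which is why two dictionaries trained on different signals can be compared directly in the model's residual space.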

3. Geometric Divergence

Three independent analyses confirmed that LoRA features occupy a geometrically distinct subspace:

Cosine Similarity: The mean maximum cosine similarity between delta features and base features was ~0.071, barely above the chance level for random directions in 3,584 dimensions, where the expected cosine between any single pair is ~0. Only 0.01–0.02% of delta features showed strong alignment (>0.7) with base features.
Principal Angles: The mean principal angle between the subspaces was ~74°, with 0% of directions showing alignment (<20°). Approximately 66% of the subspace was near-orthogonal (>70°).
CKA: The CKA between $h_{base}$ and $h_\Delta$ was lowest at Layer 18 (described as the semantic processing layer), dropping to ~0.05–0.08, which indicates that representational divergence peaks where semantic processing is concentrated.
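One way to put the cosine figure in context is to estimate what a maximum over many unrelated directions looks like. A single random pair has expected cosine near zero, but the largest of many such cosines is slightly positive. The sketch below uses a standard Gaussian approximation, $\sqrt{2 \ln N / d}$, as an upper estimate and checks it with a small simulation. The dictionary width of 16,384 is a hypothetical value chosen for illustration; the width of the Gemma Scope dictionaries used in the study is not stated here.

```python
import math
import numpy as np

def expected_max_cosine(n_candidates, dim):
    """Gaussian upper estimate of the largest cosine among n_candidates random unit vectors."""
    return math.sqrt(2.0 * math.log(n_candidates) / dim)

def simulated_mean_max_cosine(n_queries, n_candidates, dim, seed=0):
    """Average, over random queries, of the best cosine against random candidates."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_queries, dim))
    c = rng.normal(size=(n_candidates, dim))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    c /= np.linalg.norm(c, axis=1, keepdims=True)
    return float((q @ c.T).max(axis=1).mean())

# Toy sizes so the simulation runs quickly.
sim = simulated_mean_max_cosine(n_queries=200, n_candidates=4096, dim=256)
est = expected_max_cosine(4096, 256)

# Same estimate at the residual width quoted above, with a hypothetical dictionary width.
baseline = expected_max_cosine(16_384, 3_584)
print(sim, est, baseline)
```

Under these assumptions the estimate at 3,584 dimensions comes out near 0.07, so a mean maximum of ~0.071 would be on the order of what unrelated directions produce. This is only indicative, since the result depends on the assumed dictionary width.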

4. Rank and Depth Effects

Feature Density: The number of active features per token increased monotonically with both layer depth and LoRA rank. For example, at Layer 38, rank 4 activated ~30 features/token, while rank 32 activated ~41.
Geometric Stability: Despite changes in density and capacity, the fundamental geometric novelty (measured by principal angles and cosine similarity) remained rank-invariant. All ranks produced representations that were geometrically separated from the base model.
Weakly Aligned Features: Over 93% of features activated by $h_\Delta$ were "weakly aligned" (active only on the delta, not the base), a fraction that remained consistent across all ranks and layers.
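Feature density in the sense used above is commonly measured as the mean number of nonzero SAE features per token (often called L0). A minimal sketch, assuming a matrix of feature activations with one row per token; `mean_l0` and the toy values are illustrative only.

```python
import numpy as np

def mean_l0(features, threshold=0.0):
    """Average number of features per token whose activation exceeds `threshold`."""
    active = features > threshold
    return float(active.sum(axis=1).mean())

# Four tokens, six features (toy activations).
features = np.array([
    [0.0, 1.2, 0.0, 0.4, 0.0, 0.0],   # 2 active
    [0.7, 0.0, 0.3, 0.0, 0.9, 0.0],   # 3 active
    [0.0, 0.0, 0.0, 0.0, 0.0, 2.1],   # 1 active
    [0.5, 0.0, 0.0, 0.0, 0.6, 0.0],   # 2 active
])

density = mean_l0(features)   # (2 + 3 + 1 + 2) / 4 = 2.0
print(density)
```

Figures such as ~30 versus ~41 features per token would correspond to this quantity computed over the delta SAE activations at a given layer and rank.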

Significance and Claims

The paper claims to provide the first systematic mechanistic analysis of LoRA feature geometry. The primary contribution is the identification of a "monitoring gap": interpretability tools trained solely on base model activations are systematically blind to the representational contributions of LoRA adapters.

The authors argue that:

Safety Audits are Incomplete: If an organization deploys a safety-fine-tuned LoRA model, standard SAE-based audits may fail to detect adapter-encoded representations because the base dictionary cannot reconstruct the delta signal.
Mechanistic Explanation for Fragility: The geometric separation offers a mechanistic account for why safety fine-tuning can be easily undone; subsequent fine-tuning may simply shift the model into a distinct subspace that the original safety constraints (encoded in the base geometry) do not effectively monitor.
Methodological Solution: The Delta SAE framework is proposed as a necessary tool for feature-level auditing of fine-tuned models, enabling the isolation and analysis of adapter-specific contributions.

The study concludes that while LoRA adapters increase representational capacity (density) with higher ranks, they fundamentally operate in a distinct geometric subspace, necessitating new interpretability approaches for fine-tuned models.

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models