Sparse Crosscoders for diffing MoEs and Dense models

The Big Picture: Two Kitchens, One Recipe

Imagine you are trying to understand how two different restaurants cook the exact same dish (let's say, a complex lasagna).

Restaurant A (The Dense Model): This is a traditional kitchen. Every time an order comes in, every single chef in the kitchen gets to work on it. They all chop, sauté, and bake together. It's powerful, but it takes a lot of energy and resources.
Restaurant B (The MoE Model): This is a modern, high-tech kitchen. It has a huge staff of 1,000 specialized chefs (experts), but for every single order, the manager only picks two or three specific chefs to work on it. The rest of the kitchen stays idle. This is much more efficient, but because the chefs are so specialized, it's harder to understand exactly what they are doing inside their heads.

The Problem: We know both restaurants make great lasagna, but we don't really know how their internal thinking processes differ. Do the specialized chefs in Restaurant B think differently than the general chefs in Restaurant A?

The Tool: The "Universal Translator" (Crosscoders)

To solve this, the researchers built a special tool called a Crosscoder.

Think of a Crosscoder as a Universal Translator or a Shared Notebook.

Instead of trying to read the chefs' minds directly (which is messy), the researchers feed the same ingredients (text data) into both kitchens.
They watch what happens in the middle of the cooking process (the "activations").
The Crosscoder tries to find a common language that can describe what is happening in both kitchens simultaneously.

It asks: "Is there a concept here that both kitchens use? And are there concepts that only one kitchen uses?"

The Experiment: What They Found

The researchers trained both kitchens on a large library of scientific papers, code, and stories. Then, they used their Universal Translator to compare the "thoughts" of the two kitchens. Here is what they discovered:

1. The "Specialist" vs. The "Generalist"

The Dense Kitchen (Generalist): This kitchen developed a huge variety of unique tools and techniques. They have thousands of different ways to handle specific details. Their "thoughts" are spread out across many different, broad concepts.
The MoE Kitchen (Specialist): This kitchen learned far fewer unique tricks. Instead of having a tool for every tiny detail, they developed a few highly specialized, laser-focused tools.
- Analogy: The Dense kitchen has a drawer with thousands of different screwdrivers, each for a slightly different screw. The MoE kitchen has only a few hundred, but each one is built for a very specific job. (The counts reported in the results are 3,226 Dense-only features versus 910 MoE-only features.)

2. How Often They Use Their Tools (Activation Density)

MoE Features: The MoE kitchen's unique, specialized tools are in use on a larger share of orders than the tools both kitchens have in common. It's like a chef who reaches for a favorite knife on most dishes.
Dense Features: The Dense kitchen's unique tools come out less often than the common ones. They have so many tools that they only pull out a specific one when it is really needed.

3. The "Shared" Language

The researchers found that both kitchens share a lot of basic vocabulary: most of the concepts the Universal Translator learned are used by both kitchens, and the translator could reconstruct about 87% of the variation in their "thoughts". However, the MoE kitchen organizes its unique thoughts more tightly. It doesn't spread its information around as loosely as the Dense kitchen; it packs it into tight, focused bundles.

Why This Matters

Before this study, we knew MoE models were faster and cheaper to run. But we didn't know why they worked so well internally.

This paper suggests that MoE models aren't just cheaper-to-run versions of dense models. Their internal representations appear to be organized differently. They act like a team of hyper-specialized experts who communicate in a very focused way, whereas dense models act like a large team of generalists who spread the work out.

The Takeaway

If you want to build a super-efficient AI, you don't need to make it "think" like a human with a million scattered thoughts. You can build it like a specialized task force: a smaller group of experts who know exactly what to do, when to do it, and how to do it with extreme precision.

The researchers also noted that their "Universal Translator" (Crosscoder) needed some tweaking to work on these two very different types of kitchens, suggesting that we need new tools to fully understand these next-generation AI architectures.

1. Problem Statement

While Mixture of Experts (MoE) architectures have become standard for scaling Large Language Models (LLMs) because they add parameters without a proportional increase in per-token compute, their internal representational mechanisms remain poorly understood compared to dense models.

The Gap: Existing interpretability research (e.g., using Sparse Autoencoders) is well-established for dense models but has not been systematically applied to compare MoEs against dense counterparts.
Key Questions:
- Do MoE experts develop distinct feature representations compared to dense layers?
- How does sparse routing influence feature specialization and diversity?
- Do existing intuitions about dense model internals hold true for MoEs?
Challenge: Standard crosscoders (which jointly model two activation spaces) were developed for comparing a base model with its fine-tuned variant. When applied to structurally different models trained from scratch, they tend to overestimate shared structure, which leads to poor interpretability.

2. Methodology

The authors employed a systematic experimental design involving model training, crosscoder adaptation, and comparative analysis.

A. Model Training

Architectures: A 5-layer Dense model and a 5-layer MoE model were trained.
Constraints: Both models were trained with matched active parameters to ensure a fair comparison of representational capacity.
Dataset: ~1 billion tokens comprising three domains:
- ArXiv scientific text (RedPajama).
- Code (StarCoder).
- English stories (SimpleStories).
Training: Both models were trained for 2 epochs using Cross Entropy loss. The MoE utilized an additional Switch load balancing loss.
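The Switch load balancing loss mentioned above can be sketched as follows. This is a minimal NumPy illustration of the commonly cited form $\alpha \cdot N \cdot \sum_i f_i P_i$ with top-1 routing, not the authors' training code; the function names, the coefficient `alpha=0.01`, and the toy router logits are all assumptions made for this example.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def switch_load_balancing_loss(router_logits: np.ndarray, alpha: float = 0.01) -> float:
    """Auxiliary loss alpha * N * sum_i f_i * P_i over N experts.

    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i.
    `alpha` is an assumed coefficient, not a value taken from the paper.
    """
    num_tokens, num_experts = router_logits.shape
    probs = softmax(router_logits)
    top1 = probs.argmax(axis=-1)
    f = np.bincount(top1, minlength=num_experts) / num_tokens
    p = probs.mean(axis=0)
    return float(alpha * num_experts * np.sum(f * p))

# Toy router logits (hypothetical): 8 tokens, 4 experts.
balanced_logits = 10.0 * np.eye(4)[np.arange(8) % 4]      # tokens spread evenly
collapsed_logits = 10.0 * np.eye(4)[np.zeros(8, dtype=int)]  # every token picks expert 0

balanced_loss = switch_load_balancing_loss(balanced_logits)
collapsed_loss = switch_load_balancing_loss(collapsed_logits)
```

In this form the loss is smallest when tokens are spread evenly over the experts and grows as routing collapses onto a single expert, which is why it is added to the Cross Entropy objective.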

B. Crosscoder Adaptation

To compare the internal representations, the authors trained a Crosscoder on the third-layer activations of both models.

Standard Crosscoder: Learns shared sparse features $f_i(x)$ to reconstruct activations for both Model A (Dense) and Model B (MoE) using model-specific decoder weights.
BatchTopK Variant: Replaces the continuous $L_1$ penalty with a hard sparsity constraint, keeping only the top $K$ activations per sample on average across a batch to enforce a fixed sparsity budget.
Fixed Shared-Feature Variant: To address the issue of over-estimating shared structure, the authors explicitly designated a subset of features $S$ as "shared" (tied decoder parameters) and the rest $F$ as "exclusive."
- Hyperparameter Tuning: Prior work suggested a sparsity penalty ratio $\lambda_s/\lambda_f \approx 0.1\text{--}0.2$. However, the authors found this ineffective for their setting (two independently trained models). They determined that a higher ratio of $\approx 0.7$ was necessary to effectively distinguish model-specific features.
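To make the variants above concrete, here is a forward-pass-only sketch of a BatchTopK crosscoder with a tied decoder for the shared set $S$. It is an illustration under stated assumptions, not the authors' implementation: the class and argument names are hypothetical, both models are assumed to have the same hidden size `d_model`, decoder biases and the training loop are omitted, and all weights are random placeholders.

```python
import numpy as np

def batch_topk(acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * batch_size largest activations in the whole batch; zero the rest."""
    budget = k * acts.shape[0]
    if budget <= 0:
        return np.zeros_like(acts)
    flat = acts.ravel()
    if budget >= flat.size:
        return acts.copy()
    keep = np.argpartition(flat, -budget)[-budget:]  # indices of the top `budget` values
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(acts.shape)

class SharedFeatureCrosscoder:
    """Hypothetical sketch: latents [0, n_shared) use one tied decoder, the rest one per model."""

    def __init__(self, d_model: int, n_shared: int, n_free: int, k: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        n_latents = n_shared + n_free
        self.k, self.n_shared = k, n_shared
        self.W_enc_dense = rng.normal(0.0, 0.1, (d_model, n_latents))
        self.W_enc_moe = rng.normal(0.0, 0.1, (d_model, n_latents))
        self.b_enc = np.zeros(n_latents)
        self.W_dec_shared = rng.normal(0.0, 0.1, (n_shared, d_model))  # tied across models
        self.W_dec_dense = rng.normal(0.0, 0.1, (n_free, d_model))     # Dense-side decoder
        self.W_dec_moe = rng.normal(0.0, 0.1, (n_free, d_model))       # MoE-side decoder

    def encode(self, x_dense: np.ndarray, x_moe: np.ndarray) -> np.ndarray:
        """One latent vector per token, computed from both models' activations."""
        pre = x_dense @ self.W_enc_dense + x_moe @ self.W_enc_moe + self.b_enc
        return batch_topk(np.maximum(pre, 0.0), self.k)

    def decode(self, f: np.ndarray):
        """Reconstruct each model's activations from the same latents."""
        f_shared, f_free = f[:, :self.n_shared], f[:, self.n_shared:]
        common = f_shared @ self.W_dec_shared
        return common + f_free @ self.W_dec_dense, common + f_free @ self.W_dec_moe

# Toy usage with random stand-ins for the layer activations.
rng = np.random.default_rng(1)
x_dense, x_moe = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
cc = SharedFeatureCrosscoder(d_model=16, n_shared=4, n_free=12, k=3)
latents = cc.encode(x_dense, x_moe)
recon_dense, recon_moe = cc.decode(latents)
```

Because the budget is enforced over the whole batch rather than per token, individual tokens may use more or fewer than `k` latents as long as the batch total stays within `k * batch_size`.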

C. Feature Classification Metric

To quantify whether a feature is shared or model-specific, they defined a metric based on the relative difference of the squared decoder norms of each latent $i$ ($\Delta_{norm}$), where $W^{dense}_i$ and $W^{MoE}_i$ are the decoder vectors of latent $i$ for the Dense and MoE models:
$\Delta_{norm}(i) = \frac{1}{2} \left( \frac{\|W^{dense}_i\|^2 - \|W^{MoE}_i\|^2}{\max(\|W^{dense}_i\|^2, \|W^{MoE}_i\|^2)} + 1 \right)$

$\Delta_{norm} \approx 0.5$: Feature is equally shared.
$\Delta_{norm} \approx 0$: Feature is exclusive to MoE.
$\Delta_{norm} \approx 1$: Feature is exclusive to Dense.
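The metric can be computed directly from the two decoder matrices. The sketch below follows the formula above term by term; the function names are hypothetical, and the 0.3 and 0.7 cutoffs used for labeling are illustrative assumptions rather than thresholds confirmed by the source.

```python
import numpy as np

def delta_norm(W_dense: np.ndarray, W_moe: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-latent relative difference of squared decoder norms, mapped to [0, 1].

    W_dense, W_moe: arrays of shape (n_latents, d_model), one decoder row per latent.
    """
    sq_dense = np.sum(W_dense ** 2, axis=1)
    sq_moe = np.sum(W_moe ** 2, axis=1)
    denom = np.maximum(np.maximum(sq_dense, sq_moe), eps)  # eps guards dead latents
    return 0.5 * ((sq_dense - sq_moe) / denom + 1.0)

def classify(delta: np.ndarray, low: float = 0.3, high: float = 0.7) -> list:
    """Label each latent; the cutoffs are assumed values for illustration only."""
    return ["moe" if d < low else "dense" if d > high else "shared" for d in delta]

# Toy decoders: latent 0 has equal norms, latent 1 exists only in the MoE, latent 2 only in Dense.
W_dense = np.array([[1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
W_moe = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 0.0]])
delta = delta_norm(W_dense, W_moe)
labels = classify(delta)
```

On the toy decoders this yields values of 0.5, 0, and 1, matching the three cases listed above.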

3. Key Results

A. Reconstruction Performance

The optimized BatchTopK crosscoder with fixed shared features reached roughly 87% fractional variance explained over 40k training steps, indicating that it captures most of the variance in both models' activations.
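For reference, fractional variance explained is usually computed as one minus the ratio of residual to total variance. The source does not spell out its exact definition, so the sketch below assumes this common form; the function name and toy data are hypothetical.

```python
import numpy as np

def fraction_variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - (squared reconstruction error) / (squared deviation from the per-dimension mean)."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return float(1.0 - residual / total)

# Toy activations: 4 tokens, 2 dimensions.
x = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [7.0, 2.0]])
fve_perfect = fraction_variance_explained(x, x)  # exact reconstruction
fve_mean = fraction_variance_explained(x, np.tile(x.mean(axis=0), (4, 1)))  # always predict the mean
```

Under this definition a perfect reconstruction scores 1 and always predicting the mean activation scores 0, so a value near 0.87 means about 13% of the variance is left unexplained.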

B. Feature Distribution and Count

Unique Features: The Dense model learned significantly more unique (exclusive) features than the MoE.
- Dense-only features: 3,226
- MoE-only features: 910
- Shared features: 18,940
Conclusion: MoEs learn fewer but more specialized features compared to the broader feature set of dense models.
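As a quick check on the proportions, the counts above can be turned into shares. The only assumption in this sketch is that the three reported counts cover all classified features.

```python
# Feature counts as reported in the results above.
counts = {"dense_only": 3226, "moe_only": 910, "shared": 18940}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
dense_to_moe_ratio = counts["dense_only"] / counts["moe_only"]

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"dense-only / moe-only: {dense_to_moe_ratio:.2f}")
```

Under that assumption, roughly 82% of the classified features are shared, about 14% are Dense-only, and about 4% are MoE-only, so the Dense model has about 3.5 times as many exclusive features as the MoE.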

C. Activation Density Patterns

The study revealed a distinct density pattern that contrasts with previous findings on fine-tuned models:

MoE-specific features: Exhibit higher activation density than shared features.
Dense-specific features: Exhibit lower activation density than shared features.
Interpretation: This suggests MoEs concentrate information into highly active, specialized "expert" features, whereas dense models distribute information more broadly across general-purpose features.
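Activation density here can be read as the fraction of tokens on which a feature fires. The sketch below computes it per feature and averages it per group; the helper names are hypothetical, and the latent matrix is toy data chosen only to show the mechanics, not results from the study.

```python
import numpy as np

def activation_density(latents: np.ndarray) -> np.ndarray:
    """Fraction of tokens (rows) on which each feature (column) is active."""
    return (latents > 0).mean(axis=0)

def mean_density_by_group(density: np.ndarray, labels: list) -> dict:
    """Average the per-feature densities within each label group."""
    groups = {}
    for d, label in zip(density, labels):
        groups.setdefault(label, []).append(float(d))
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

# Toy latents: 4 tokens x 4 features, with made-up group labels per feature.
latents = np.array([
    [1.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [5.0, 1.0, 0.0, 0.0],
])
labels = ["moe", "dense", "shared", "shared"]
density = activation_density(latents)
group_density = mean_density_by_group(density, labels)
```

Comparing the group averages is what supports the ordering described above: MoE-specific features above shared, and Dense-specific features below shared.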

D. Structural Observations

Cosine Similarity: While budgeted shared features showed high cosine similarity ($\sim 1$), other "shared" features (in the $\Delta_{norm}$ range 0.3–0.7) did not show high similarity; some even exhibited opposite directions ($\approx -1$).
Distribution Shape: Unlike the trimodal distribution typically seen when diffing base and fine-tuned models, the feature distribution here did not show a clear trimodal structure, indicating a more complex relationship between the two architectures.
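The cosine similarity observation can be reproduced on any pair of decoder matrices with a per-latent comparison. This is a minimal sketch; the function name and the toy decoders are assumptions for illustration.

```python
import numpy as np

def decoder_cosine_similarity(W_dense: np.ndarray, W_moe: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity between the Dense and MoE decoder vectors of each latent (row)."""
    dot = np.sum(W_dense * W_moe, axis=1)
    norms = np.linalg.norm(W_dense, axis=1) * np.linalg.norm(W_moe, axis=1)
    return dot / np.maximum(norms, eps)  # eps avoids division by zero for dead latents

# Toy decoders: aligned, opposite, and orthogonal directions.
W_dense = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
W_moe = np.array([[2.0, 0.0], [-3.0, 0.0], [1.0, -1.0]])
cos = decoder_cosine_similarity(W_dense, W_moe)
```

The third toy latent has equal decoder norms in both models, so its $\Delta_{norm}$ is 0.5, yet its two decoder directions are orthogonal. This is why a norm-based "shared" label alone does not guarantee that the two models represent the feature in the same direction.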

4. Key Contributions

Systematic Comparison: First systematic application of crosscoders to compare MoE and dense model internals trained from scratch with matched active parameters.
Methodological Refinement: Identified that standard crosscoder hyperparameters (specifically the shared-feature sparsity ratio) fail for structurally distinct models and proposed a new ratio ($\approx 0.7$) to handle independent training divergence.
Architectural Insights: Demonstrated that MoEs develop more specialized, focused representations (fewer unique features, higher density) while dense models utilize broader, general-purpose features.
Tool Extension: Showed that crosscoders can be extended beyond fine-tuning analysis to study fundamental architectural differences, though further work is needed to handle structural divergence.

5. Significance

This work provides a foundational step toward mechanistic interpretability in sparse architectures. By revealing that MoEs organize information differently, favoring localized specialization over broad distribution, it challenges the assumption that MoEs are merely "efficient dense models." These findings offer a new lens for understanding how MoEs scale and how their internal "experts" function, paving the way for future research into optimizing routing strategies and improving the interpretability of state-of-the-art sparse LLMs.