SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

SPARC is a framework that unifies concept representations across diverse AI architectures and modalities. By enforcing global sparsity and a cross-reconstruction loss, it creates a shared latent space that enables direct cross-model interpretability and applications such as text-guided localization, all without manual alignment.

Ali Nasiri-Sarvi, Hassan Rivaz, Mahdi S. Hosseini

Published 2026-03-09

Imagine you have two friends, Alex and Jamie. Both of them are experts at describing a picture of a cat.

  • Alex speaks a language where the word "fluffy" means "cat."
  • Jamie speaks a language where the word "whiskers" means "cat."

If you ask them to describe a photo of a cat, they both give you a great description. But if you try to compare their notes, it's a mess. You can't easily tell that they are talking about the same thing because their internal dictionaries are completely different. This is exactly the problem with modern AI models.

The Problem: AI's "Tower of Babel"

Today, we have many powerful AI models (like DINO for vision and CLIP for understanding images and text). They are all smart, but they "think" in their own isolated languages.

  • Model A might have a specific neuron that lights up for "dogs."
  • Model B might have a totally different neuron for "dogs," and it might use a third neuron for "cats."

Because they don't share a common language, it's incredibly hard to compare them, check if they are biased, or combine their strengths. It's like trying to build a bridge between two islands that have no common ground.

The Solution: SPARC (The Universal Translator)

The researchers behind this paper created a new tool called SPARC. Think of SPARC as a Universal Translator or a Shared Notebook that forces these different AI models to agree on a single, common language.

Here is how it works, using a simple analogy:

1. The "Global TopK" Rule (The Strict Teacher)

Imagine a classroom with 100 students (the AI models). Usually, if you ask a question, Student A might raise their hand to answer, while Student B stays silent, and Student C raises a different hand. They are all answering, but not in sync.

SPARC introduces a strict rule called Global TopK.

  • The teacher (SPARC) asks a question about a "cat."
  • Instead of letting each student pick their own hand to raise, the teacher looks at all the students together and says, "Okay, for the concept of 'cat,' everyone must raise Hand #42."
  • If Student A tries to raise Hand #43 instead, they are told to stop. If Student B hesitates, they are nudged to raise Hand #42 along with everyone else.

Why this matters: This forces every model to use the exact same "switch" (neuron) for the same concept. Now, if you see Hand #42 go up, you know everyone is talking about a cat, no matter which model you are looking at.
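To make the "strict teacher" concrete, here is a minimal sketch of the Global TopK idea in Python. It assumes a simplified setup (not the paper's exact formulation): each model's sparse autoencoder produces activations over the *same* shared dictionary, and we keep only the k dictionary slots with the largest total activation across all models, zeroing everything else in every model. The function name and the sum-of-magnitudes scoring are illustrative choices, not the authors' code.

```python
import numpy as np

def global_topk(latents, k):
    """Keep the k dictionary slots whose total activation (summed
    across all models) is largest; zero out every other slot.
    `latents`: list of arrays, one per model, each (dict_size,),
    all indexed against the same shared dictionary."""
    stacked = np.stack(latents)            # (n_models, dict_size)
    scores = np.abs(stacked).sum(axis=0)   # shared importance per slot
    keep = np.argsort(scores)[-k:]         # top-k slot indices, chosen globally
    mask = np.zeros(scores.shape, dtype=bool)
    mask[keep] = True
    # every model keeps the SAME slots -- the "everyone raises Hand #42" rule
    return [np.where(mask, z, 0.0) for z in latents]

# toy example: two "models", a 6-slot shared dictionary
z_a = np.array([0.1, 0.0, 2.0, 0.0, 0.3, 0.0])
z_b = np.array([0.0, 0.2, 1.5, 0.0, 0.4, 0.0])
za_s, zb_s = global_topk([z_a, z_b], k=2)
# slots 2 and 4 have the largest combined activation, so both
# models keep exactly those two slots and silence the rest
```

Because the top-k selection happens over the pooled scores rather than per model, a slot survives only if it matters jointly, which is what forces the models onto the same "switches."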

2. The "Cross-Reconstruction" Game (The Translation Drill)

Now that they are all using the same switches, SPARC plays a game to make sure they actually mean the same thing.

  • It takes the "cat" signal from Model A and asks Model B to rebuild the picture of the cat using that signal.
  • Then it takes the signal from Model B and asks Model A to rebuild it.

If Model A's signal for "cat" is actually about "dogs," Model B will fail to rebuild the cat picture. This forces the models to align their meanings, not just their switches. They have to agree on what "cat" actually looks like.
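The "translation drill" can be sketched as a loss function. The snippet below is a toy illustration, assuming simple linear SAE encoders/decoders with ReLU and squared-error reconstruction; the weight matrices, dimensions, and random inputs are all hypothetical stand-ins for real model embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, d_dict = 8, 8, 16   # hypothetical embedding and dictionary sizes

# hypothetical SAE weights for Model A and Model B (shared dictionary size)
Wa_enc, Wa_dec = rng.normal(size=(d_dict, d_a)), rng.normal(size=(d_a, d_dict))
Wb_enc, Wb_dec = rng.normal(size=(d_dict, d_b)), rng.normal(size=(d_b, d_dict))

def encode(W, x): return np.maximum(W @ x, 0.0)  # ReLU SAE encoder
def decode(W, z): return W @ z                   # linear SAE decoder

x_a = rng.normal(size=d_a)  # Model A's embedding of one image
x_b = rng.normal(size=d_b)  # Model B's embedding of the SAME image

z_a, z_b = encode(Wa_enc, x_a), encode(Wb_enc, x_b)

# self-reconstruction: each SAE rebuilds its own input from its own code
self_loss = (np.sum((x_a - decode(Wa_dec, z_a))**2)
             + np.sum((x_b - decode(Wb_dec, z_b))**2))

# cross-reconstruction: A's code must rebuild B's embedding, and vice versa.
# This only gets small if slot i means the same concept in both models.
cross_loss = (np.sum((x_b - decode(Wb_dec, z_a))**2)
              + np.sum((x_a - decode(Wa_dec, z_b))**2))

total_loss = self_loss + cross_loss
```

Minimizing the cross terms is what punishes a model whose "cat" slot secretly encodes "dog": the other model's decoder would then rebuild the wrong thing.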

What Can We Do With This?

Once SPARC has taught all these models to speak the same language, some amazing things happen:

  • Spotting Bias Instantly: If you want to know if an AI is racist or sexist, you don't have to check every single model one by one. You just check the "Shared Notebook." If the "bad concept" is there, you know it's in all the models.
  • Text-to-Space Magic: You can type "Find the cat" into a system that only understands images (like a security camera), and because the systems now share a language, the camera can instantly point to where the cat is, even though it was never taught to understand text directly.
  • Better Search: You can search for a picture using a text description, and the system will find the perfect match because the text and the image are now speaking the same dialect.
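The "Text-to-Space" idea reduces to a simple lookup once the dictionaries are shared. The sketch below assumes (hypothetically) that a text model's SAE and a vision model's SAE already emit codes over the same dictionary; scoring image patches by their overlap with the text code then localizes the concept. The slot indices and activation values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_dict, n_patches = 16, 4

# hypothetical shared-dictionary codes:
# the text query "cat" fires shared slot 5...
z_text = np.zeros(d_dict)
z_text[5] = 1.0
# ...and each image patch has its own code; patch 2 contains the cat
z_patches = rng.random((n_patches, d_dict)) * 0.1  # background noise
z_patches[2, 5] = 0.9

# overlap in the shared dictionary tells us WHERE the concept is
scores = z_patches @ z_text
best_patch = int(np.argmax(scores))  # → 2
```

No text was ever shown to the vision model; the shared slot index is the whole bridge.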

The Result

The paper shows that SPARC is a massive improvement. Before, different models agreed on concepts only about 22% of the time (like two people guessing the same word by chance). With SPARC, they agree 80% of the time.

In a Nutshell

SPARC is like building a common operating system for AI. Instead of every AI model running its own isolated software, SPARC installs a shared interface where "Dog," "Car," and "Sunset" mean the exact same thing to everyone. This makes AI more transparent, easier to debug, and much more powerful when we try to use different models together.