CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles

The Big Problem: The "One-Shot" Guess

Imagine you are trying to understand how a complex machine (like a modern AI) works. You want to find the specific gears and levers inside it that make it do a specific task, like answering a math question.

Currently, scientists do this by running a test and then "pruning" the results. Think of it like a gardener trimming a massive, overgrown bush to find the main branches.

The Problem: The gardener has to decide how much to cut. If they cut a little, they get a huge, messy bush. If they cut a lot, they might accidentally chop off a vital branch.
The Issue: There is no "perfect" amount to cut. If the scientist changes their mind by just a tiny bit (e.g., "I'll cut 90% instead of 89%"), they might end up with a completely different set of branches. This makes the explanation feel brittle and unreliable. It's like getting a different map every time you ask for directions, depending on which compass the mapmaker happened to use.

The Solution: CIRCUS (The "Committee of Experts")

The author of this paper, Swapnil Parekh, proposes a method called CIRCUS (Circuit Consensus under Uncertainty via Stability Ensembles).

Instead of asking one gardener to trim the bush once and hoping for the best, CIRCUS asks a whole committee of gardeners to trim the same bush, but each one uses slightly different rules.

The Ensemble (The Committee): The computer runs the analysis many times with slightly different settings (different "pruning thresholds").
The Vote: After all the gardeners are done, the system looks at the branches.
- If every single gardener kept a specific branch, that branch is 100% stable. It's a "Core" branch.
- If only half the gardeners kept a branch, it's contingent (it might be important, but it's shaky).
- If only one gardener kept it, it's likely noise (just a random artifact).
The Consensus: The final "circuit" is built only from the branches that everyone agreed on.

The Magic Results

The paper tested this on two popular AI models (Gemma and Llama) and found some surprising things:

Tiny but Mighty: The "Core" circuit (the branches everyone agreed on) was about 40 times smaller than the total mess you get if you just combine all the gardeners' work. Yet it still accounted for most of the influence on the AI's answer (a score of 0.78, versus 0.93 for the total mess).
Better than Just Going by Size: If you took the total mess and cut it down to the same small size just by picking the "loudest" branches, it explained less than the Consensus method (0.73 versus 0.78). Agreement is a better guide than loudness alone.
Real Causality: To check that these branches actually do the work, the researchers performed a kind of "surgery" called activation patching: they take a run where the AI's internal signals have been corrupted and swap clean signals back into specific parts. Restoring the parts picked by the Consensus helped consistently more than restoring matched parts outside the Consensus. This is evidence that the Consensus circuit is causally relevant, not just a stable artifact.

The "Core, Contingent, Noise" Taxonomy

CIRCUS doesn't just give you one answer; it gives you a report card with three categories:

The Core (The Rock): These are the edges (connections) that appeared in every view. They are the most trustworthy, auditable part of the explanation, because they do not depend on which settings were used.
The Contingent (The Maybe): These appeared in some views but not others. They are "alternative pathways." If the Core is the main highway, these are the scenic backroads that might work, but only under certain conditions.
The Noise (The Static): These appeared rarely. They are likely just glitches or artifacts of the specific settings used. You can safely ignore them.

Why This Matters

Think of it like a jury trial.

Old Way: One judge looks at the evidence and gives a verdict. If the judge had a bad day or a slight bias, the verdict might be wrong.
CIRCUS Way: You have a jury of 25 people. If 25/25 people agree the defendant is guilty, that is a strong, stable verdict. If only 13/25 agree, you know there is uncertainty, and you shouldn't act on it as a fact.

Summary

CIRCUS is a tool that stops scientists from trusting a single, fragile explanation of how an AI works. Instead, it runs the analysis many times, finds the "common ground" where everyone agrees, and gives you a small, super-reliable, and trustworthy map of the AI's brain, while clearly flagging the parts that are still uncertain.

It's the difference between saying, "I think this is the path," and saying, "Every run agreed on this path, and here are the other paths we aren't sure about yet."

1. Problem Statement

Mechanistic circuit discovery aims to identify sparse subgraphs within neural networks that causally support specific behaviors. Current pipelines typically involve:

Replacing MLP layers with interpretable feature models (e.g., cross-layer transcoders).
Building attribution graphs where nodes represent features, tokens, and logits, and edges represent direct effects.
Pruning the graph based on cumulative influence thresholds to retain a compact circuit.

The Core Issue: These pipelines are highly sensitive to arbitrary analyst choices, specifically:

Pruning thresholds: Small changes in node/edge thresholds yield vastly different edge sets.
Feature dictionaries: Different transcoder checkpoints produce different feature representations.

Consequently, standard "one-shot" explanations are brittle and lack a principled notion of uncertainty. There is no established method to distinguish between stable structural insights and artifacts caused by specific parameter choices.

2. Methodology: CIRCUS

The paper proposes CIRCUS (Circuit Consensus under Uncertainty via Stability Ensembles), which reframes circuit discovery as an uncertainty quantification problem over analytic degrees of freedom. Instead of reporting a single graph, CIRCUS aggregates multiple views to identify a robust "core."

Key Steps:

Config-Bagging (Ensemble Generation):
- Perform a single raw attribution run.
- Apply $B$ different pruning configurations (varying thresholds and potentially dictionaries) to generate $B$ distinct pruned graphs (views).
- Note: This requires no model retraining and adds negligible computational overhead.
Stability Scoring:
- For every edge $e$ in the full graph, calculate a stability score $s(e)$, where $E(b)$ is the edge set of the $b$-th pruned view:
  $s(e) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{I}[e \in E(b)]$
- $s(e)$ represents the fraction of configurations that retain the edge. It measures agreement across analyst choices, not necessarily causal correctness.
Consensus Extraction:
- Define a consensus subgraph $C_\tau$ containing edges where $s(e) \geq \tau$.
- Strict Consensus ($\tau = 1$): Edges appearing in all views. This forms the "core" circuit.
- Exploratory Consensus ($\tau < 1$): Edges appearing in at least a fraction $\tau$ of views (a majority of views when $\tau > 0.5$), used to surface contingent alternatives.
Uncertainty Decomposition & Taxonomy:
The method categorizes edges into three tiers:
- Core: $s(e) = 1$. Threshold-robust, high-confidence structure.
- Contingent: Medium stability ($0 < s(e) < 1$) but high marginal influence. These represent alternative pathways dependent on specific config choices.
- Noise: Low stability and low influence. These can be rejected.
Boosting (Residual Analysis):
- If the core consensus ($C_1$, the strict consensus at $\tau = 1$) retains low total influence, a boosting step is performed.
- A residual graph is constructed by zeroing out $C_1$ edges.
- A second circuit ($C_2$) is extracted from the residual graph to capture missing influence, creating a tiered explanation ($C_1 \cup C_2$). Here the subscript 2 indexes the boosting round, not a threshold $\tau$.
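
The scoring, consensus, and taxonomy steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes each view is simply a set of edge identifiers, and the function names (`stability_scores`, `consensus`, `classify`) and the `influence_cutoff` parameter are hypothetical, since this summary gives no numeric boundary between contingent and noise edges.

```python
from collections import Counter


def stability_scores(views):
    """Return s(e): the fraction of views (pruned edge sets) that retain each edge.

    Edges that appear in no view are simply absent from the result (s(e) = 0).
    """
    counts = Counter(edge for view in views for edge in set(view))
    n_views = len(views)
    return {edge: count / n_views for edge, count in counts.items()}


def consensus(scores, tau=1.0):
    """Return C_tau: the set of edges with s(e) >= tau."""
    return {edge for edge, s in scores.items() if s >= tau}


def classify(scores, influence, influence_cutoff):
    """Assign each scored edge to 'core', 'contingent', or 'noise'.

    Assumption: influence_cutoff is a hypothetical parameter separating
    contingent from noise edges; the source does not specify one.
    """
    tiers = {}
    for edge, s in scores.items():
        if s == 1.0:
            tiers[edge] = "core"
        elif influence.get(edge, 0.0) >= influence_cutoff:
            tiers[edge] = "contingent"
        else:
            tiers[edge] = "noise"
    return tiers


# Toy data (invented): B = 4 views, edges named by (source, target) pairs.
views = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("b", "c"), ("x", "y")},
    {("a", "b"), ("c", "d")},
]
influence = {("a", "b"): 0.5, ("b", "c"): 0.3, ("c", "d"): 0.15, ("x", "y"): 0.01}

scores = stability_scores(views)
core = consensus(scores, tau=1.0)        # strict consensus
majority = consensus(scores, tau=0.5)    # exploratory consensus
tiers = classify(scores, influence, influence_cutoff=0.1)
```

In this toy run only the edge ("a", "b") reaches $s(e) = 1$ and forms the core; lowering $\tau$ to 0.5 admits two more edges, and the rarely seen, low-influence edge ("x", "y") is labeled noise.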

3. Key Contributions

Methodological Innovation: Introduces a "bagged" attribution pipeline that treats pruning thresholds as a design set, quantifying structural uncertainty via edge stability scores.
Threshold-Robustness: Demonstrates that strict consensus circuits are significantly smaller than the union of all configurations while retaining comparable explanatory power.
Actionable Taxonomy: Provides a framework to report "Core" (stable), "Contingent" (disputed but influential), and "Noise" (unreliable) edges, allowing users to reject low-agreement structures.
Efficiency: The method aggregates structure from existing pruned graphs, adding negligible cost (milliseconds) compared to the initial attribution computation.

4. Experimental Results

The method was evaluated on Gemma-2-2B and Llama-3.2-1B models using cross-layer transcoders.

Size vs. Influence Trade-off:
- Strict consensus circuits ($\tau = 1$) were ~40× smaller than the union of all configurations (e.g., 625 edges vs. 25,478 edges) while retaining most of the Influence Retained (IR) score (0.78 vs. 0.93).
- Consensus outperformed a "same-edge-budget" baseline (union pruned to match consensus size), achieving higher IR (0.78 vs. 0.73).
Causal Validation (Activation Patching):
- Nodes identified by the consensus were tested via activation patching (replacing corrupted activations with clean ones).
- Consensus nodes consistently outperformed matched non-consensus controls with high statistical significance ($p = 0.0004$).
- This confirms that consensus edges are not just stable artifacts but causally relevant to the model's predictions.
Robustness:
- Across 20 different prompts, the consensus maintained high IR (mean 0.83, min 0.77), satisfying sanity checks in 100% of cases.
- Stability-weighted selection (prioritizing edges by $\text{stability} \times \text{influence}$) improved robustness in worst-case scenarios compared to selecting purely by influence.
Stability-Influence Correlation:
- Edges present in all configurations had ~70× higher mean influence than edges present in only one configuration, validating that high-stability edges are both stable and consequential.
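
Stability-weighted selection, mentioned in the robustness results above, can be illustrated with a toy ranking. The sketch below assumes each edge already has an influence value and a stability score; `select_edges` and its parameters are hypothetical names, and the numbers are invented for illustration rather than taken from the paper.

```python
def select_edges(influence, stability, budget, stability_weighted=True):
    """Keep the `budget` highest-ranked edges.

    The ranking key is stability * influence when stability_weighted is True,
    and influence alone otherwise.
    """
    def key(edge):
        weight = stability.get(edge, 0.0) if stability_weighted else 1.0
        return weight * influence[edge]

    ranked = sorted(influence, key=key, reverse=True)
    return set(ranked[:budget])


# Invented toy values: e2 is influential but appears in few configurations.
influence = {"e1": 0.40, "e2": 0.35, "e3": 0.30, "e4": 0.05}
stability = {"e1": 1.0, "e2": 0.2, "e3": 1.0, "e4": 1.0}

by_influence = select_edges(influence, stability, budget=2, stability_weighted=False)
by_weighted = select_edges(influence, stability, budget=2, stability_weighted=True)
```

With a budget of two edges, influence-only ranking keeps e1 and e2, while the stability-weighted ranking swaps the unstable e2 for e3, which every configuration retained.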

5. Significance and Impact

Trustworthy Interpretability: CIRCUS moves the field from "one-shot" explanations to auditable, uncertainty-aware reporting. It explicitly separates stable mechanisms from artifacts caused by analyst choices.
Practical Utility: By providing a "Core/Contingent/Noise" decomposition, it gives researchers a principled way to reject unreliable structures and focus on the robust core of a model's computation.
Low Overhead: Since it operates on existing attribution runs without retraining, it is immediately applicable to current mechanistic interpretability workflows.
Future Direction: The paper establishes a foundation for future work on replacement-model masking and multi-CLT alignment, addressing the "faithfulness" gap in interpretability.

In summary, CIRCUS provides a rigorous statistical framework to handle the inherent variability in mechanistic circuit discovery, ensuring that reported circuits are robust to arbitrary parameter choices and causally validated.
