CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles

CIRCUS is a post-hoc framework that addresses the sensitivity of mechanistic circuit discovery to arbitrary analyst choices by constructing stability ensembles to extract a robust, uncertainty-aware consensus circuit that outperforms traditional baselines in both compactness and causal relevance.

Swapnil Parekh

Published 2026-03-03

The Big Problem: The "One-Shot" Guess

Imagine you are trying to understand how a complex machine (like a modern AI) works. You want to find the specific gears and levers inside it that make it do a specific task, like answering a math question.

Currently, scientists do this by running a test and then "pruning" the results. Think of it like a gardener trimming a massive, overgrown bush to find the main branches.

  • The Problem: The gardener has to decide how much to cut. If they cut a little, they get a huge, messy bush. If they cut a lot, they might accidentally chop off a vital branch.
  • The Issue: There is no "perfect" amount to cut. If the scientist changes their mind by just a tiny bit (e.g., "I'll cut 90% instead of 89%"), they might end up with a completely different set of branches. This makes the explanation feel brittle and unreliable. It's like getting a different map every time you ask for directions, depending on which compass the mapmaker happened to use.

The Solution: CIRCUS (The "Committee of Experts")

The author of this paper, Swapnil Parekh, proposes a method called CIRCUS (Circuit Consensus under Uncertainty via Stability Ensembles).

Instead of asking one gardener to trim the bush once and hoping for the best, CIRCUS asks a whole committee of gardeners to trim the same bush, but each one uses slightly different rules.

  1. The Ensemble (The Committee): The computer runs the analysis many times with slightly different settings (different "pruning thresholds").
  2. The Vote: After all the gardeners are done, the system looks at the branches.
    • If every single gardener kept a specific branch, that branch is 100% stable. It's a "Core" branch.
    • If only half the gardeners kept a branch, it's contingent (it might be important, but it's shaky).
    • If only one gardener kept it, it's likely noise (just a random artifact).
  3. The Consensus: The final "circuit" is built only from the branches that everyone agreed on.
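For readers who like to see the committee vote in code, here is a minimal sketch of the idea. It is not the paper's implementation: the edge names, the toy scores, the threshold values, and the helper names `stability_scores` and `strict_consensus` are all assumptions made up for illustration. Each "gardener" is one pruning threshold, and an edge's stability is the fraction of gardeners that kept it.

```python
from collections import Counter

def stability_scores(views):
    """Fraction of views (pruned edge sets) in which each edge survives."""
    counts = Counter(edge for view in views for edge in set(view))
    return {edge: n / len(views) for edge, n in counts.items()}

def strict_consensus(views):
    """Edges kept by every single view, i.e. the intersection of all views."""
    counts = Counter(edge for view in views for edge in set(view))
    return {edge for edge, n in counts.items() if n == len(views)}

# Toy attribution scores for four hypothetical edges (assumed values).
edge_scores = {("a", "b"): 0.9, ("b", "c"): 0.6, ("a", "c"): 0.35, ("c", "d"): 0.1}

# Each threshold plays the role of one gardener with a different rule.
thresholds = [0.2, 0.3, 0.4, 0.5]
views = [{e for e, w in edge_scores.items() if w >= t} for t in thresholds]

scores = stability_scores(views)
consensus = strict_consensus(views)
```

In this toy run, the edges ("a", "b") and ("b", "c") survive every threshold and form the consensus, ("a", "c") survives only half of them, and ("c", "d") is never kept at all.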

The Magic Results

The paper tested this on two popular AI models (Gemma and Llama) and found some surprising things:

  • Tiny but Mighty: The "Core" circuit (the branches everyone agreed on) was 40 times smaller than the total mess you get if you just combined all the gardeners' work. Yet it explained almost as much of how the AI thinks.
  • Better than Picking the Loudest: If you took the total mess and cut it down to the same small size just by picking the "loudest" branches, it explained less than the Consensus method. The "agreement" method is smarter than just picking the biggest pieces.
  • Real Causality: To prove these branches actually do the work, the researchers performed a "surgery" (called activation patching). They swapped the AI's brain parts with the ones identified by the Consensus. The AI worked perfectly. When they swapped in random parts, it failed. This proved the Consensus circuit is causally real, not just a lucky guess.
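The "loudest branches" comparison above only means something if both circuits are the same size. A rough sketch of how such a size-matched baseline could be built is shown below. The function name `top_k_by_magnitude`, the edge labels, and the scores are hypothetical, and treating "loudest" as largest absolute score is an assumption; how the paper measures explanatory power is not covered here.

```python
def top_k_by_magnitude(edge_scores, k):
    """Keep the k edges with the largest absolute score (the 'loudest' ones)."""
    ranked = sorted(edge_scores, key=lambda e: abs(edge_scores[e]), reverse=True)
    return set(ranked[:k])

# Hypothetical scores for every edge in the combined (union) graph.
union_scores = {"e1": 0.9, "e2": -0.8, "e3": 0.5, "e4": 0.4, "e5": 0.1}

# Hypothetical consensus circuit produced by the voting step.
core = {"e1", "e3", "e4"}

# Baseline with exactly as many edges as the consensus circuit.
baseline = top_k_by_magnitude(union_scores, k=len(core))
```

Here the baseline picks e1, e2, and e3 because they have the largest magnitudes, while the consensus circuit contains e4 instead of e2. The two circuits have the same size but different members, which is what makes the comparison fair.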

The "Core, Contingent, Noise" Taxonomy

CIRCUS doesn't just give you one answer; it gives you a report card with three categories:

  1. The Core (The Rock): These are the edges (connections) that appeared in every view. These are the trustworthy, auditable facts. You can bet your life on these.
  2. The Contingent (The Maybe): These appeared in some views but not others. They are "alternative pathways." If the Core is the main highway, these are the scenic backroads that might work, but only under certain conditions.
  3. The Noise (The Static): These appeared rarely. They are likely just glitches or artifacts of the specific settings used. You can safely ignore them.
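The report card can be pictured as a simple sorting step over stability scores. The sketch below is illustrative only: the function name `classify_edges` and the cutoffs `core_at` and `noise_below` are assumptions, since this summary only says that Core edges appear in every view and Noise edges appear rarely, and the 25-view ensemble is a toy number.

```python
def classify_edges(stability, core_at=1.0, noise_below=0.2):
    """Sort edges into core / contingent / noise by stability score.

    core_at and noise_below are hypothetical cutoffs chosen for illustration.
    """
    buckets = {"core": set(), "contingent": set(), "noise": set()}
    for edge, s in stability.items():
        if s >= core_at:
            buckets["core"].add(edge)        # kept by every view
        elif s < noise_below:
            buckets["noise"].add(edge)       # kept only rarely
        else:
            buckets["contingent"].add(edge)  # kept by some views, not all
    return buckets

# Toy stability scores from an assumed ensemble of 25 views.
stability = {"e1": 25 / 25, "e2": 13 / 25, "e3": 1 / 25}
buckets = classify_edges(stability)
```

With these toy numbers, e1 lands in the Core, e2 is Contingent, and e3 is Noise. Moving `noise_below` changes where the line between "maybe" and "static" falls, so it is itself an analyst choice worth reporting.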

Why This Matters

Think of it like a jury trial.

  • Old Way: One judge looks at the evidence and gives a verdict. If the judge had a bad day or a slight bias, the verdict might be wrong.
  • CIRCUS Way: You have a jury of 25 people. If 25/25 people agree the defendant is guilty, that is a strong, stable verdict. If only 13/25 agree, you know there is uncertainty, and you shouldn't act on it as a fact.

Summary

CIRCUS is a tool that stops scientists from trusting a single, fragile explanation of how an AI works. Instead, it runs the analysis many times, finds the "common ground" where everyone agrees, and gives you a small, super-reliable, and trustworthy map of the AI's brain, while clearly flagging the parts that are still uncertain.

It's the difference between saying, "I think this is the path," and saying, "We are 100% sure this is the path, and here are the other paths we aren't sure about yet."
