Imagine a deep neural network (the "brain" behind AI) as a massive, bustling city of millions of tiny workers (called neurons). Each worker has a specific job, like "spotting a dog's ear" or "detecting a red traffic light."
For a long time, scientists trying to understand these AI brains have been like tourists with a broken map. They look at a worker who is shouting loudly (high activation) and guess, "Ah, this guy must be the 'Dog Ear' specialist!" They write down a label and move on.
The Problem:
The paper argues that this old way of thinking has two big flaws:
- The "Busybody" Problem: Some workers are just noisy. They shout loudly at random things (like a dog's ear and a cat's tail and a patch of grass) just by accident. If we label them "Dog Ear," we are lying to ourselves.
- The "Guessing Game" Problem: The old methods assume that if a worker is shouting, the label we've guessed for them is automatically correct. They never double-check.
The Solution: SIEVE (Select–Hypothesize–Verify)
The authors propose a new framework called SIEVE. Think of it as a Scientific Detective Agency for AI neurons. Instead of just guessing, they follow a strict three-step process, inspired by how real scientists study the human brain.
Step 1: SELECT (The Filter)
- The Metaphor: Imagine you are looking for a specific type of musician in a crowded orchestra. You don't just listen to anyone making noise; you look for the violinist who only plays when the song is in a specific key.
- What they do: They scan the data to find neurons that are consistently excited by the same thing. If a neuron is excited by a dog's ear 99% of the time but also excited by a toaster 50% of the time, it's a "noisy" neuron. SIEVE filters these out. It only keeps the "pure" workers who have a clear, specific job.
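The filtering idea above can be sketched in a few lines of Python. Everything here is illustrative: the data, the thresholds, and the function names are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the SELECT step: keep only neurons whose
# activations concentrate on a single concept. Thresholds are
# illustrative, not values from the paper.

def select_consistent_neurons(activation_rates, purity=0.9, max_runner_up=0.3):
    """activation_rates: {neuron: {concept: fraction of that concept's
    images that strongly activate the neuron}}. Keep neurons with one
    dominant concept and no strong competitor."""
    selected = {}
    for neuron, rates in activation_rates.items():
        ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
        top_concept, top_rate = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
        if top_rate >= purity and runner_up <= max_runner_up:
            selected[neuron] = top_concept
    return selected

rates = {
    "n1": {"dog ear": 0.99, "toaster": 0.50},  # noisy: fires on toasters too
    "n2": {"dog ear": 0.97, "grass": 0.10},    # pure dog-ear specialist
}
print(select_consistent_neurons(rates))  # -> {'n2': 'dog ear'}
```

Note how `n1` matches the "busybody" from the text: a 99% dog-ear rate isn't enough if the neuron also fires on toasters half the time.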
Step 2: HYPOTHESIZE (The Detective's Theory)
- The Metaphor: Now that you've found the pure violinist, you look at the sheet music they are playing and say, "I bet this guy is the 'Sad Melody' specialist." That is your hypothesis.
- What they do: They take the images that made the neuron shout the loudest and use an AI (like a smart translator) to describe what those images have in common. "Oh, all these pictures have 'fluffy fur' and 'pointy ears'." They write down a label: "Fluffy Pointy Ears."
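A minimal sketch of that step, assuming we can query neuron activations and have some captioning model on hand; all of the names and stand-in functions below are hypothetical, not the paper's API.

```python
# Hypothetical sketch of the HYPOTHESIZE step: collect the images that
# make the neuron shout loudest and ask a captioner what they share.

def hypothesize_label(neuron_id, dataset, neuron_activation,
                      describe_common_features, top_k=5):
    # Rank the dataset by how strongly this neuron responds to each image.
    ranked = sorted(dataset,
                    key=lambda img: neuron_activation(neuron_id, img),
                    reverse=True)
    return describe_common_features(ranked[:top_k])

# Toy stand-ins so the sketch runs end to end.
images = ["husky", "toaster", "samoyed", "grass",
          "corgi", "pomeranian", "lamp", "collie"]

def fake_activation(neuron_id, img):
    dog_like = {"husky", "samoyed", "corgi", "pomeranian", "collie"}
    return 3.0 if img in dog_like else 0.2

def fake_captioner(exemplars):
    # A real vision-language model would summarize these into a label
    # like "fluffy pointy ears"; here we just show what was selected.
    return sorted(exemplars)

print(hypothesize_label("n7", images, fake_activation, fake_captioner))
# -> ['collie', 'corgi', 'husky', 'pomeranian', 'samoyed']
```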
Step 3: VERIFY (The Stress Test)
- The Metaphor: This is the most important part. Instead of just trusting your theory, you create a fake "Sad Melody" scenario and see if the violinist actually plays. If you play a happy song and the violinist stays silent, your theory was wrong.
- What they do: They take the label they just created (e.g., "Fluffy Pointy Ears") and use a text-to-image generator (like Midjourney or DALL-E) to create brand new pictures of fluffy pointy ears.
- They feed these new pictures to the AI.
- The Question: Does the neuron we labeled "Fluffy Pointy Ears" actually light up when it sees these new pictures?
- The Result: If the neuron stays silent, the label was a lie (a "mismatched concept"), and they throw it away. If the neuron screams "YES!", the label is verified as true.
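The verification loop above can be sketched as a simple accept/reject test. The generator, the activation function, and both thresholds below are stand-ins of my own, not the paper's actual setup.

```python
# Hypothetical sketch of the VERIFY step: generate fresh images of the
# candidate label and check whether the neuron actually fires on them.

def verify_label(neuron_id, label, generate_images, neuron_activation,
                 n_images=20, fire_threshold=1.0, accept_rate=0.5):
    """Return (activation rate on new images, verified?)."""
    images = generate_images(label, n_images)  # brand-new, unseen pictures
    fired = sum(neuron_activation(neuron_id, img) > fire_threshold
                for img in images)
    rate = fired / n_images
    return rate, rate >= accept_rate

# Toy stand-ins: this "neuron" only fires on curly-dense-coat images.
def fake_generator(label, n):
    return [label] * n

def fake_activation(neuron_id, img):
    return 5.0 if img == "curly dense coat" else 0.1

print(verify_label("n42", "small round beard", fake_generator, fake_activation))
# -> (0.0, False): label rejected
print(verify_label("n42", "curly dense coat", fake_generator, fake_activation))
# -> (1.0, True): label verified
```

The key design point is that the test images are generated from the label, not drawn from the original dataset, so the neuron can't pass just by memorizing its old favorites.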
Why This Matters (The "Aha!" Moment)
In the paper's example, they looked at a neuron that seemed to be about "Small Round Beards."
- Old Method: "It's a beard neuron! Done."
- SIEVE Method: They generated pictures of beards. The neuron didn't react much (Activation Rate: 0.26). Verdict: Wrong.
- They tried "Curly Dense Coat." The neuron went wild (Activation Rate: 0.98). Verdict: Correct.
The Bottom Line
The authors found that their new method, SIEVE, produces neuron labels that are 1.5 times more accurate than the current best methods.
In simple terms:
Old methods are like fortune tellers who guess what a neuron does based on a quick glance.
This new method is like engineers who build a test, run the machine, and only accept the answer if the machine proves it works. It stops us from trusting "hallucinations" and ensures that when we say an AI understands "cats," it actually does.