Learning Concept Bottleneck Models from Mechanistic Explanations

This paper introduces Mechanistic CBM (M-CBM), a pipeline that extracts concepts directly from black-box models using Sparse Autoencoders and Multimodal LLMs. The resulting interpretable Concept Bottleneck Models outperform prior methods in predictive accuracy and explanation quality while keeping strict control over information leakage.

Antonio De Santis, Schrasing Tong, Marco Brambilla, Lalana Kagal

Published 2026-03-10

Imagine you have a brilliant but mysterious chef (the Black-Box AI) who can cook a perfect meal every time. You ask, "How did you make this taste so good?" The chef just shrugs and says, "I just know." This is how most powerful AI models work today: they give great answers, but we don't know why.

To fix this, scientists created Concept Bottleneck Models (CBMs). Think of these as forcing the chef to write down a recipe before cooking. Instead of just saying "Delicious Soup," the chef must first say, "I used carrots, onions, and salt," and then combine them to make the soup. This makes the AI transparent.

The Problem with Old Recipes
The problem with previous methods was that humans had to guess the recipe ingredients. They might say, "Maybe the chef used 'spiciness' or 'crunchiness'?" But what if the chef actually used "a specific type of basil" or "a pinch of sea salt"? If the human guesses the wrong ingredients, the recipe explanation is useless, and the chef can't cook as well.

The New Solution: M-CBM (The Mechanistic Chef)
This paper introduces a new method called M-CBM. Instead of guessing the ingredients, M-CBM asks the chef to reveal their own internal thought process.

Here is how it works, step-by-step, using a simple analogy:

1. The "X-Ray" (Sparse Autoencoders)

Imagine the chef's brain is a giant, messy library where thousands of ideas are mixed together in the same book. It's hard to read.
The researchers use a tool called a Sparse Autoencoder (SAE). Think of this as a magical X-ray that sorts the library. It takes the messy mix of ideas and separates them into individual, neat drawers. Each drawer now contains just one specific thing the chef cares about, like "redness," "stripes," or "a specific type of leaf."
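The "X-ray" above can be sketched in a few lines. This is a toy sparse autoencoder, not the paper's exact architecture: the sizes, weights, and penalty coefficient are illustrative, and real SAEs are trained on millions of activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy sparse autoencoder: expand a dense activation vector into many
# "drawers" (latents), then reconstruct the original vector.
d_model, d_latent = 16, 64          # hypothetical sizes, far smaller than real SAEs
W_enc = rng.normal(0, 0.1, (d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0, 0.1, (d_latent, d_model))

def encode(h):
    # ReLU zeroes out most latents, so only a few "drawers" open per input
    return relu(h @ W_enc + b_enc)

def decode(z):
    return z @ W_dec

h = rng.normal(size=d_model)        # a dense model activation (messy library)
z = encode(h)                       # sparse code: one entry per drawer
h_hat = decode(z)

# Training would minimize reconstruction error plus an L1 sparsity penalty:
loss = np.mean((h - h_hat) ** 2) + 1e-3 * np.abs(z).sum()
```

The L1 penalty is what pushes each drawer to hold one clean idea instead of a mix.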

2. The "Translator" (Multimodal LLM)

Now we have 2,000 neat drawers, but they are just labeled with numbers. We need to know what's inside.
The researchers use a Multimodal Large Language Model (MLLM)—basically a super-smart robot that can see pictures and read text. They show the robot a few pictures that make "Drawer #42" light up and a few that don't. The robot looks at the pictures and says, "Ah, this drawer is for 'a bird with a yellow beak'."
The robot gives a human-readable name to every single drawer.
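The translation step boils down to: for each drawer, find the images that open it widest, then ask the MLLM what they share. In this sketch the image selection is real NumPy code, while the MLLM call is a hypothetical placeholder function (the data is random for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: sparse codes for 100 images across 2000 latents ("drawers")
n_images, n_latents = 100, 2000
codes = np.maximum(rng.normal(size=(n_images, n_latents)), 0.0)

def top_activating_images(latent_id, k=5):
    """Indices of the k images that most strongly activate one latent."""
    return np.argsort(codes[:, latent_id])[::-1][:k]

def ask_mllm_for_label(image_ids):
    # Placeholder for the real multimodal-LLM call, which would be shown
    # the actual images and asked what they have in common.
    return f"concept shared by images {image_ids.tolist()}"

drawer = 42
examples = top_activating_images(drawer)
label = ask_mllm_for_label(examples)
```

Contrasting these top-activating images with non-activating ones, as the paper's analogy suggests, helps the MLLM name the drawer precisely rather than generically.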

3. The "Quality Control" (Annotation)

Just because the robot gave a name doesn't mean it's perfect. So, the robot acts as a quality inspector. It looks at a sample of photos and checks: "Does this photo have a 'yellow beak'? Yes or No?" It creates a clean list of facts for the AI to learn from.
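The quality-control pass produces a binary annotation matrix: one row per image, one column per named concept. Here the yes/no answers are simulated with coin flips; in the actual pipeline they come from the MLLM inspecting each photo.

```python
import numpy as np

rng = np.random.default_rng(2)

images = ["img_0", "img_1", "img_2"]                  # hypothetical sample
concepts = ["yellow beak", "feathers", "stripes"]     # hypothetical drawer names

def mllm_has_concept(image, concept):
    # Placeholder for the real vision-language check "does this photo show X?"
    return bool(rng.integers(0, 2))

annotations = np.array(
    [[mllm_has_concept(img, c) for c in concepts] for img in images],
    dtype=int,
)
# annotations[i, j] == 1 means image i was judged to show concept j
```

This clean matrix is the "list of facts" the bottleneck model is trained against.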

4. The Final Recipe (The Bottleneck)

Finally, they train the AI to use these specific, verified facts (the drawers) to make its decision.

  • Old Way: "I think it's a bird because... well, it looks like a bird." (Vague, prone to cheating).
  • M-CBM Way: "I see a 'yellow beak' (Fact 1) and 'feathers' (Fact 2). Therefore, it is a bird." (Clear, honest, and accurate).
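The "M-CBM way" corresponds to a bottleneck head: the final label is a simple linear (here, logistic) function of the named concept scores and nothing else, so every prediction traces back to facts a human can read. The weights below are illustrative, not learned.

```python
import numpy as np

concepts = ["yellow beak", "feathers", "wheels"]
weights = np.array([2.0, 1.5, -3.0])   # hypothetical weights for class "bird"
bias = -1.0

def predict_bird(concept_scores):
    # The decision uses ONLY concept scores -- this is the bottleneck
    logit = concept_scores @ weights + bias
    return 1.0 / (1.0 + np.exp(-logit))  # probability of "bird"

scores = np.array([1.0, 1.0, 0.0])     # yellow beak + feathers, no wheels
p = predict_bird(scores)               # high probability: "it's a bird"
```

Because the head is linear, the explanation is literally the weighted list of active concepts, which is what prevents the model from "hiding secrets" in an opaque representation.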

Why is this a big deal?

The paper introduces a new way to measure how "concise" the explanation is, called NCC (Number of Contributing Concepts).

  • Imagine you are explaining a crime.
  • Bad Explanation: "The suspect did it because of 500 different reasons, including the weather, the moon phase, and the color of their socks." (Too many reasons, confusing).
  • Good Explanation: "The suspect did it because they were at the scene and had the weapon." (Just 2 reasons, clear).
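A minimal way to count contributing concepts, in the spirit of the NCC idea above (the paper's exact definition may differ): take each concept's contribution to the decision, and count how many exceed a small threshold.

```python
import numpy as np

# Hypothetical weights and activations for one prediction
weights = np.array([2.0, 1.5, 0.01, 0.0, -0.02])
scores = np.array([1.0, 1.0, 1.0, 1.0, 1.0])

contributions = np.abs(weights * scores)   # each concept's pull on the decision
ncc = int((contributions > 0.1).sum())     # concepts that actually matter
```

With these toy numbers only the first two concepts contribute meaningfully, so the explanation is "2 reasons," not 500.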

M-CBM forces the AI to find the fewest number of "drawers" needed to make the right decision. The results show that this method is:

  1. More Accurate: It predicts better than other "explainable" AI models.
  2. More Honest: It doesn't cheat by hiding secrets in the explanation.
  3. More Human-Friendly: The explanations are actual concepts humans understand (like "stripes" or "blue eyes"), not just opaque numbers.

In a Nutshell:
Instead of guessing what the AI is thinking, M-CBM opens the AI's "brain," sorts its thoughts into neat, labeled boxes, and forces it to explain its decisions using only those boxes. It turns a mysterious black box into a transparent, honest chef who can actually tell you what's in the soup.