The Big Problem: The "Black Box" Artist
Imagine you have a brilliant artist (a Visual Classifier) who can look at a photo and instantly tell you if it's a "goldfish" or a "shark." They are incredibly accurate. But there's a catch: they are a Black Box. They give you the answer, but they won't tell you why. Did they see the fins? The color? The shape of the tail? You have no idea.
In the world of AI, we want to know why the model made a decision. This is where Concept Bottleneck Models (CBMs) come in. Instead of just saying "Goldfish," a CBM tries to say: "I see fins, I see orange scales, I see water, therefore it is a Goldfish." This makes the AI explainable.
The Old Way: The Expensive, Biased Translator
Previously, to make these "explainable" models, researchers had to route them through a giant, pre-trained vision-language model called CLIP, which acts like a translator between images and words.
- The Analogy: Imagine you have your local artist (the legacy model). To make them explainable, you force them to speak through a giant, expensive, global translator (CLIP).
- The Problem:
- Cost: CLIP is huge and requires massive computing power.
- Bias: The translator has its own personality. If the translator thinks "shark" always means "danger," your local artist might start thinking that too, even if they didn't originally. You lose the artist's unique style.
- Manual Labor: Sometimes, you had to hire humans to manually label every single picture with concepts (e.g., "this has fins"), which is slow and expensive.
The New Solution: "TextUnlock"
The authors of this paper invented a new method called TextUnlock. They wanted to make any existing AI model explainable without using the giant CLIP translator, without hiring humans, and without slowing down the model.
Think of it like teaching your local artist a new language without forcing them to use a dictionary.
How It Works (The Magic Trick)
- The Setup: You have your "Black Box" artist (the frozen classifier) and a "Text Encoder" (a tool that turns words like "goldfish" into numbers).
- The Bridge (The MLP): They build a tiny, lightweight bridge (a small neural network) between the artist's brain and the text numbers.
- The Training (The "Ghost" Teacher):
- Normally, to train a model, you need the right answers (labels).
- The Trick: They didn't use labels. Instead, they asked the original artist what they thought the answer was.
- They told the bridge: "Make sure that when you translate the image into text-numbers, the result looks exactly like what the original artist thought."
- It's like a student copying a master's handwriting. The student doesn't need to know what they are writing; they just need to match the master's style perfectly.
- The Result: Now, the artist's vision is perfectly aligned with the text numbers.
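To make the training idea above concrete, here is a minimal numpy sketch of the label-free objective: the bridge projects the frozen model's features into text-embedding space, and a distillation loss pushes its distribution over class-name embeddings to match the frozen model's own prediction. Every shape, name, and random placeholder here is an illustrative assumption, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Placeholder stand-ins (assumptions, not real encoders): a frozen
# classifier's logits over 5 classes, its 512-d image features, and
# 128-d text embeddings for the 5 class names.
n_classes, feat_dim, txt_dim = 5, 512, 128
teacher_logits = rng.normal(size=(1, n_classes))        # frozen model's own answer
image_features = rng.normal(size=(1, feat_dim))         # frozen model's features
class_text_emb = rng.normal(size=(n_classes, txt_dim))  # text encoder per class name

# The lightweight "bridge": here just one linear map into text space.
W = rng.normal(size=(feat_dim, txt_dim)) * 0.01

def bridge_logits(features, W, text_emb):
    z = features @ W                                    # project into text space
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)   # cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return z @ t.T

# Label-free objective: make the bridge's distribution over class-name
# embeddings mimic the frozen model's own softmax (a distillation-style KL).
teacher_p = softmax(teacher_logits)
student_p = softmax(bridge_logits(image_features, W, class_text_emb))
kl = float(np.sum(teacher_p * (np.log(teacher_p) - np.log(student_p))))
print(f"distillation loss: {kl:.4f}")
```

In a real setup this loss would be minimized by gradient descent over the bridge's weights while the classifier and text encoder stay frozen; the point of the sketch is only that no human labels appear anywhere, just the teacher's own predictions.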
Making it "Concept Bottleneck" (The Explainable Part)
Once the bridge is built, the model can do two amazing things:
1. The "Concept Detective" (Concept Discovery)
Because the artist now speaks the "text language," you can ask it questions it wasn't originally trained to answer.
- Question: "Does this image have 'fins'?"
- Process: You take the word "fins," turn it into numbers, and ask the artist: "How much does this image look like 'fins'?"
- Result: The model gives you a score. It found the concept! It did this without ever being shown a picture labeled "fins." It just understood the concept because it learned the semantic space of the class names.
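The scoring step above is just a cosine similarity in the shared text space. A tiny sketch, with hypothetical placeholder embeddings standing in for the real bridged image embedding and text-encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Hypothetical stand-ins: after the bridge, the image embedding lives in
# the same 128-d space as the text encoder, so any word can be scored.
txt_dim = 128
image_emb = rng.normal(size=txt_dim)       # image, after the trained bridge
concept_embs = {                           # text encoder applied to concept words
    "fins":   rng.normal(size=txt_dim),
    "scales": rng.normal(size=txt_dim),
    "wheels": rng.normal(size=txt_dim),
}

# Concept discovery: one similarity score per word, no concept labels needed.
scores = {name: cosine(image_emb, emb) for name, emb in concept_embs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:7s} {s:+.3f}")
```

Notice that the vocabulary is open-ended: you could score "fins" today and "whiskers" tomorrow without touching the model again.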
2. The "Unsupervised Translator" (No Linear Probe)
Usually, you need to train a separate layer to turn those "fins" and "scales" scores back into a "Goldfish" prediction.
- The Innovation: The authors realized they could just calculate this mathematically using the text words themselves. They didn't need to train a new layer. It's like realizing you can solve a math problem in your head without needing a calculator.
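One plausible way to picture "solving it in your head" is below: derive the concept-to-class weights from text similarity alone, with no trained layer. The shapes, values, and the exact weighting scheme are illustrative assumptions, not the paper's formula.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings: 4 concept words and 2 class names, all produced
# by the same text encoder (values are random placeholders).
txt_dim = 128
concept_emb = normalize(rng.normal(size=(4, txt_dim)))  # "fins", "scales", ...
class_emb   = normalize(rng.normal(size=(2, txt_dim)))  # "goldfish", "shark"
concept_scores = rng.normal(size=(1, 4))                # image's per-concept scores

# Instead of training a linear probe, read the concept->class weights
# straight off the text: how similar is each concept word to each class name?
text_weights = concept_emb @ class_emb.T       # (4 concepts, 2 classes)
class_logits = concept_scores @ text_weights   # weighted vote, zero training
pred = int(np.argmax(class_logits))
print("predicted class index:", pred)
```

The design point is that both the "concept detector" and the "concept-to-class" step come from the same text space, so the whole explainable pipeline needs no extra supervised training.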
Why This is a Big Deal (The Superpowers)
The paper tested this on 40 different AI models (from simple ones to complex ones) and found:
- It's CLIP-Free: It doesn't need the giant, expensive translator. It works with any model you already have.
- It's Label-Free: It doesn't need humans to label data. It learns by listening to the model's own predictions.
- It's Unsupervised: It figures out how to turn concepts into final answers without extra training.
- It's Better: Surprisingly, this method actually performed better than the expensive, supervised CLIP-based methods.
- Zero-Shot Captioning: They even used this to make the models write descriptions of images (like "A dog playing with a ball") without ever being taught to write sentences.
The "Drake" Problem (A Small Limitation)
The authors admit one funny flaw. Because the model learns from the names of things, it can get confused by words with double meanings.
- Example: If the class is "Drake" (the bird), the model might get confused with "Drake" (the rapper) because the text encoder knows the rapper is more famous.
- The Fix: They found this happens very rarely and doesn't really hurt the final answer, but it's something to watch out for.
Summary Analogy
Imagine you have a Master Chef who makes the best soup but refuses to share the recipe.
- Old Way: You hire a famous food critic (CLIP) to taste the soup and guess the ingredients. But the critic has weird tastes and charges a fortune.
- New Way (TextUnlock): You build a tiny translator that listens to the Chef's internal thoughts. You teach the translator to mimic the Chef's "flavor profile." Suddenly, the translator can tell you, "The Chef used carrots and cumin," even though the Chef never said it. And the Chef still makes the soup exactly the same way, tasting just as good as before.
This paper gives us a way to open the "Black Box" of AI, make it explainable, and do it all for free, without needing the expensive tools everyone else uses.