Imagine you have a very smart, but slightly mysterious, art critic named ViT (Vision Transformer). This critic is amazing at looking at a photo and telling you exactly what's in it—like spotting a "zebra" or a "traffic light." But here's the problem: if you ask the critic why they made that choice, they just stare back silently. They don't explain their reasoning.
This paper introduces a new tool called BiCAM (Bidirectional Class Activation Mapping) to act as a translator for this critic. It helps us understand not just what the critic likes, but also what they actively dislike when making a decision.
Here is the breakdown using simple analogies:
1. The Old Way: Only Listening to the "Yes"
Previously, when people tried to understand these AI critics, they used methods that only looked at the positive signals.
- The Analogy: Imagine the critic is a judge at a talent show. If the judge says, "I'm voting for the singer," the old methods would only highlight the singer's voice. They would ignore everything else in the room.
- The Problem: This is incomplete. A judge might vote for the singer because the background noise is terrible, or because the other contestants are bad. By ignoring the "negative" signals (what the judge is rejecting), the explanation feels half-baked and sometimes misleading.
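The "only listening to the Yes" behavior can be sketched in a few lines. This is a toy numpy illustration of the general idea (the values and shapes are made up for the example, not taken from the paper): classic CAM-style methods apply a ReLU, which zeroes out every negative contribution before the explanation is drawn.

```python
import numpy as np

# Hypothetical 2x2 "contribution map": positive values support the
# predicted class, negative values suppress it.
contrib = np.array([[ 0.9, -0.4],
                    [-0.7,  0.2]])

# Old-style explanation: ReLU keeps only supportive evidence.
# The -0.4 and -0.7 entries (the "No" signals) are silently dropped.
old_style_map = np.maximum(contrib, 0.0)

print(old_style_map)
```

Everything the model was rejecting is now indistinguishable from "no signal at all", which is exactly the incompleteness described above.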
2. The New Way: BiCAM (The "Yes" and "No" Translator)
BiCAM changes the game by listening to both the supportive evidence (the "Yes") and the suppressive evidence (the "No").
- The Analogy: BiCAM gives the critic a highlighter pen with two colors: Red and Blue.
- Red (Supportive): "I see a zebra here, and that's why I'm saying 'Zebra'."
- Blue (Suppressive): "I see a tiger in the background, and I am actively ignoring it to make my decision."
- Why it's cool: In a photo with a zebra and a tiger, old methods might just highlight the whole messy scene. BiCAM clearly says, "The red part is the zebra (the answer), and the blue part is the tiger (the thing I'm rejecting)." This creates a much clearer, "contrastive" picture of how the AI thinks.
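The two-color highlighter idea can be sketched by splitting the same contribution map into two separate maps instead of discarding the negative half. This is a minimal numpy sketch of the concept, not the paper's exact formulation:

```python
import numpy as np

# Same hypothetical contribution map as before: positive = supports
# the class, negative = suppresses it.
contrib = np.array([[ 0.9, -0.4],
                    [-0.7,  0.2]])

# BiCAM-style reading (sketch): keep both polarities as separate maps.
supportive  = np.maximum(contrib, 0.0)   # "red": evidence for the class
suppressive = np.maximum(-contrib, 0.0)  # "blue": evidence being rejected

# Nothing is lost: the two maps recombine into the original signal.
reconstructed = supportive - suppressive
```

The point of the sketch is the last line: the red and blue maps together carry all the information, whereas the old ReLU-only view threw half of it away.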
3. How It Works: The "Deep Dive" Strategy
The paper also explains how BiCAM finds these answers. It doesn't look at every single step the AI takes, because the early steps are just about basic shapes (like "is this a line or a curve?").
- The Analogy: Think of the AI's brain as a factory assembly line.
- Early Stations: Workers are just sorting raw materials (lines, colors).
- Late Stations: Workers are assembling the final product and making the final decision.
- BiCAM's Trick: It ignores the noisy early stations and only listens to the last few stations of the assembly line. This is where the "real" decision happens. By focusing only there, it avoids getting confused by background noise and gives a sharper answer.
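The "listen only to the last few stations" trick amounts to aggregating explanation maps from just the final layers. A toy sketch, assuming hypothetical per-layer relevance maps for a 12-layer ViT (the layer count and the choice to average are illustrative, not the paper's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

num_layers, h, w = 12, 4, 4
# Hypothetical per-layer relevance maps, one per transformer block.
layer_maps = rng.normal(size=(num_layers, h, w))

LAST_K = 3  # only the final "stations" of the assembly line

# Ignore the noisy early layers; average only the last few,
# where the class-level decision is actually formed.
late_map = layer_maps[-LAST_K:].mean(axis=0)
```

Averaging over a small `LAST_K` keeps the map sharp; folding in all 12 layers would blur it with the shape-and-edge noise from the early stations.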
4. The "Sniff Test" for Fake Photos (Adversarial Detection)
The authors also created a simple math trick called PNR (Positive-to-Negative Ratio). This is like a lie detector test for AI.
- The Analogy:
- Real Photos: When an AI looks at a real photo of a dog, it says, "Yes, dog!" (Red) and "No, cat!" (Blue). The balance between "Yes" and "No" is natural and organized.
- Fake/Attacked Photos: Hackers can create "adversarial examples"—photos that still look like a perfectly normal dog to a human, but carry carefully crafted, nearly invisible noise designed to trick the AI.
- The Result: When the AI looks at a fake photo, its "Yes" and "No" signals get scrambled. It might scream "YES!" everywhere, or get confused.
- The PNR Meter: BiCAM calculates the ratio of "Yes" to "No." If the ratio is weirdly off-balance (too much "Yes" or too much "No" in the wrong places), the PNR meter beeps: "This photo is likely a fake attack!"
- The Benefit: This detects hackers without needing to retrain the AI or use heavy computers. It's a lightweight, instant check.
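The PNR check above can be sketched as a simple ratio of total "Yes" mass to total "No" mass. This is an illustrative implementation under assumed details (the exact definition and the threshold value in the paper may differ; `THRESHOLD = 3.0` here is a made-up cutoff):

```python
import numpy as np

def pnr(contrib, eps=1e-8):
    """Positive-to-Negative Ratio (sketch): total supportive mass
    divided by total suppressive mass in a contribution map."""
    pos = np.maximum(contrib, 0.0).sum()
    neg = np.maximum(-contrib, 0.0).sum()
    return pos / (neg + eps)  # eps avoids division by zero

# Hypothetical maps: a clean photo has a natural mix of Yes and No,
# while an attacked one "screams YES everywhere".
clean    = np.array([[0.9, -0.4], [-0.3, 0.2]])
attacked = np.array([[0.9,  0.8], [ 0.7, 0.6]])

THRESHOLD = 3.0  # hypothetical alarm level for the "PNR meter"
is_suspicious = pnr(attacked) > THRESHOLD
```

Because the check is just a ratio over maps the explanation method already produces, it adds essentially no compute and requires no retraining, matching the "lightweight, instant check" described above.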
5. Why This Matters
- Trust: It makes AI less of a "black box." We can finally see the full reasoning, including what the AI is rejecting.
- Safety: It helps us spot when someone is trying to trick the AI, which is crucial for things like self-driving cars or medical diagnosis.
- Efficiency: It works fast and doesn't require the AI to learn anything new.
In a nutshell: BiCAM is like giving the AI a two-sided mirror. Instead of just showing us what the AI sees, it shows us what the AI sees and what it is actively ignoring. This makes the AI's decisions clearer, more accurate, and harder to fool.