Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

This paper introduces Context-Aware Layer-wise Integrated Gradients (CA-LIG), a unified hierarchical attribution framework for interpreting Transformer models. CA-LIG computes layer-wise Integrated Gradients and fuses them with class-specific attention gradients, capturing context-sensitive relevance as it evolves across layers and yielding more faithful, semantically coherent explanations than existing methods.

Melkamu Abay Mersha, Jugal Kalita

Published 2026-02-19

Imagine you have a super-smart robot that can read a book, watch a movie, or look at a photo and tell you exactly what it thinks about it. Maybe it says, "This movie is terrible!" or "This picture is a cat!"

The problem is, this robot is a black box. It gives you the answer, but it doesn't tell you why. It's like a chef who hands you a delicious soup but refuses to tell you what ingredients went into it or how they cooked it. You just have to trust them.

In the world of Artificial Intelligence (AI), these robots are called Transformers. They are incredibly powerful, but because they have so many layers of "thinking" (like a multi-story building), it's very hard to figure out which specific words or pixels made them make their decision.

This paper introduces a new tool called CA-LIG (Context-Aware Layer-wise Integrated Gradients). Think of it as a super-powered detective that doesn't just look at the final answer; it investigates the entire building, floor by floor, to see exactly how the robot reached its conclusion.

Here is how CA-LIG works, broken down into simple concepts:

1. The Problem with Old Detectors

Previous methods of explaining AI were like looking at a building through a telescope from the street.

  • The "Final Layer" Bias: Most old tools only looked at the very top floor (the final answer). They missed all the interesting work happening on the lower floors.
  • The "Attention" Trap: Some tools just looked at where the robot was "looking" (attention). But sometimes, the robot looks at a word just to keep its balance, not because it's important. It's like a person staring at a clock while thinking about a birthday party; the clock isn't the reason for the party.
  • Missing the Context: Old tools often treated words as isolated islands. They didn't understand that the word "not" changes the meaning of the word "good" completely.

2. The CA-LIG Solution: A Layer-by-Layer Tour

CA-LIG is different because it acts like a tour guide who walks you through every single floor of the robot's brain.

Step 1: The "What If" Game (Integrated Gradients)

Imagine you have a sentence: "The movie was amazing."
CA-LIG plays a game of "What if?" It gradually fades the sentence toward a blank version (like "The movie was... uh...") and watches how the robot's confidence changes at every single step along the way.

  • If the robot's confidence drops when it removes "amazing," then "amazing" is a key ingredient.
  • It does this for every single floor of the building, not just the top one. This shows how the meaning of "amazing" evolves as it travels up the layers.
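In code, the "What if" game is Integrated Gradients: average the model's gradient along a straight path from a blank baseline to the real input, then scale by the difference. Here is a minimal NumPy sketch using a toy linear scorer in place of a real Transformer — all names and numbers are illustrative, not from the paper:

```python
import numpy as np

def model(x, w):
    # Toy "confidence" score for a class: a simple dot product.
    return float(x @ w)

def integrated_gradients(x, baseline, w, steps=50):
    # Riemann-sum approximation of the path integral of gradients
    # from baseline to x. For this linear model the gradient of
    # x @ w is just w (constant); a real model would use autograd.
    total = np.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)  # faded input
        grad = w
        total += grad
    return (x - baseline) * total / steps

x = np.array([0.2, 0.9, 0.1])   # toy embeddings: "the", "amazing", "movie"
baseline = np.zeros_like(x)     # the "blank" version of the sentence
w = np.array([0.5, 2.0, 0.1])   # toy class weights

attr = integrated_gradients(x, baseline, w)
# Completeness axiom: attributions sum to f(x) - f(baseline).
assert np.isclose(attr.sum(), model(x, w) - model(baseline, w))
```

The completeness check at the end is what makes Integrated Gradients trustworthy: the per-word scores add up exactly to the change in the robot's confidence. CA-LIG repeats this per layer rather than only at the top.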

Step 2: The "Who is Talking to Whom" Map (Attention Gradients)

Transformers work by having different parts of the sentence talk to each other. CA-LIG doesn't just look at who is talking; it looks at how much that conversation matters for the final decision.

  • It combines the "What If" scores with the "Who is talking" map.
  • The Analogy: Imagine a courtroom. The "What If" game tells you which witness is important. The "Attention" map tells you which witness the judge is listening to. CA-LIG combines these to tell you exactly which testimony convinced the judge.
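The courtroom fusion can be sketched like this: weight each attention link by its (positive) class-specific gradient, total up how much decision-relevant attention each token received, and combine that with the "What if" scores. The exact fusion rule below is an illustrative assumption, not the paper's precise formula:

```python
import numpy as np

tokens = ["the", "movie", "was", "amazing"]
ig = np.array([0.02, 0.10, 0.01, 0.80])        # "what if" token scores

attn = np.array([[0.1, 0.2, 0.1, 0.6],         # who attends to whom
                 [0.1, 0.5, 0.1, 0.3],         # (rows attend to columns)
                 [0.2, 0.2, 0.3, 0.3],
                 [0.1, 0.2, 0.1, 0.6]])
attn_grad = np.array([[0.0, 0.1, 0.0, 0.5],    # d(class score)/d(attention);
                      [0.0, 0.2, 0.0, 0.4],    # a real model gets this from
                      [0.0, 0.1, 0.1, 0.3],    # backpropagation
                      [0.0, 0.1, 0.0, 0.6]])

# Keep only attention that actually pushes the decision (positive gradient),
# so "staring at the clock" attention gets zeroed out.
relevance = attn * np.maximum(attn_grad, 0)
attn_score = relevance.sum(axis=0)             # how much each token's
                                               # "testimony" mattered
fused = ig * attn_score / attn_score.sum()     # context-aware importance
```

With these toy numbers, "amazing" dominates the fused score because it is both influential in the "What if" game and the target of decision-relevant attention.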

Step 3: The Context Filter

This is the secret sauce. CA-LIG understands that words depend on their neighbors.

  • If the robot sees "The movie was not amazing," CA-LIG knows that "not" flips the meaning. It captures this relationship.
  • It creates a map where some words are Green (helping the decision) and some are Red (hurting the decision).
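The green/red context effect can be sketched with a toy scorer in which "not" flips the word it modifies; a simple leave-one-out attribution (a stand-in for the gradient-based one) then comes out positive for "amazing" on its own and negative once "not" precedes it. Everything here is illustrative, not the paper's model:

```python
def score(sent):
    # Toy context-sensitive sentiment: "not" negates what follows it.
    s, negate = 0.0, False
    for w in sent:
        if w == "not":
            negate = True
        elif w == "amazing":
            s += -1.0 if negate else 1.0
    return s

def token_attribution(sent):
    # Leave-one-out: how much does the score change if a word vanishes?
    base = score(sent)
    return {i: base - score(sent[:i] + sent[i + 1:]) for i in range(len(sent))}

plain = ["the", "movie", "was", "amazing"]
negated = ["the", "movie", "was", "not", "amazing"]

assert token_attribution(plain)[3] > 0     # "amazing" alone: green
assert token_attribution(negated)[4] < 0   # "not ... amazing": red
```

The same word gets opposite colors in the two sentences — exactly the neighbor-dependence that CA-LIG's context filter is built to capture.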

3. What Did They Find?

The researchers tested this new detective on many different tasks:

  • Reading Reviews: When a robot said a movie was bad, CA-LIG showed it wasn't just the word "bad" that decided it. It was the combination of "lame," "poor acting," and "worst" working together.
  • Detecting Hate Speech: In a low-resource language (Amharic), CA-LIG could spot hate speech by seeing how abusive words were connected, even if the sentence structure was complex.
  • Looking at Pictures: They even used it on a robot that looks at images (like cats vs. dogs). Instead of just highlighting the whole cat, CA-LIG highlighted the ears and eyes (the important parts) and ignored the background.

4. Why Does This Matter?

Think of CA-LIG as a transparent window into the robot's mind.

  • Trust: If a doctor uses an AI to diagnose a disease, they need to know why the AI said "cancer." CA-LIG shows the evidence, so the doctor can trust the result.
  • Debugging: If the AI makes a mistake, CA-LIG shows you exactly which floor of the building went wrong, helping engineers fix it.
  • Fairness: It helps us see if the AI is being biased (e.g., judging a person based on their name rather than their words).

The Bottom Line

The authors built a tool that stops treating AI like a magic black box. Instead of just guessing why the robot made a choice, CA-LIG walks us through the robot's thought process, floor by floor, showing us exactly how it connected the dots. It's a huge step toward making AI not just smart, but also honest and understandable.
