From Features to Actions: Explainability in Traditional and Agentic AI Systems

This paper argues that traditional attribution-based explainability methods, while effective for static predictions, fail to diagnose failures in agentic AI systems. It calls for a shift toward trace-based diagnostics, which reveal state-tracking inconsistencies as a primary cause of execution breakdowns.

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza

Published 2026-03-09

The Big Picture: From a Snapshot to a Movie

Imagine you are trying to understand why a car crashed.

Traditional AI is like looking at a single photograph of the car right after the crash. You can see the crumpled hood and the broken headlight. You can point to the damage and say, "Ah, the front bumper hit the tree." This is what current AI explainability tools (like SHAP or LIME) do. They look at the final answer and tell you which words or numbers in the input were most important for that specific result.
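The core idea behind attribution tools like SHAP and LIME can be sketched with a simple leave-one-out ablation: remove each input word, re-score, and credit the score drop to that word. The snippet below is a minimal illustration of that idea, not either library's actual algorithm; the keyword list and scoring function are invented for the example.

```python
def attribute(words, score_fn):
    """Leave-one-out attribution: a word's importance is the
    score drop when that word is removed from the input."""
    base = score_fn(words)
    return {w: base - score_fn([x for x in words if x != w]) for w in words}

# Toy classifier score: fraction of words that are "IT" keywords
# (keyword list invented for illustration).
IT_KEYWORDS = {"software", "python", "cloud"}
def toy_score(words):
    return sum(w in IT_KEYWORDS for w in words) / max(len(words), 1)

scores = attribute(["hiring", "software", "engineer"], toy_score)
# "software" gets the largest attribution: removing it lowers
# the toy IT-score the most.
```

Real SHAP values average over many such ablations with principled weighting, but the "snapshot" character is the same: the explanation lives entirely in the input-to-output mapping.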

Agentic AI (the new kind of AI) is like a full movie of the car driving for an hour before it crashed. The car didn't just hit the tree; it took a wrong turn, ignored a stop sign, got confused by a detour, and then finally crashed. If you only look at the final photo (the crash), you miss the whole story. You don't know why the driver took that wrong turn in the first place.

This paper argues that our current tools for explaining AI are stuck in the "photograph" era, but we need to upgrade to the "movie" era to understand these new, complex AI agents.


The Problem: The "Snapshot" vs. The "Journey"

1. The Old Way (Static Predictions)
Think of a traditional AI like a fortune teller. You give it a crystal ball (the input), and it tells you your future (the output).

  • The Explanation: If the fortune teller says "You will lose your job," a traditional explanation tool looks at the crystal ball and says, "It was because you mentioned 'layoffs' and 'budget cuts'."
  • The Flaw: This works fine for a one-time prediction. But it doesn't explain how the fortune teller got there if they had to ask you five questions, check a newspaper, and call a friend first.

2. The New Way (Agentic Systems)
Think of a modern AI agent like a travel agent planning a complex trip for you.

  • The Process: The agent doesn't just give you a ticket. It searches for flights, checks your passport validity, books a hotel, realizes the hotel is full, switches to a different one, tries to book a car, fails because the credit card is declined, and then tries a different card.
  • The Failure: If the trip fails, it's not because of one bad word in your request. It's because the agent forgot your passport number in step 3, or it picked the wrong credit card in step 5.
  • The Gap: Traditional tools try to explain the failure by looking at your original request ("You said 'cheap flight'"). But the real problem happened three steps ago when the agent got confused. The old tools can't see the "movie" of the journey; they only see the "snapshot" of the final failure.

The Experiment: Testing the Tools

The researchers ran two experiments to prove their point:

Experiment A: The Static Test (The Photo)
They used a simple AI to classify job postings as "IT" or "Non-IT."

  • Result: The traditional tools worked great. They could consistently point out which words (like "software" or "accounting") decided the answer. It was like a reliable photo analysis.

Experiment B: The Agentic Test (The Movie)
They used advanced AI agents to perform complex tasks, like booking an airline flight or navigating a website.

  • Result: The traditional tools failed miserably. They couldn't tell you why the agent failed to book the flight. They couldn't see that the agent had forgotten the passenger's name in step 2, leading to a crash in step 10.
  • The New Solution: The researchers used a new method called "Trace-Based Diagnostics." Instead of looking at the input, they watched the entire movie (the execution trace). They created a checklist (a "rubric") to see exactly where the agent messed up.
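The trace-based approach can be sketched as: record every step of the run, then score the whole trace against a rubric of named checks. Everything below — the step format, the rubric questions, the failure scenario — is a hypothetical sketch of the idea, not the authors' actual rubric.

```python
# A trace is a list of per-step records; a rubric is a set of
# named checks over the whole trace (not just the final output).
trace = [
    {"step": 1, "action": "search_flights", "state": {"passenger": "Ada"}},
    {"step": 2, "action": "select_flight",  "state": {}},  # name lost here
    {"step": 3, "action": "book_flight",    "state": {},
     "error": "missing passenger"},
]

rubric = {
    "kept_state": lambda t: all("passenger" in s["state"] for s in t),
    "no_errors":  lambda t: not any("error" in s for s in t),
}

def diagnose(trace, rubric):
    """Return the names of the rubric checks the trace fails."""
    return [name for name, check in rubric.items() if not check(trace)]

print(diagnose(trace, rubric))  # -> ['kept_state', 'no_errors']
```

The point of the structure: the diagnosis pinpoints *which* property broke (state was dropped at step 2), which no amount of staring at the original request could reveal.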

The Key Discoveries

1. The "State Drift" Problem
In the airline booking task, the agents often failed because they lost track of their own "memory."

  • Analogy: Imagine a chef cooking a complex meal. They chop the onions, then walk away to answer the phone. When they come back, they forget they already chopped the onions and chop them again, or they forget to add salt because they lost their place in the recipe.
  • Finding: The agents didn't fail because they were "dumb"; they failed because they lost track of the state of the world. Traditional tools couldn't see this "forgetfulness."
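One simple way to surface this "forgetfulness" is to diff the agent's tracked state between consecutive steps and flag keys that silently disappear. A hypothetical sketch — the state fields and values are invented, not taken from the paper's benchmark:

```python
def state_drift(states):
    """Flag keys that vanish between consecutive state snapshots —
    a crude proxy for the chef forgetting the onions were chopped."""
    drops = []
    for i in range(1, len(states)):
        lost = set(states[i - 1]) - set(states[i])
        if lost:
            drops.append((i, sorted(lost)))
    return drops

# Invented booking-agent states: the passport number is silently
# dropped at the third snapshot.
states = [
    {"passenger": "Ada", "passport": "X123"},
    {"passenger": "Ada", "passport": "X123", "flight": "LH 454"},
    {"passenger": "Ada", "flight": "LH 454"},
]
print(state_drift(states))  # -> [(2, ['passport'])]
```

An input-attribution tool has no view of these snapshots at all, which is exactly why it cannot explain this failure mode.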

2. The "Wrong Turn" Problem
In the web-navigation task, agents failed because they picked the wrong tool immediately.

  • Analogy: Imagine trying to open a door. If you pick the wrong key (the wrong tool) on the first try, you can't open the door, no matter how hard you try later.
  • Finding: These were "fast failures." One wrong decision early on doomed the whole mission.
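Because these failures are decided at step one, a trace-level check can classify them cheaply: compare the first tool call against the expected one. A hypothetical sketch (the tool names and expected action are invented):

```python
def classify_failure(trace, expected_first_tool):
    """'Fast failure' check: if the very first tool call is wrong,
    later steps cannot recover (the wrong-key analogy)."""
    first = trace[0]["tool"]
    if first != expected_first_tool:
        return (f"fast failure: chose '{first}' instead of "
                f"'{expected_first_tool}' at step 1")
    return "first step OK; look later in the trace"

# Invented web-navigation trace: the agent clicked a link when it
# should have used the search box.
trace = [{"tool": "click_link"}, {"tool": "fill_form"}]
print(classify_failure(trace, "search_box"))
```

A diagnostic like this separates the "doomed from the start" runs from the "drifted midway" runs, which call for very different fixes.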

3. The "Minimal Explanation Packet" (MEP)
The authors propose a new standard for explaining AI. Instead of just giving a reason, we need a packet that includes:

  • The Artifact: The explanation (e.g., "The agent failed").
  • The Evidence: The proof (e.g., "Here is the log showing the agent forgot the passenger's name at 2:03 PM").
  • The Verification: A check to make sure the explanation is true (e.g., "We replayed the video, and yes, the agent did forget the name").
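The three parts of the packet map naturally onto a small record type. The sketch below is one hypothetical shape such a packet might take; the field names and the verification closure are ours, not a specification from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MinimalExplanationPacket:
    """Sketch of an MEP: an explanation is only trusted if its
    evidence passes an independent verification check."""
    artifact: str                          # the explanation itself
    evidence: List[str]                    # e.g. log lines backing it up
    verify: Callable[[List[str]], bool]    # replay / consistency check

    def is_verified(self) -> bool:
        return self.verify(self.evidence)

mep = MinimalExplanationPacket(
    artifact="Agent dropped the passenger name before booking",
    evidence=["14:03 state={'flight': 'LH 454'}  # 'passenger' missing"],
    verify=lambda logs: any("'passenger' missing" in line for line in logs),
)
print(mep.is_verified())  # -> True
```

The design choice worth noting is the third field: by bundling a check alongside the claim, an explanation can be re-run and falsified, rather than taken on faith.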

Why This Matters

If you are a doctor using an AI to diagnose a patient, or a bank using an AI to approve loans, you don't just want to know what the AI decided. You need to know how it got there.

  • Old AI: "I denied your loan because your credit score was low." (Snapshot)
  • New AI (with this paper's method): "I denied your loan because I tried to call your bank, got a timeout error, assumed you had no income, and then made a decision based on that wrong assumption. Here is the log of the error." (Movie)

The Takeaway

We are moving from an era of Static Predictors (AI that answers a single question) to Agentic Systems (AI that takes actions over time).

To trust these new agents, we can't just look at the final answer. We need to watch the whole movie, check the script, and verify the actor's memory at every step. The paper provides the tools to build that "movie camera" for AI, ensuring that when things go wrong, we know exactly where and why the plot twisted.
