Imagine you've built a super-smart, autonomous robot assistant. This isn't just a chatbot that answers questions; it's an "Agentic AI." It can plan a trip, book flights, order groceries, and even write code to fix a bug on its own. It thinks, it acts, and it uses tools.
But here's the problem: Robots are messy. Sometimes they get confused, sometimes they break the tools they use, and sometimes they just stop working entirely.
This paper is like a detective's case file on why these robot assistants fail. The authors didn't just guess; they looked at over 13,000 real-world complaints (like bug reports on GitHub) from 40 different robot projects. They zoomed in on 385 specific failures to figure out exactly what went wrong, how it showed up, and why it happened.
Here is the breakdown of their findings, explained simply with some analogies.
1. The Big Picture: It's a "Hybrid" Disaster
Traditional software (like a calculator) fails because of a typo in the code. Pure AI (like a creative writer) fails because it hallucinates nonsense.
Agentic AI is a hybrid. It's like hiring a genius architect (the AI) who has to work with a clumsy construction crew (the software tools).
- The architect might give a perfect plan.
- But if the crew doesn't have the right hammer (a library update), or if the architect forgets to check the blueprint (a token limit), the whole building collapses.
The paper found that failures happen at the intersection of these two worlds. It's not just "bad code" or "bad AI"; it's the messy handshake between them.
2. The Taxonomy: The "Five Rooms" of Failure
The authors organized all the failures into five main "rooms" where things go wrong. Think of the robot as a house:
Room 1: The Brain (Cognition & Orchestration)
- What happens: The robot's "brain" (the Large Language Model) gets confused. Maybe it's talking to the wrong person, or it's trying to speak a language the tool doesn't understand.
- Analogy: The architect gives the crew a blueprint written in a dead language. The crew tries to build it anyway and fails.
- Common issues: Wrong settings, expired credentials (like API tokens), or the robot getting stuck in an infinite loop of thinking.
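That "infinite loop of thinking" failure is often tamed with a simple step budget. Here's a minimal sketch of the idea; `think_and_act` is a hypothetical stand-in for one reasoning step of an agent, not a real framework API:

```python
# Sketch: a hard step budget so a confused agent can't loop forever.
# `think_and_act(task, history)` is a hypothetical reasoning step that
# returns a dict like {"done": bool, "answer": ...}.

MAX_STEPS = 10  # hard budget on reasoning iterations

def run_agent(task, think_and_act):
    history = []
    for step in range(MAX_STEPS):
        result = think_and_act(task, history)
        history.append(result)
        if result.get("done"):
            return result["answer"]
    # Fail loudly instead of spinning silently
    raise RuntimeError(f"Agent exceeded {MAX_STEPS} steps without finishing")
```

The point is the explicit budget: without it, a looping "brain" just burns tokens forever with no visible symptom.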
Room 2: The Hands (Tooling & Actuation)
- What happens: The robot tries to use a tool (like a web browser or a database) but uses it wrong.
- Analogy: The robot tries to open a door with a spoon instead of a key. Or it tries to plug a US charger into a UK socket.
- Common issues: Wrong API calls, permission denied, or connecting to the wrong server.
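One common defense against "wrong API calls" is validating the agent's proposed tool call against the tool's declared parameters before executing it. A rough sketch, using an invented schema format (real frameworks each have their own):

```python
def validate_call(tool_schema, args):
    """Check an agent-proposed tool call before running it.

    Schema format is invented for illustration:
    {"required": [...], "allowed": [...]}
    """
    missing = [p for p in tool_schema["required"] if p not in args]
    unknown = [a for a in args if a not in tool_schema["allowed"]]
    if missing or unknown:
        raise ValueError(f"Bad tool call: missing={missing}, unknown={unknown}")
    return True
```

Rejecting the call up front turns "the robot opened the door with a spoon" into a clear, loggable error instead of a silent misuse.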
Room 3: The Memory (Perception & Context)
- What happens: The robot forgets what it just did, or it remembers things that never happened.
- Analogy: You're having a conversation, and halfway through, the robot forgets your name or thinks you said something you didn't.
- Common issues: Losing track of the conversation history, saving data to the wrong file, or mixing up time zones.
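A common defensive pattern for these memory problems is capping how much history the agent carries. Here's a rough sketch that drops the oldest messages first, using word count as a crude stand-in for tokens (a real agent would use the model's tokenizer):

```python
def trim_history(messages, budget=1000):
    """Keep the most recent messages whose combined word count fits the budget.

    Word count is a crude proxy for tokens; drops oldest messages first.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg.split())
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Without some explicit trimming policy, the history silently overflows the context window and the robot "forgets your name" mid-conversation.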
Room 4: The Foundation (Runtime & Environment)
- What happens: The robot can't even start because the "ground" it's standing on is shaky.
- Analogy: The house is built on a swamp. The robot tries to run, but the floorboards (dependencies) are missing or rotting.
- Common issues: Missing software libraries, incompatible operating systems, or installation errors. (This was the #1 cause of failure!)
Room 5: The Dashboard (Reliability & Observability)
- What happens: The robot breaks, but the dashboard says "Everything is fine!"
- Analogy: The car engine is on fire, but the "Check Engine" light is broken. You don't know it's broken until the car explodes.
- Common issues: Bad error messages, missing logs, or the robot hiding its mistakes.
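The "robot hiding its mistakes" failure usually comes from code that catches an exception and quietly moves on. A sketch of the fix: log the full traceback before deciding what to do (the function names here are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.tools")

def call_tool(tool, *args):
    """Run a tool and surface failures instead of hiding them."""
    try:
        return tool(*args)
    except Exception:
        # The anti-pattern is `except Exception: return None`.
        # Instead, record the full traceback so the "dashboard"
        # reflects what actually broke, then re-raise.
        log.exception("Tool %s failed with args %r",
                      getattr(tool, "__name__", tool), args)
        raise
```

Re-raising matters: logging alone keeps the engine fire visible, but swallowing the exception would still leave the "Check Engine" light dark.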
3. The "Domino Effect" (Fault Propagation)
The most interesting part of the paper is how the authors traced the way a small mistake turns into a big disaster. They used a method called "Association Rule Mining" (basically, looking for patterns like "If X happens, Y usually follows").
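At its core, association rule mining is just counting co-occurrence. A toy sketch of the idea, run on made-up failure records (the data below is invented for illustration, not taken from the paper):

```python
def confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    of the records containing the antecedent, what fraction
    also contain the consequent?"""
    with_a = [r for r in records if antecedent in r]
    if not with_a:
        return 0.0
    return sum(consequent in r for r in with_a) / len(with_a)

# Invented example: each set lists the symptoms/causes seen in one bug report
records = [
    {"token_invalid", "refresh_bug"},
    {"token_invalid", "refresh_bug"},
    {"timezone_error", "naive_datetime"},
    {"token_invalid", "refresh_bug", "timezone_error"},
]
```

With this toy data, `confidence(records, "token_invalid", "refresh_bug")` comes out as 1.0, which is what a "near-perfect domino chain" looks like numerically.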
They found some near-perfect domino chains:
- The Token Trap: If the robot sees a "Token Invalid" error, it is almost 100% certain that the code responsible for refreshing credentials is broken.
- The Time Traveler: If the robot messes up a date or time, it's almost always because the code mixed up "naive" time (no time zone) with "aware" time (with time zone).
- The Memory Leak: If the robot starts acting weird after a few hours, it's usually because it forgot to clean up its memory, causing a slow crash.
The Lesson: You don't need to guess. If you see Symptom A, you can almost immediately check Cause B.
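The "Time Traveler" chain is easy to reproduce in Python: naive and aware datetimes can't even be compared, and mixing them is exactly the bug the paper describes. A minimal demonstration:

```python
from datetime import datetime, timezone

naive = datetime(2024, 1, 1, 12, 0)                        # no timezone attached
aware = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)   # timezone-aware

# Comparing them raises TypeError: can't compare offset-naive
# and offset-aware datetimes
try:
    naive < aware
except TypeError as e:
    print("Bug surfaced:", e)

# The fix: attach (or convert to) an explicit timezone everywhere
fixed = naive.replace(tzinfo=timezone.utc)
assert fixed == aware
```

Note that `==` between naive and aware values just returns `False` rather than raising, which is why this bug so often slips through unnoticed.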
4. Did Real Developers Agree?
The authors didn't just sit in a lab; they asked 145 real developers who build these robots.
- The Verdict: The developers said, "Yes, this is exactly what we deal with every day."
- The Score: They rated the paper's findings a 4 out of 5 on relevance.
- The Feedback: The developers added a few missing pieces, like "Multi-agent coordination" (when two robots argue with each other) and "Human-in-the-loop" (when a human has to approve a robot's action).
5. Why Does This Matter?
Before this paper, debugging an AI robot was like trying to fix a car while it's driving at 100mph in the dark. You didn't know which part was broken.
This paper gives us a map and a flashlight.
- It tells us where to look: Don't just blame the AI; check the dependencies and the memory.
- It tells us how to fix it: If the robot is stuck in a loop, check the "stop" button. If it's crashing on install, check the library versions.
- It tells us to build better: We need to build robots that are better at logging their mistakes and handling the messy real world.
The Bottom Line
Agentic AI is powerful, but it's fragile. It's a mix of smart thinking and clumsy engineering. By understanding the specific ways these systems break (from bad passwords to missing files), we can stop treating them like magic black boxes and start treating them like the complex, hybrid machines they really are. This makes them safer, more reliable, and easier to fix when they inevitably go wrong.