Imagine you own a fleet of delivery robots, self-driving cars, or smart vacuum cleaners. They are out in the real world, doing their jobs. But sometimes, they crash, drop a package, or get stuck.
In the past, if a robot failed, a human engineer would have to watch the video of the crash, write down what happened, and try to figure out why. If you had 10,000 crashes, you'd need a team of people working for years just to sort through the logs. It's like trying to find a specific typo in a library of a million books by reading every single page.
This paper introduces a smart, automated way to solve that problem. Here is how it works, broken down into simple concepts:
1. The Problem: The "Needle in a Haystack"
Robots fail in messy, unpredictable ways. One robot might drop a cup because the floor was slippery; another might drop it because it was holding it too tight. If you just look at the raw video data, these look like thousands of different, unrelated accidents.
The goal is to stop looking at them as "10,000 separate mistakes" and start seeing them as "5 main types of mistakes." This list of mistake types is called a Taxonomy (think of it like a library's Dewey Decimal System for robot failures).
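To make the idea concrete, a taxonomy is just a small mapping from named failure types to the raw incidents that fall under them. The category names and incidents below are invented for illustration, not taken from the paper:

```python
# A taxonomy collapses thousands of raw incidents into a handful of
# named failure categories. Names and incidents here are invented
# examples, not the paper's actual taxonomy.
taxonomy = {
    "Slippery Grip Failures": [
        "dropped cup: floor was slippery",
        "dropped pot: handle too slippery",
    ],
    "Narrow Passage Confusion": [
        "stuck in doorway: misjudged width",
    ],
    "Battery Exhaustion": [
        "stopped mid-delivery: battery at 0%",
    ],
}

# Many incidents become a handful of categories you can act on.
num_categories = len(taxonomy)
num_incidents = sum(len(v) for v in taxonomy.values())
print(num_categories, "categories covering", num_incidents, "incidents")
```

In practice the incident lists would hold thousands of entries each; the point is that engineers only need to reason about a handful of category names.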
2. The Solution: The "AI Detective"
The authors built a system that acts like a super-smart detective. It doesn't need a human to tell it what to look for; it figures it out on its own. It works in three steps:
Step 1: The Highlight Reel (Downsampling)
Imagine a 30-minute video of a robot failing. Most of it is boring (the robot just walking). The failure happens in one second.
The system uses a "smart highlighter." It scans the video and only keeps the frames where things actually change or where the action gets interesting. It throws away the boring parts so the AI doesn't get overwhelmed.
Step 2: The Interview (Reasoning)
The system takes these "highlight reels" and asks a powerful AI (a Vision-Language Model) to act like a detective.
- The AI looks at the video.
- It asks itself: "What happened here? Why did the robot drop the pot? Was the floor wet? Did it slip? Did it grab the wrong handle?"
- The Result: Instead of just a video, the system now has a written story explaining the failure. "The robot dropped the pot because it tried to lift it by the handle, but the handle was too slippery."
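Steps 1 and 2 can be sketched in a few lines. This is a toy version, assuming frames are simple lists of pixel intensities; `describe_failure` is a hypothetical stand-in for a real Vision-Language Model call:

```python
# Sketch of Steps 1-2. Frames are simplified to lists of pixel
# intensities; describe_failure is a hypothetical stand-in for a
# real Vision-Language Model query.

def keep_interesting_frames(frames, threshold=10.0):
    """Keep a frame only if it differs enough from the last kept one."""
    kept = [frames[0]]
    for frame in frames[1:]:
        diff = sum(abs(a - b) for a, b in zip(frame, kept[-1])) / len(frame)
        if diff > threshold:
            kept.append(frame)
    return kept

def describe_failure(frames):
    # Hypothetical VLM call: a real system would send the highlight
    # reel plus a prompt like "What happened here, and why?"
    return f"analyzed {len(frames)} key frames"

# A "video": mostly identical boring frames, then one sudden change.
video = [[0, 0, 0]] * 5 + [[100, 100, 100]] * 5
highlights = keep_interesting_frames(video)
print(describe_failure(highlights))  # analyzed 2 key frames
```

The 10-frame video collapses to 2 key frames: the opening frame and the moment everything changes, which is exactly the part worth showing to the AI detective.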
Step 3: The Grouping Party (Clustering)
Now, the system has thousands of these written stories. It reads them all and starts grouping them together based on the meaning, not just the words.
- It puts all stories about "slippery handles" in one pile.
- It puts all stories about "misjudging narrow doorways" in another pile.
- It puts all stories about "running out of battery" in a third pile.
The result is a neat, organized list of failure categories (a Taxonomy) with names like "Slippery Grip Failures" or "Narrow Passage Confusion."
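The grouping step can be sketched as follows. A real system would embed each story with a language model so "meaning" is captured properly; here plain word overlap (Jaccard similarity) stands in for semantic similarity:

```python
# Sketch of Step 3: group failure stories by similarity. A real system
# would use sentence embeddings; word overlap stands in for "meaning".

def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster(stories, threshold=0.3):
    groups = []  # each group is a pile of similar stories
    for story in stories:
        for group in groups:
            if similarity(story, group[0]) >= threshold:
                group.append(story)
                break
        else:
            groups.append([story])  # no pile fits: start a new one
    return groups

stories = [
    "dropped the pot because the handle was slippery",
    "dropped the cup because the handle was slippery",
    "got stuck squeezing through a narrow doorway",
    "got stuck in a narrow hallway",
]
groups = cluster(stories)
print(len(groups))  # 2 piles: slippery handles, narrow passages
```

The two "slippery handle" stories land in one pile and the two "narrow passage" stories in another, even though no two stories are word-for-word identical.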
3. Why This Matters: Two Superpowers
Once the robot knows its "Top 5 Ways to Fail," it can do two amazing things:
A. The Early Warning System (Runtime Monitoring)
Imagine the robot is driving down the street. The system is watching it in real-time.
- Old way: The robot just drives until it crashes.
- New way: The system sees the robot approaching a glass door. It remembers, "Oh! We have a category called 'Glass Door Confusion' where robots often crash into invisible walls."
- Action: The system shouts, "Stop! This looks like a Glass Door Confusion!" and triggers a safety brake before the crash happens. It's like a co-pilot who knows exactly where the car usually gets into trouble.
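A minimal sketch of that co-pilot, assuming the system already has a text description of the current scene and a keyword profile per failure category (both invented here for illustration):

```python
# Sketch of runtime monitoring: compare a description of the current
# scene against known failure categories and warn on a close match.
# Category names, keywords, and the scene are invented examples.

FAILURE_CATEGORIES = {
    "Glass Door Confusion": {"glass", "door", "transparent", "invisible"},
    "Narrow Passage Confusion": {"narrow", "tight", "doorway", "hallway"},
}

def check_scene(description, threshold=2):
    """Return the matching failure category, or None if the scene looks safe."""
    words = set(description.lower().split())
    for name, keywords in FAILURE_CATEGORIES.items():
        if len(words & keywords) >= threshold:
            return name
    return None

scene = "robot approaching a glass door at full speed"
warning = check_scene(scene)
if warning:
    print(f"Stop! This looks like a {warning}")  # trigger the safety brake
```

A production monitor would score the match with the same embeddings used for clustering rather than keyword counts, but the control flow is the same: match the live situation to a known failure mode, then intervene before the crash.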
B. The Targeted Tutor (Better Training)
If you want to teach a robot to be better, you shouldn't just show it random videos. You should show it the specific things it is bad at.
- Old way: Collect 1,000 random videos of robots walking.
- New way: The system says, "We have a huge pile of 'Narrow Passage Confusion' failures. Let's go film 500 more videos specifically of robots trying to squeeze through tight hallways."
- Result: The robot learns much faster because it's practicing exactly what it's bad at.
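The targeting logic itself is simple once the taxonomy exists: count failures per category and aim new data collection at the biggest pile. The counts below are invented for illustration:

```python
# Sketch of targeted data collection: rank failure categories by how
# often they occur and focus new training videos on the worst one.
# The category names and counts are invented examples.
from collections import Counter

failure_log = (
    ["Narrow Passage Confusion"] * 500
    + ["Slippery Grip Failures"] * 120
    + ["Battery Exhaustion"] * 30
)

counts = Counter(failure_log)
worst_category, worst_count = counts.most_common(1)[0]
print(f"Collect more videos of: {worst_category} ({worst_count} failures)")
```

Instead of 1,000 random videos, the data budget goes straight at the 500-failure category, which is why the robot improves faster.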
The Big Picture
This paper is about moving from reactive (fixing things after they break) to proactive (understanding why they break so we can stop it from happening again).
Instead of a human manually sorting through a mountain of crash videos, this AI automatically organizes the chaos into a clear, understandable manual of "How Robots Fail." This manual helps engineers build safer, smarter robots that learn from their mistakes much faster.