TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions

This paper introduces TIMID, a weakly supervised video anomaly detection framework that leverages task and mistake prompts to detect complex, time-dependent errors in robot executions. It addresses the limitations of existing models and of out-of-the-box VLMs, and contributes a novel multi-robot simulation dataset for zero-shot evaluation.

Nerea Gallego (University of Zaragoza), Fernando Salanova (University of Zaragoza), Claudio Mannarano (University of Zaragoza, University of Torino), Cristian Mahulea (University of Zaragoza), Eduardo Montijano (University of Zaragoza)

Published Wed, 11 Ma

Imagine you are watching a robot try to make a sandwich.

If the robot drops the bread, that's an easy mistake to spot. It's a physical glitch, like a car skidding on ice. But what if the robot does everything perfectly physically? It picks up the bread, grabs the cheese, and puts it on the plate. But here's the catch: it puts the cheese on before it even takes the bread out of the bag.

The robot didn't break anything, and every single movement looked smooth. But the order was wrong. This is what the paper calls a "time-dependent mistake." It's not about how the robot moves, but when it moves.

This paper introduces a new AI system called TIMID (Time-Dependent Mistake Detection) designed to catch these sneaky, logic-based errors that other robots miss.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Good Actor" Robot

Current robots are getting very good at moving. If you ask a robot to "put the ball in the box," it can do it. But if you have a complex rule like, "You must put the ball in the box ONLY AFTER you have visited the lion," older systems get confused.

They might see the robot visit the lion, then visit the box, then put the ball in. Perfect! But if the robot visits the box before the lion, a human would say, "Hey, that's wrong!" The robot, however, might just think, "I did the actions, so I'm good."
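
To make the problem concrete, here is a toy sketch (not from the paper) of why order matters even when every individual action succeeds. The event names and the helper function are invented for illustration:

```python
def violates_order(events, before, after):
    """Return True if `after` occurs earlier in the trace than `before`."""
    try:
        return events.index(after) < events.index(before)
    except ValueError:
        return False  # one of the two events never happened

# Same set of actions, different order:
good_run = ["visit_lion", "visit_box", "put_ball_in_box"]
bad_run  = ["visit_box", "visit_lion", "put_ball_in_box"]

print(violates_order(good_run, "visit_lion", "visit_box"))  # False: rule respected
print(violates_order(bad_run,  "visit_lion", "visit_box"))  # True: box before lion
```

Both runs contain exactly the same actions, so a system that only checks "did each action happen?" passes them both; only the ordering check tells them apart.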

2. The Solution: The "Script Supervisor" (TIMID)

Think of TIMID as a Script Supervisor on a movie set.

  • The Input: You give the system three things:
    1. The Video: The footage of the robot working.
    2. The Script (Task): A text description of what should happen (e.g., "Visit the lion, then the ball").
    3. The "What-If" (Mistake): A text description of what shouldn't happen (e.g., "Visiting the ball before the lion").
  • The Magic: TIMID watches the video and compares it to the text rules. It doesn't just look for broken arms or dropped objects; it looks for bad timing.
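
One way to picture the comparison is scoring each frame against both prompts in a shared embedding space. This is a minimal sketch assuming a CLIP-style joint video/text encoder; the stand-in encoders, prompt wording, and scoring rule are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt):
    """Stand-in for a real text encoder (random embedding for the sketch)."""
    return rng.normal(size=16)

def encode_frames(video):
    """Stand-in for a real video encoder: one embedding per frame."""
    return rng.normal(size=(len(video), 16))

def mistake_scores(video, task_prompt, mistake_prompt):
    frames = encode_frames(video)
    task_vec = encode_text(task_prompt)
    err_vec = encode_text(mistake_prompt)

    def cos(a, b):  # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Higher score = frame resembles the mistake more than the task.
    return np.array([cos(f, err_vec) - cos(f, task_vec) for f in frames])

scores = mistake_scores(list(range(8)),
                        "visit the lion, then the ball",
                        "visiting the ball before the lion")
print(scores.shape)  # one score per frame: (8,)
```

The key idea the sketch captures: the mistake prompt gives the model something concrete to look for, frame by frame, instead of asking it to notice "anything unusual."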

3. How It Learns: The "Weak Teacher"

Usually, to teach an AI to spot mistakes, you need thousands of videos where someone has drawn a red line on the exact second the mistake happened. That takes forever to make.

TIMID uses a trick called Weak Supervision.

  • The Analogy: Imagine a teacher grading a 10-minute exam. Instead of marking every single wrong answer with a red pen, the teacher just writes "Fail" at the top of the paper if any mistake was made.
  • The Result: TIMID learns to find the specific moment the mistake happened just by knowing the whole video was "bad." It's like a detective who knows a crime happened in a room and has to figure out exactly when the thief entered, even if they only know the room was robbed.
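
The "Fail at the top of the paper" idea can be sketched as a multiple-instance-learning (MIL) objective, a common choice in weakly supervised anomaly detection (the paper's exact loss may differ). The model predicts a score per segment, but only the video-level pass/fail label supervises it:

```python
import numpy as np

def video_level_loss(frame_scores, video_is_bad):
    """MIL-style loss: only the video-level "Fail"/"Pass" label is known."""
    video_logit = np.max(frame_scores)        # most anomalous segment stands in
    p = 1.0 / (1.0 + np.exp(-video_logit))    # sigmoid: prob. the video is "bad"
    target = 1.0 if video_is_bad else 0.0
    eps = 1e-8                                 # numerical safety
    # Binary cross-entropy against the video-level label.
    return -(target * np.log(p + eps) + (1.0 - target) * np.log(1.0 - p + eps))

scores = np.array([-2.0, -1.0, 3.0])           # one raw score per segment
print(video_level_loss(scores, video_is_bad=True))   # small loss: segment 3 flagged
print(video_level_loss(scores, video_is_bad=False))  # large loss: pushes scores down
```

Because the max pools the whole video into its single most suspicious segment, "Fail" labels teach the model to raise exactly one segment's score, which is how the specific moment of the mistake emerges without any frame-level annotation.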

4. The Training Ground: The Robot Playground

Real robots are expensive, and making them fail on purpose is hard. So, the authors built a virtual playground (a simulation) where they created thousands of robot scenarios.

  • They programmed robots to follow rules (like "Don't touch the lion and the ball at the same time").
  • They intentionally made robots break these rules to create "bad" videos.
  • They even filmed real robots doing the tasks to see if the AI learned the logic or just memorized the look of the simulation.
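
The flavor of that data-generation loop can be sketched in a few lines; everything here (event names, run length, the specific rule) is invented for illustration. Sample random runs, then label each one by whether it breaks the rule:

```python
import random

random.seed(42)
EVENTS = ["touch_lion", "touch_ball", "idle"]

def sample_run(length=5):
    """Each time step, the robot performs one or two of the possible events."""
    return [set(random.sample(EVENTS, k=random.randint(1, 2)))
            for _ in range(length)]

def breaks_rule(run):
    """Rule: never touch the lion and the ball at the same time step."""
    return any({"touch_lion", "touch_ball"} <= step for step in run)

dataset = [(run, breaks_rule(run)) for run in (sample_run() for _ in range(100))]
bad = sum(label for _, label in dataset)
print(f"{bad} of {len(dataset)} runs are labelled 'bad'")
```

Because the rule is checked programmatically, every simulated run comes with a free, guaranteed-correct video-level label, which is exactly the weak supervision the training scheme needs.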

5. The Results: Why It Matters

The authors tested TIMID against other smart AI models (like giant language models that "see" videos).

  • The Big AI Models: They are great at saying, "That robot dropped a cup!" but terrible at saying, "That robot visited the lion before the ball." They get lost in the timeline.
  • TIMID: It is much faster and much better at spotting these timing errors. It understands the story of the task, not just the pictures.

The Bottom Line

This paper gives robots a new kind of "common sense" regarding time and order. Instead of just checking if a robot is moving correctly, TIMID checks if the robot is following the plot of the task.

It's the difference between a security camera that just records a fight (traditional AI) and a detective who watches the whole movie and says, "Wait a minute, the villain couldn't have been in the kitchen at 2:00 PM because he was still in the lobby at 1:55 PM!" (TIMID).

This is a huge step toward robots that can work safely and correctly in complex, real-world jobs where following the right order is just as important as doing the right action.