TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions

This paper introduces TIMID, a weakly supervised video anomaly detection framework that leverages task and mistake prompts to detect complex, time-dependent errors in robot executions. It addresses the limitations of existing models and of out-of-the-box VLMs, and contributes a novel multi-robot simulation dataset for zero-shot evaluation.

Nerea Gallego (University of Zaragoza), Fernando Salanova (University of Zaragoza), Claudio Mannarano (University of Zaragoza, University of Torino), Cristian Mahulea (University of Zaragoza), Eduardo Montijano (University of Zaragoza)

Published Wed, 11 Ma

Imagine you are watching a robot try to make a sandwich.

If the robot drops the bread, that's an easy mistake to spot. It's a physical glitch, like a car skidding on ice. But what if the robot does everything perfectly physically? It picks up the bread, grabs the cheese, and puts it on the plate. But here's the catch: it puts the cheese on before it even takes the bread out of the bag.

The robot didn't break anything, and every single movement looked smooth. But the order was wrong. This is what the paper calls a "time-dependent mistake." It's not about how the robot moves, but when it moves.

This paper introduces a new AI system called TIMID (Time-Dependent Mistake Detection) designed to catch these sneaky, logic-based errors that other robots miss.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Good Actor" Robot

Current robots are getting very good at moving. If you ask a robot to "put the ball in the box," it can do it. But if you have a complex rule like, "You must put the ball in the box ONLY AFTER you have visited the lion," older systems get confused.

They might see the robot visit the lion, then visit the box, then put the ball in. Perfect! But if the robot visits the box before the lion, a human would say, "Hey, that's wrong!" The robot, however, might just think, "I did the actions, so I'm good."
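
To make the problem concrete, here is a toy sketch (not from the paper) of why order matters even when every individual action succeeds. The event names and the helper function are invented for illustration:

```python
def violates_order(events, before, after):
    """Return True if `after` occurs earlier in the trace than `before`."""
    try:
        return events.index(after) < events.index(before)
    except ValueError:
        return False  # one of the two events never happened

# Same set of actions, different order:
good_run = ["visit_lion", "visit_box", "put_ball_in_box"]
bad_run  = ["visit_box", "visit_lion", "put_ball_in_box"]

print(violates_order(good_run, "visit_lion", "visit_box"))  # False: rule respected
print(violates_order(bad_run,  "visit_lion", "visit_box"))  # True: box before lion
```

Both runs contain exactly the same actions, so a system that only checks "did each action happen?" passes them both; only the ordering check tells them apart.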

2. The Solution: The "Script Supervisor" (TIMID)

Think of TIMID as a Script Supervisor on a movie set.

  • The Input: You give the system three things:
    1. The Video: The footage of the robot working.
    2. The Script (Task): A text description of what should happen (e.g., "Visit the lion, then the ball").
    3. The "What-If" (Mistake): A text description of what shouldn't happen (e.g., "Visiting the ball before the lion").
  • The Magic: TIMID watches the video and compares it to the text rules. It doesn't just look for broken arms or dropped objects; it looks for bad timing.
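
One way to picture the comparison is scoring each frame against both prompts in a shared embedding space. This is a minimal sketch assuming a CLIP-style joint video/text encoder; the stand-in encoders, prompt wording, and scoring rule are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt):
    """Stand-in for a real text encoder (random embedding for the sketch)."""
    return rng.normal(size=16)

def encode_frames(video):
    """Stand-in for a real video encoder: one embedding per frame."""
    return rng.normal(size=(len(video), 16))

def mistake_scores(video, task_prompt, mistake_prompt):
    frames = encode_frames(video)
    task_vec = encode_text(task_prompt)
    err_vec = encode_text(mistake_prompt)

    def cos(a, b):  # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Higher score = frame resembles the mistake more than the task.
    return np.array([cos(f, err_vec) - cos(f, task_vec) for f in frames])

scores = mistake_scores(list(range(8)),
                        "visit the lion, then the ball",
                        "visiting the ball before the lion")
print(scores.shape)  # one score per frame: (8,)
```

The key idea the sketch captures: the mistake prompt gives the model something concrete to look for, frame by frame, instead of asking it to notice "anything unusual."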

3. How It Learns: The "Weak Teacher"

Usually, to teach an AI to spot mistakes, you need thousands of videos where someone has drawn a red line on the exact second the mistake happened. That takes forever to make.

TIMID uses a trick called Weak Supervision.

  • The Analogy: Imagine a teacher grading a 10-minute exam. Instead of marking every single wrong answer with a red pen, the teacher just writes "Fail" at the top of the paper if any mistake was made.
  • The Result: TIMID learns to find the specific moment the mistake happened just by knowing the whole video was "bad." It's like a detective who knows a crime happened in a room and has to figure out exactly when the thief entered, even if they only know the room was robbed.
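
The "Fail at the top of the paper" idea can be sketched as a multiple-instance-learning (MIL) objective, a common choice in weakly supervised anomaly detection (the paper's exact loss may differ). The model predicts a score per segment, but only the video-level pass/fail label supervises it:

```python
import numpy as np

def video_level_loss(frame_scores, video_is_bad):
    """MIL-style loss: only the video-level "Fail"/"Pass" label is known."""
    video_logit = np.max(frame_scores)        # most anomalous segment stands in
    p = 1.0 / (1.0 + np.exp(-video_logit))    # sigmoid: prob. the video is "bad"
    target = 1.0 if video_is_bad else 0.0
    eps = 1e-8                                 # numerical safety
    # Binary cross-entropy against the video-level label.
    return -(target * np.log(p + eps) + (1.0 - target) * np.log(1.0 - p + eps))

scores = np.array([-2.0, -1.0, 3.0])           # one raw score per segment
print(video_level_loss(scores, video_is_bad=True))   # small loss: segment 3 flagged
print(video_level_loss(scores, video_is_bad=False))  # large loss: pushes scores down
```

Because the max pools the whole video into its single most suspicious segment, "Fail" labels teach the model to raise exactly one segment's score, which is how the specific moment of the mistake emerges without any frame-level annotation.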

4. The Training Ground: The Robot Playground

Real robots are expensive, and making them fail on purpose is hard. So, the authors built a virtual playground (a simulation) where they created thousands of robot scenarios.

  • They programmed robots to follow rules (like "Don't touch the lion and the ball at the same time").
  • They intentionally made robots break these rules to create "bad" videos.
  • They even filmed real robots doing the tasks to see if the AI learned the logic or just memorized the look of the simulation.
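
The flavor of that data-generation loop can be sketched in a few lines; everything here (event names, run length, the specific rule) is invented for illustration. Sample random runs, then label each one by whether it breaks the rule:

```python
import random

random.seed(42)
EVENTS = ["touch_lion", "touch_ball", "idle"]

def sample_run(length=5):
    """Each time step, the robot performs one or two of the possible events."""
    return [set(random.sample(EVENTS, k=random.randint(1, 2)))
            for _ in range(length)]

def breaks_rule(run):
    """Rule: never touch the lion and the ball at the same time step."""
    return any({"touch_lion", "touch_ball"} <= step for step in run)

dataset = [(run, breaks_rule(run)) for run in (sample_run() for _ in range(100))]
bad = sum(label for _, label in dataset)
print(f"{bad} of {len(dataset)} runs are labelled 'bad'")
```

Because the rule is checked programmatically, every simulated run comes with a free, guaranteed-correct video-level label, which is exactly the weak supervision the training scheme needs.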

5. The Results: Why It Matters

The authors tested TIMID against other smart AI models (like giant language models that "see" videos).

  • The Big AI Models: They are great at saying, "That robot dropped a cup!" but terrible at saying, "That robot visited the lion before the ball." They get lost in the timeline.
  • TIMID: It is much faster and much better at spotting these timing errors. It understands the story of the task, not just the pictures.

The Bottom Line

This paper gives robots a new kind of "common sense" regarding time and order. Instead of just checking if a robot is moving correctly, TIMID checks if the robot is following the plot of the task.

It's the difference between a security camera that just records a fight (traditional AI) and a detective who watches the whole movie and says, "Wait a minute, the villain couldn't have been in the kitchen at 2:00 PM because he was still in the lobby at 1:55 PM!" (TIMID).

This is a huge step toward robots that can work safely and correctly in complex, real-world jobs where following the right order is just as important as doing the right action.