HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

This paper introduces HomeSafe-Bench, a benchmark of 438 diverse cases for evaluating unsafe action detection in household scenarios, and proposes HD-Guard, a hierarchical dual-brain architecture that balances real-time inference efficiency with deep multimodal reasoning for safety monitoring.

Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu

Published Fri, 13 Ma

Imagine you've just bought a super-smart robot butler. It can cook, clean, and even fold your laundry. But there's a catch: your house is chaotic. There are kids running around, hot stoves, fragile vases, and slippery floors. Unlike a factory where everything is predictable, your home is a wild card.

The problem is, these robots are great at following instructions but terrible at "common sense." They might put a metal spoon in the microwave (treating the microwave as just another container) or walk straight into a chair because they didn't notice it had been moved.

This paper introduces a solution to keep your robot from accidentally burning down your house or breaking your grandma's vase. It does two main things:

1. The "Driving Test" for Robots: HomeSafe-Bench

Before you let a robot loose in your home, you need to test if it's safe. Currently, most tests are like a written exam: "Here's a picture of a stove. Is it safe?"

But real life is a movie, not a photo. A robot needs to understand motion and timing.

The authors created HomeSafe-Bench, which is like a massive, high-stakes driving test for robots.

  • The Course: They built 438 different "danger scenarios" in six rooms (kitchen, bedroom, bathroom, etc.).
  • The Simulation: They didn't just film actors; they used advanced AI to generate videos of robots doing dangerous things (like dropping a heavy pot or walking into a fire).
  • The Scoring: It's not just about spotting the danger. It's about when you spot it.
    • Green Zone: You see the danger coming early and stop the robot. (Perfect!)
    • Yellow Zone: You see it, but it's a bit late. (Okay, but risky.)
    • Red Zone: The robot has already crashed. (Fail.)
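To make the timing idea concrete, here is a minimal sketch of how an alert could be sorted into the three zones. It is an illustration, not the benchmark's actual scoring code: the function name `score_zone` and the two boundary parameters `warn_start` and `impact_time` are hypothetical, and a missed alert is assumed to count as Red.

```python
from enum import Enum
from typing import Optional


class Zone(Enum):
    GREEN = "green"    # danger spotted early
    YELLOW = "yellow"  # danger spotted late, but before the accident
    RED = "red"        # accident already happened, or no alert at all


def score_zone(alert_time: Optional[float],
               warn_start: float,
               impact_time: float) -> Zone:
    """Classify an alert by when it fired (all times in seconds into the clip).

    warn_start and impact_time are hypothetical per-video boundaries:
    alerts before warn_start are early, alerts at or after impact_time
    are too late.
    """
    if warn_start > impact_time:
        raise ValueError("warn_start must not come after impact_time")
    if alert_time is None or alert_time >= impact_time:
        return Zone.RED
    if alert_time >= warn_start:
        return Zone.YELLOW
    return Zone.GREEN
```

With boundaries at 2 and 3 seconds, an alert at 1 second lands in the Green Zone, an alert at 2.5 seconds in the Yellow Zone, and an alert at 3.5 seconds (or no alert) in the Red Zone.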

They tested many of the world's smartest AI models on this test. The results were surprising:

  • Some "giant" AI models were surprisingly bad at spotting danger in real time. They were too slow or too confident, often hallucinating dangers that weren't there (like stopping the robot because it thought it saw a ghost).
  • Smaller, faster models were actually better at spotting immediate threats.

2. The Solution: The "Dual-Brain" System (HD-Guard)

Since no single AI model is perfect at everything (being fast and being smart), the authors built a new safety system called HD-Guard.

Think of this system as a two-person security team working together:

🧠 The "Fast Brain" (The Reflex)

  • Who it is: A small, super-fast AI.
  • Job: It watches the video stream like a hawk. It doesn't think deeply; it just reacts.
  • Analogy: Imagine a reflex. If you touch a hot stove, you pull your hand away before you even feel the pain. The Fast Brain does this. It looks at the screen and says:
    • 🟢 Green: "All clear, keep going."
    • 🟡 Yellow: "Wait, something looks weird. Slow down and let the boss check."
    • 🔴 Red: "CRASH IMMINENT! STOP EVERYTHING!" (It hits the emergency brake instantly).

🧠 The "Slow Brain" (The Expert)

  • Who it is: A massive, super-smart AI with deep knowledge of physics and common sense.
  • Job: It only wakes up when the Fast Brain says "Yellow."
  • Analogy: This is the detective. When the Fast Brain flags a weird situation, the Slow Brain zooms in. It asks: "Is that a sealed plastic box? Is it going into a microwave? Oh no, that will explode!" It uses deep logic to confirm if it's actually dangerous.

How They Work Together

The magic is in the handoff:

  1. The Fast Brain watches everything at high speed.
  2. If it sees something obvious (like a robot falling), it stops the robot immediately (Red).
  3. If it's unsure (Yellow), it pauses the robot and asks the Slow Brain, "Hey, is this actually dangerous?"
  4. The Slow Brain takes a few seconds to think, analyzes the physics, and gives a final verdict.
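The four steps above can be sketched as a single decision function. This is a rough illustration under stated assumptions, not the authors' implementation: `monitor_step`, `fast_brain`, and `slow_brain` are hypothetical names, the Fast Brain is assumed to return one of the strings "green", "yellow", or "red", and the Slow Brain is assumed to return True when it confirms the danger.

```python
from typing import Any, Callable


def monitor_step(frame: Any,
                 fast_brain: Callable[[Any], str],
                 slow_brain: Callable[[Any], bool]) -> str:
    """Decide what the robot should do for one chunk of video.

    Returns "continue", "stop", or "resume" (resume after a pause).
    """
    signal = fast_brain(frame)  # cheap check, runs on every frame
    if signal == "red":
        # Obvious danger: stop at once, without waiting for the Slow Brain.
        return "stop"
    if signal == "yellow":
        # Unsure: the robot is paused while the expensive model decides.
        is_dangerous = slow_brain(frame)
        return "stop" if is_dangerous else "resume"
    # Green: nothing suspicious, the Slow Brain is never called.
    return "continue"
```

The design point is that the expensive model only runs on the Yellow path, so its slower reasoning is spent on the ambiguous moments rather than on every frame.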

Why is this better?

  • Speed: The Fast Brain catches immediate crashes instantly.
  • Smarts: The Slow Brain catches tricky, hidden dangers (like the microwave explosion) that a simple reflex would miss.
  • Balance: It avoids being either too slow (missing a crash) or too jumpy (stopping for no reason).

The Big Takeaway

The paper shows that to make robots safe in our messy homes, we can't just rely on one giant, slow brain. We need a hybrid system: a fast reflex to catch immediate dangers and a smart brain to understand complex situations.

It's like having a bodyguard who is fast enough to tackle a threat before it happens, but smart enough to know when a threat is actually just a harmless shadow. This "Dual-Brain" approach is the key to letting robots into our homes without us worrying they'll turn our kitchen into a disaster zone.