HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Imagine you've just bought a super-smart robot butler. It can cook, clean, and even fold your laundry. But there's a catch: your house is chaotic. There are kids running around, hot stoves, fragile vases, and slippery floors. Unlike a factory where everything is predictable, your home is a wild card.

The problem is, these robots are great at following instructions but terrible at "common sense." They might put a metal spoon in the microwave (treating it as harmless) or bump straight into a chair because they didn't notice it had been moved.

This paper introduces a solution to keep your robot from accidentally burning down your house or breaking your grandma's vase. It does two main things:

1. The "Driving Test" for Robots: HomeSafe-Bench

Before you let a robot loose in your home, you need to test if it's safe. Currently, most tests are like a written exam: "Here's a picture of a stove. Is it safe?"

But real life is a movie, not a photo. A robot needs to understand motion and timing.

The authors created HomeSafe-Bench, which is like a massive, high-stakes driving test for robots.

The Course: They built 438 different "danger scenarios" in six rooms (kitchen, bedroom, bathroom, etc.).
The Simulation: They didn't just film actors; they combined physics simulation with AI video generation to produce clips of robots doing dangerous things (like dropping a heavy pot or walking into a fire).
The Scoring: It's not just about spotting the danger. It's about when you spot it.
- Green Zone: You see the danger coming early and stop the robot. (Perfect!)
- Yellow Zone: You see it, but it's a bit late. (Okay, but risky.)
- Red Zone: The robot has already crashed. (Fail.)

They tested many of the world's smartest AI models on this test. The results were surprising:

Some "giant" AI models were actually terrible at spotting danger in real-time. They were too slow or too confident, often hallucinating dangers that weren't there (like stopping the robot because it thought it saw a ghost).
Smaller, faster models were actually better at spotting immediate threats.

2. The Solution: The "Dual-Brain" System (HD-Guard)

Since no single AI model is perfect at everything (being fast and being smart), the authors built a new safety system called HD-Guard.

Think of this system as a two-person security team working together:

🧠 The "Fast Brain" (The Reflex)

Who it is: A small, super-fast AI.
Job: It watches the video stream like a hawk. It doesn't think deeply; it just reacts.
Analogy: Imagine a reflex. If you touch a hot stove, you pull your hand away before you even feel the pain. The Fast Brain does this. It looks at the screen and says:
- 🟢 Green: "All clear, keep going."
- 🟡 Yellow: "Wait, something looks weird. Slow down and let the boss check."
- 🔴 Red: "CRASH IMMINENT! STOP EVERYTHING!" (It hits the emergency brake instantly).

🧠 The "Slow Brain" (The Expert)

Who it is: A massive, super-smart AI with deep knowledge of physics and common sense.
Job: It only wakes up when the Fast Brain says "Yellow."
Analogy: This is the detective. When the Fast Brain flags a weird situation, the Slow Brain zooms in. It asks: "Is that a sealed plastic box? Is it going into a microwave? Oh no, that will explode!" It uses deep logic to confirm if it's actually dangerous.

How They Work Together

The magic is in the handoff:

The Fast Brain watches everything at high speed.
If it sees something obvious (like a robot falling), it stops the robot immediately (Red).
If it's unsure (Yellow), it pauses the robot and asks the Slow Brain, "Hey, is this actually dangerous?"
The Slow Brain takes a few seconds to think, analyzes the physics, and gives a final verdict.

Why is this better?

Speed: The Fast Brain catches immediate crashes instantly.
Smarts: The Slow Brain catches tricky, hidden dangers (like the microwave explosion) that a simple reflex would miss.
Balance: It tackles both failure modes at once: being too slow (missing a crash) and being too dumb (stopping for no reason).

The Big Takeaway

The paper shows that to make robots safe in our messy homes, we can't just rely on one giant, slow brain. We need a hybrid system: a fast reflex to catch immediate dangers and a smart brain to understand complex situations.

It's like having a bodyguard who is fast enough to tackle a threat before it happens, but smart enough to know when a threat is actually just a harmless shadow. This "Dual-Brain" approach is the key to letting robots into our homes without us worrying they'll turn our kitchen into a disaster zone.

Technical summary of the paper "HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios."

1. Problem Statement

The rapid deployment of embodied agents (robots) in unstructured household environments introduces significant safety risks that differ from controlled industrial settings. Current safety evaluation frameworks suffer from three critical limitations:

Static and Text-Centric: Existing benchmarks often rely on static images or text-only inputs, failing to capture the continuous, dynamic nature of physical hazards.
Lack of Specificity: General hazard datasets (e.g., ASIMOV-v2) focus on broad human safety rather than specific failure modes of embodied agents (e.g., perception latency, lack of physical common sense).
Coupled Evaluation: Some benchmarks (e.g., IS-Bench) tightly couple safety perception with action planning, preventing the independent evaluation of Vision-Language Models (VLMs) as standalone safety monitors.

Consequently, there is no dedicated framework for evaluating how well VLMs detect unsafe agent actions in real-time household scenarios, and no existing architecture effectively balances low-latency response against deep multimodal reasoning.

2. Methodology

The paper introduces two core components: a new benchmark dataset (HomeSafe-Bench) and a novel detection architecture (HD-Guard).

A. HomeSafe-Bench (The Benchmark)

HomeSafe-Bench is a challenging dataset designed to stress-test VLMs on unsafe action detection.

Construction Pipeline: A hybrid approach combining Large Language Models (LLMs) for scenario generation, physical simulation (BEHAVIOR platform), and advanced video generation (Veo-3.1) to ensure both physical accuracy and visual realism.
Scale & Diversity: Contains 438 video cases across 6 household functional areas (bedroom, bathroom, living room, dining room, study, balcony).
Multidimensional Annotations:
- Temporal: Annotated with four critical timestamps: Intent Onset, Point-of-No-Return (PNR), Intervention Deadline (200ms before PNR), and Impact.
- Cognitive Difficulty: Categorized into D1 (Perceptual), D2 (Physical), and D3 (Causal/Hidden) reasoning.
- Severity: Graded L1–L4 based on NEISS guidelines (ranging from minor property damage to fatality).
- Hazard Categories: Distinguishes between mechanical (C1), cutting/piercing (C2), thermal/electrical (C3), and environmental damage (C4).
Quality Control: Rigorous dual-annotation process with high inter-annotator agreement (measured with Cohen's $\kappa$ and Lin's CCC), followed by re-annotation of 238 conflicting samples.
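
To make the annotation scheme concrete, each video can be pictured as one small record. The following is a minimal sketch, not the authors' data format: the class name `HazardAnnotation`, its field names, and the zone boundaries in `timing_zone` are assumptions made for illustration. Only the four timestamps, the 200 ms offset between the Intervention Deadline and the PNR, and the D/L/C label ranges come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Offset between the Intervention Deadline and the Point-of-No-Return,
# as stated in the annotation description (200 ms).
DEADLINE_OFFSET_S = 0.2


@dataclass(frozen=True)
class HazardAnnotation:
    """Hypothetical per-video annotation record (field names are assumed)."""
    intent_onset_s: float  # when the unsafe intent becomes visible
    pnr_s: float           # Point-of-No-Return
    impact_s: float        # when the harm actually occurs
    difficulty: str        # "D1", "D2", or "D3"
    severity: str          # "L1" .. "L4"
    category: str          # "C1" .. "C4"

    @property
    def intervention_deadline_s(self) -> float:
        # Derived rather than stored: 200 ms before the PNR.
        return self.pnr_s - DEADLINE_OFFSET_S


def timing_zone(ann: HazardAnnotation, alert_s: Optional[float]) -> str:
    """Map an alert time to a zone. The boundaries are an assumed rule."""
    if alert_s is None:
        return "missed"      # the monitor never raised an alert
    if alert_s < ann.intent_onset_s:
        return "premature"   # warning before any unsafe intent is visible
    if alert_s <= ann.intervention_deadline_s:
        return "green"       # early enough to intervene
    if alert_s < ann.impact_s:
        return "yellow"      # detected, but past the deadline
    return "red"             # at or after impact


# Illustrative values only; not taken from the benchmark.
example = HazardAnnotation(2.0, 5.0, 6.0, "D2", "L3", "C3")
print(timing_zone(example, 3.0))
```

Deriving the deadline from the PNR, rather than storing it, keeps the two timestamps from drifting apart if an annotation is corrected later.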

B. HD-Guard (The Solution)

HD-Guard (Hierarchical Dual-Brain Guard for Household Safety) is a streaming architecture designed for real-time monitoring.

FastBrain (Lightweight): Uses a small, high-frequency VLM (e.g., MiniCPM-o-4.5) to process frames at up to 10 FPS. It classifies safety states into a traffic-light system:
- Green: Safe (low sampling rate).
- Yellow: Potential risk (triggers high sampling rate and alerts SlowBrain).
- Red: Imminent danger (immediate hardware stop).
SlowBrain (Heavyweight): Uses a large-scale VLM (e.g., Qwen3-VL-30B) for deep multimodal reasoning. It is triggered only on "Yellow" states to perform Chain-of-Thought (CoT) analysis on physics, intent, and latent hazards.
Integration Strategy: The system operates asynchronously. The FastBrain maintains real-time supervision. If the SlowBrain is computing, the FastBrain retains the authority to override with a "Red" alert if conditions worsen, ensuring minimal latency for immediate dangers while leveraging deep reasoning for complex cases.
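
The handoff between the two components can be sketched as a small frame loop. This is a simplified illustration under stated assumptions, not the paper's implementation: `run_guard`, `demo_fast`, and `demo_slow` are hypothetical names, the two "brains" are stub functions rather than VLMs, and SlowBrain's asynchronous computation is modeled as a fixed delay of `slow_latency_frames` frame periods (an arbitrary placeholder value).

```python
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Light(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def run_guard(
    frames: Sequence[str],
    fast_brain: Callable[[str], Light],
    slow_brain: Callable[[str], bool],
    slow_latency_frames: int = 3,
) -> Optional[Tuple[int, str]]:
    """Return (frame_index, source) of the first stop, or None if never stopped."""
    pending: Optional[Tuple[int, bool]] = None  # (ready_at_index, is_unsafe)
    for i, frame in enumerate(frames):
        # FastBrain inspects every frame, even while SlowBrain is busy.
        light = fast_brain(frame)
        if light is Light.RED:
            return i, "fast"  # immediate stop; overrides any pending check
        # Deliver SlowBrain's verdict once its simulated latency has elapsed.
        if pending is not None and i >= pending[0]:
            is_unsafe = pending[1]
            pending = None
            if is_unsafe:
                return i, "slow"
        # Escalate a Yellow frame only when SlowBrain is idle.
        if light is Light.YELLOW and pending is None:
            pending = (i + slow_latency_frames, slow_brain(frame))
    return None  # a verdict still pending when the stream ends is dropped


def demo_fast(frame: str) -> Light:
    """Stub FastBrain: labels are keyed on toy frame names."""
    if frame == "fall":
        return Light.RED
    return Light.YELLOW if frame == "odd" else Light.GREEN


def demo_slow(frame: str) -> bool:
    """Stub SlowBrain: treats every escalated frame as unsafe."""
    return True


print(run_guard(["ok", "odd", "ok", "ok", "ok"], demo_fast, demo_slow))
print(run_guard(["ok", "odd", "fall", "ok", "ok"], demo_fast, demo_slow))
```

In the second call the "fall" frame arrives while the SlowBrain verdict is still pending, so the stop comes from the fast path. That is the override behavior described in the integration strategy.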

3. Key Contributions

HomeSafe-Bench: The first dedicated benchmark for evaluating VLMs on unsafe action detection in household embodied agents, featuring 438 diverse, physically accurate, and visually realistic cases with fine-grained temporal and semantic annotations.
HD-Guard Architecture: A novel hierarchical dual-brain system that decouples rapid perception from deep reasoning, achieving a favorable trade-off between inference latency and detection accuracy.
Comprehensive Analysis: Extensive evaluation revealing that current VLMs frequently miss critical visual entities, exhibit weak temporal grounding, and struggle with causal reasoning. The paper provides a detailed error analysis across difficulty levels and severity categories.

4. Experimental Results

Benchmark Performance (Table 1)

Open-Source vs. Closed-Source: Open-source models (e.g., InternVL3.5-8B) outperformed leading closed-source models (e.g., GPT-5.1) in overall safety and detection sensitivity.
False Alarms: Top-performing models suffered from high "over-reaction" rates (premature warnings). Frequent unnecessary stops raise operational costs, which limits their practicality for real-world deployment.
Scaling Limits: Simply increasing model size did not guarantee better safety performance; smaller models often achieved better efficiency-weighted scores.

HD-Guard Performance

Efficiency-Accuracy Trade-off: HD-Guard achieved a 38% increase in Weighted Safety Score (WSS) compared to the standalone FastBrain while maintaining nearly identical latency (~3.1s).
Comparison: It operated 2x faster than the standalone SlowBrain (Qwen3-Omni) while achieving higher safety scores (24.94 vs. 19.35).
Error Reduction:
- Reduced Visual Entity Omissions from ~30% (baseline) to 0.5%.
- Eliminated Reasoning Deficits in hard tasks (D3), achieving 0% deficit rate compared to 45.6% for baseline models.
- Kept the false alarm rate at 25.1%, lower than GPT-5.1's 29.9%.
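
The reported scores can be put on a common footing with a quick calculation. This is a back-of-the-envelope sketch, not a result from the paper: `relative_gain` is a helper introduced here, and the implied FastBrain score assumes that the 38% figure is a relative increase, which the summary does not state explicitly.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


hd_guard_wss = 24.94    # reported above
slow_brain_wss = 19.35  # reported above

# Relative WSS gain of HD-Guard over the standalone SlowBrain.
gain_over_slow = relative_gain(hd_guard_wss, slow_brain_wss)

# ASSUMPTION: if the 38% gain over FastBrain is relative, the standalone
# FastBrain score would be roughly this value. It is not a reported figure.
implied_fast_wss = hd_guard_wss / 1.38

print(f"Gain over SlowBrain: {gain_over_slow:.1%}")
print(f"Implied FastBrain WSS: {implied_fast_wss:.2f}")
```

Under that reading, HD-Guard improves on both standalone components, by about 29% over SlowBrain and 38% over FastBrain.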

Ablation Studies

Sampling Frequency: The optimal sampling rate was found to be 5 FPS. Lower rates (1 FPS) missed transient hazards, while higher rates (10 FPS) introduced redundant noise without proportional safety gains.
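
One way to see why a low sampling rate misses transient hazards is to count how many frames are guaranteed to fall inside a short event. The sketch below is illustrative only: the function names are introduced here, and the 200 ms event length is borrowed from the Intervention Deadline offset purely as an example duration. The ablation result itself is empirical and is not derived from this calculation.

```python
import math


def frame_interval_ms(fps: float) -> float:
    """Time between consecutive sampled frames, in milliseconds."""
    return 1000.0 / fps


def guaranteed_samples(event_ms: float, fps: float) -> int:
    """Minimum number of sampled frames inside an event of length
    `event_ms`, regardless of where the event starts relative to the
    sampling clock (worst-case phase)."""
    return math.floor(event_ms * fps / 1000.0)


for fps in (1, 5, 10):
    print(fps, frame_interval_ms(fps), guaranteed_samples(200, fps))
```

At 1 FPS the frame interval (1000 ms) is longer than the 200 ms window, so an event of that length can fall entirely between two samples. At 5 FPS at least one frame always lands inside it, and 10 FPS adds only a second frame of the same event.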

5. Significance and Future Directions

Practical Deployment: This work bridges the gap between theoretical VLM capabilities and the rigorous safety requirements of household robotics. HD-Guard provides a viable blueprint for real-time safety monitors that do not sacrifice speed for accuracy.
Identified Bottlenecks: The study highlights that current VLMs struggle with temporal grounding (predicting future states) and physical common sense (e.g., understanding thermodynamics or occlusion).
Future Work: The authors note that HD-Guard currently lacks long-context memory to track historical object states, which is a limitation for hazards that evolve slowly. Future iterations aim to integrate long-term memory to further reduce latency-induced failures.

In summary, HomeSafe-Bench provides a dedicated benchmark for safety evaluation in embodied AI, and HD-Guard demonstrates that a hierarchical, dual-brain approach can improve both the speed and the accuracy of safety monitoring for robots in complex, unstructured home environments.