Proactive Rejection and Grounded Execution: A Dual-Stage Intent Analysis Paradigm for Safe and Efficient AIoT Smart Homes

Imagine you have a super-smart but slightly scatterbrained personal assistant named "Alex" who lives in your house. Alex is powered by a giant brain (a Large Language Model) that knows everything about the world, but Alex has never actually seen your house.

When you tell Alex, "Turn on the reading lamp in the bedroom and lock the front door," Alex might get confused. Maybe your house doesn't have a reading lamp, or maybe the "front door" is actually a sliding glass door.

In the past, if you asked a smart home system to do this, one of two bad things tended to happen:

The "Hallucination" Problem: Alex would just guess. If there's no reading lamp, Alex might pretend there is one and try to turn on a light that doesn't exist, or worse, turn on the wrong light in the wrong room.
The "Nagging" Problem: To be safe, other systems would constantly stop and ask you, "Which lamp? Is it the one in the living room or the bedroom?" This turns a simple command into a frustrating conversation.

The paper introduces a new system called DS-IA (Dual-Stage Intent-Aware). Think of it as giving Alex a two-step check, a security guard followed by a strict checklist, before Alex is allowed to touch anything in your house.

Here is how it works, using simple analogies:

The Two-Stage Process

Stage 1: The "Bouncer" at the Door (Global Intent Analysis)

Imagine a bouncer at a nightclub. Before you even get to the dance floor (your house), you have to show your ID.

What it does: When you give a command, this "Bouncer" looks at your house's current map (the "Environment Snapshot").
The Magic: If you say, "Turn on the kitchen dehumidifier," but the Bouncer looks at the map and sees no dehumidifier in the kitchen, they immediately stop you. They say, "Sorry, that device doesn't exist. I'm rejecting this request."
Why it's great: It stops the system from wasting time trying to do impossible things. It filters out "fake" requests before they even get to the action phase.

Stage 2: The "Strict Inspector" (Grounded Execution)

If the Bouncer lets you through, you move to the second stage: a very strict inspector who checks every single step against a physical checklist.

The Checklist: The inspector checks three things for every action:
1. Room Check: Does this room actually exist?
2. Device Check: Is the device actually in that room?
3. Capability Check: Can this device actually do what you asked? (e.g., Can a lamp "lock" a door? No.)
The "Mixed" Command Solution: This is the coolest part. Imagine you say: "Turn on the bedroom lamp (which exists) AND turn on the kitchen heater (which doesn't exist)."
- Old systems would either fail completely or try to turn on the wrong heater.
- DS-IA says: "Okay, I can do the lamp. But the heater? I'll mark that as 'Failed' and tell you, but I'll still turn on the lamp." It doesn't drop the whole task; it flags the broken part and carries out the rest.

Why This Changes Everything

The paper tested this new system against the old ways (like the "SAGE" system) and found two huge wins:

1. It Stops the "Fake" Actions (Safety)
Old systems often tried to "force" a solution. If you asked for a non-existent device, they might pretend it was a similar device and turn that on instead. This is dangerous (imagine turning on a stove when you meant to turn on a fan).

The Result: DS-IA acts like a "Semantic Firewall." It rejected invalid instructions 87% of the time, whereas the baseline did so only 14% of the time and SAGE about 30% of the time. It refuses to lie to you.

2. It Stops the "Nagging" (Efficiency)
Old systems were so scared of making mistakes that they asked you questions constantly. "Which lamp?" "Which door?"

The Result: DS-IA is smart enough to look at the house state and figure it out on its own. Where SAGE solved tasks on its own 43% of the time, DS-IA did so 71% of the time. It only asks you for help when it truly doesn't know, making the experience feel much more seamless.

The Bottom Line

Think of DS-IA as upgrading your smart home assistant from a confident but reckless teenager (who guesses and nags) to a professional butler (who checks the inventory, refuses impossible orders politely, and executes the possible ones perfectly without bothering you).

It bridges the gap between "talking" and "doing" by ensuring that every action the AI takes is grounded in the reality of your home, not just a guess in its head.

1. Problem Statement

The integration of Large Language Models (LLMs) into Internet of Things (IoT) smart homes faces two critical bottlenecks as LLMs transition from information providers to embodied agents:

The Reliability Gap (Entity Hallucinations): Direct execution of LLM-generated commands often leads to "forced grounding," where the model hallucinates non-existent devices or capabilities (e.g., trying to control a "kitchen dehumidifier" that doesn't exist) because LLMs lack strict constraints from the physical environment.
The Interaction Frequency Dilemma: Existing iterative frameworks (like SAGE) struggle to balance "silent execution" with "active questioning."
- Conservative strategies frequently ask users for clarification, disrupting the convenience of smart homes.
- Aggressive strategies execute recklessly to minimize interaction, leading to safety risks.
- Current reactive "think-while-acting" loops lack the global context to distinguish between preference ambiguities (requiring user input) and invalid entity errors (requiring rejection), leading to task omission or forced hallucinations.

2. Methodology: The DS-IA Framework

The authors propose the Dual-Stage Intent-Aware (DS-IA) framework, which adopts an "Analyze-then-Act" paradigm. It decouples high-level intent understanding from low-level physical execution.

Stage 1: Global Intent Analysis (The Semantic Firewall)

Function: Acts as a proactive router that analyzes user instructions against the current environmental state snapshot ( $S_t$ ).
Mechanism: It classifies instructions into three categories:
1. $C_{valid}$ : All entities exist; proceed to execution.
2. $C_{invalid}$ : Explicitly targets non-existent devices; triggers Early Rejection.
3. $C_{mixed}$ : Contains a mix of valid and invalid sub-tasks.
Goal: Filter out impossible commands before they reach the expensive code generation stage, preventing the model from entering a "blind exploration" loop.
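
As a rough illustration of the three-way routing described above, the sketch below classifies an instruction once the (room, device) pairs it mentions are known. In DS-IA this step is performed by an LLM reading the snapshot; the dictionary layout of the snapshot, the `Intent` enum, and the `classify_intent` function are assumptions made for this example, not the paper's implementation.

```python
from enum import Enum


class Intent(Enum):
    VALID = "C_valid"
    INVALID = "C_invalid"
    MIXED = "C_mixed"


# Hypothetical snapshot S_t: room name -> {device name -> set of capabilities}.
Snapshot = dict[str, dict[str, set[str]]]


def classify_intent(targets: list[tuple[str, str]], snapshot: Snapshot) -> Intent:
    """Classify an instruction from the (room, device) pairs it refers to."""
    if not targets:
        raise ValueError("instruction references no devices")
    # A target exists only if its room is known and the device is in that room.
    exists = [device in snapshot.get(room, {}) for room, device in targets]
    if all(exists):
        return Intent.VALID
    if not any(exists):
        return Intent.INVALID
    return Intent.MIXED


# Illustrative snapshot and instructions (made up for this example).
snapshot = {
    "bedroom": {"lamp": {"turn_on", "turn_off"}},
    "kitchen": {"oven": {"turn_on", "turn_off"}},
}
only_lamp = classify_intent([("bedroom", "lamp")], snapshot)
only_dehumidifier = classify_intent([("kitchen", "dehumidifier")], snapshot)
lamp_and_heater = classify_intent(
    [("bedroom", "lamp"), ("kitchen", "heater")], snapshot
)
```

In this sketch only the `INVALID` case would be rejected before code generation; `VALID` and `MIXED` instructions move on to Stage 2.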

Stage 2: Grounded Execution & Cascade Verification

Generation: For valid/mixed intents, the LLM generates a raw candidate action sequence ( $A_{raw}$ ) using In-Context Learning, guided by the diagnostic reasoning from Stage 1.
Three-Level Cascade Verifier: A strict, deterministic rule checker validates every atomic action ( $a_k$ ) in three sequential levels:
1. Spatial Topology ( $V_R$ ): Does the target room exist?
2. Entity Alignment ( $V_D$ ): Does the specific device exist in that room?
3. Affordance ( $V_C$ ): Does the device support the requested capability?
Mixed Intent Resolution (Generate-and-Filter): For $C_{mixed}$ tasks, the system does not fail the whole sequence. Instead, it replaces invalid actions with a standardized error token ( $\epsilon_{err}$ ) while preserving valid actions. This prevents Task Omission (dropping valid sub-tasks due to one error) and Forced Hallucination.
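
The cascade check and the generate-and-filter step can be sketched as follows. This is a minimal illustration under stated assumptions: the `Action` dataclass, the snapshot layout, the function names, and the `ERR_TOKEN` string (standing in for $\epsilon_{err}$) are hypothetical, and the paper's verifier may differ in its details.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical stand-in for the standardized error token epsilon_err.
ERR_TOKEN = "error_input"


@dataclass(frozen=True)
class Action:
    room: str
    device: str
    capability: str


def verify(action: Action, snapshot: dict) -> Optional[str]:
    """Return None if the action passes all levels, else the first level that fails."""
    if action.room not in snapshot:  # V_R: spatial topology
        return "V_R"
    devices = snapshot[action.room]
    if action.device not in devices:  # V_D: entity alignment
        return "V_D"
    if action.capability not in devices[action.device]:  # V_C: affordance
        return "V_C"
    return None


def generate_and_filter(a_raw: list, snapshot: dict) -> list:
    """Keep valid actions and replace each invalid one with ERR_TOKEN."""
    filtered: list[Union[Action, str]] = []
    for action in a_raw:
        filtered.append(action if verify(action, snapshot) is None else ERR_TOKEN)
    return filtered


# Illustrative snapshot and raw action sequence (made up for this example).
snapshot = {
    "bedroom": {"lamp": {"turn_on", "turn_off"}},
    "kitchen": {"oven": {"turn_on", "turn_off"}},
}
a_raw = [
    Action("bedroom", "lamp", "turn_on"),
    Action("kitchen", "heater", "turn_on"),
]
# The lamp action is kept; the heater action fails V_D and becomes ERR_TOKEN.
filtered = generate_and_filter(a_raw, snapshot)
```

Because the levels run in order and stop at the first failure, a missing room is never reported as a missing device, and the output sequence always has the same length as the raw one, so no valid sub-task is dropped.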

3. Key Contributions

Proactive Paradigm Shift: Moves from reactive "error-then-fix" loops to a proactive "Analyze-then-Act" approach, explicitly decoupling macro-intent routing from micro-action execution.
Dual-Stage Architecture with Cascade Verification: Introduces a "Semantic Firewall" (Stage 1) and a "Room-Device-Capability" cascade check (Stage 2) to ensure physical grounding.
Generate-and-Filter Strategy: A novel method for handling mixed intents that maximizes task recall by executing valid sub-tasks while safely bypassing invalid ones, avoiding the "all-or-nothing" failure mode.
State-Aware Disambiguation: Resolves the Interaction Frequency Dilemma by inferring intent from the environmental snapshot, reducing unnecessary user queries while maintaining high precision in identifying irreducible ambiguities.

4. Experimental Results

The framework was evaluated on HomeBench (robustness/safety) and the SAGE Benchmark (interaction efficiency).

HomeBench (Physical Safety & Robustness)

Exact Match (EM) Rate: DS-IA achieved 58.56%, outperforming the Baseline (29.98%) and SAGE (1.77%).
Rejection of Invalid Instructions: DS-IA achieved an 87.04% rejection rate for invalid single instructions (IS), compared to 14.07% for the Baseline and 29.84% for SAGE. This supports the effectiveness of the Early Rejection mechanism in preventing hallucinations.
Mixed Tasks: DS-IA significantly improved F1 scores on mixed tasks (77.42% vs. 33.23% for Baseline), demonstrating its ability to handle complex, multi-step instructions with partial validity.
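
To make the two metrics concrete, the sketch below shows one common way to score a single instruction by comparing the predicted set of actions with the gold set. The summary does not spell out how HomeBench computes EM and F1, so the set-based definitions, the string format of the actions, and the `error_input` marker are assumptions for illustration only.

```python
def exact_match(pred: list, gold: list) -> bool:
    """True only if the predicted actions are exactly the gold actions."""
    return set(pred) == set(gold)


def f1_score(pred: list, gold: list) -> float:
    """Harmonic mean of precision and recall over the two action sets."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set and not gold_set:
        return 1.0
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical mixed task: one valid action plus one expected error marker.
gold = ["bedroom.lamp.turn_on", "error_input"]

# Task omission: the invalid part is silently dropped.
omitted = ["bedroom.lamp.turn_on"]
# Forced hallucination: the invalid part is mapped to some other device.
hallucinated = ["bedroom.lamp.turn_on", "living_room.heater.turn_on"]

f1_omitted = f1_score(omitted, gold)            # precision 1.0, recall 0.5
f1_hallucinated = f1_score(hallucinated, gold)  # precision 0.5, recall 0.5
```

Under these assumed definitions, both failure modes miss the exact match and lose F1, which is why a system that keeps valid sub-tasks and marks invalid ones explicitly would be expected to score higher on mixed tasks.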

SAGE Benchmark (Interaction Efficiency)

Autonomous Success Rate: DS-IA increased the rate of resolving tasks without user intervention from 42.86% (SAGE) to 71.43%, a gain of about 28.6 percentage points. This indicates a substantial reduction in unnecessary user disturbance.
Clarification Success Rate: Maintained at 75.00%, showing the system still correctly identifies when human help is truly needed.
Persistence: Achieved 100% success on persistence tasks (long-term state monitoring), whereas SAGE struggled (25.00%) due to context forgetfulness.

Efficiency Analysis

Computational Savings: By rejecting invalid intents in Stage 1, DS-IA reduced the number of expensive autoregressive code generation calls by 18.1% and saved over 427,000 tokens in Stage 2. In effect, it spends cheap prefill tokens to avoid expensive decoding tokens.

5. Significance

This paper presents a critical step forward in making LLM-based smart home agents safe, reliable, and efficient.

Safety First: It establishes a "Do No Harm" protocol by blocking hallucinated commands before they reach a physical device, addressing a primary barrier to deploying AI in safety-critical IoT environments.
User Experience: By resolving the "Interaction Frequency Dilemma," it creates a more seamless user experience where the agent acts autonomously when possible but asks for help only when strictly necessary.
Scalability: The modular design (Intent Analysis + Cascade Verification) offers a blueprint for future embodied agents, suggesting that separating semantic reasoning from physical constraints is essential for robust real-world deployment.

The authors have open-sourced their codebase, environment snapshots, and evaluation scripts to facilitate further research in embodied AI and IoT safety.