SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts

This paper introduces SmartBench, the first dataset designed to evaluate LLMs on detecting anomalous device states and behavioral contexts in smart homes, revealing that current state-of-the-art models struggle significantly with this critical task.

Qingsong Zou, Zhi Yan, Zhiyao Xu, Kuofeng Gao, Jingyu Xiao, Yong Jiang

Published 2026-03-10

Imagine your smart home is like a very polite, highly educated butler who has read every book in the library. You can tell him, "Turn on the lights," or "Set the thermostat to 72 degrees," and he does it perfectly. He understands your habits and your voice.

But here's the problem: What happens when the butler falls asleep on the job?

What if the heater is blasting while the air conditioner is freezing the room? What if the front door is wide open while you're at work? What if the kitchen faucet has been running for three hours, and the butler just stands there watching the water bill skyrocket?

This is the story of SmartBench, a new study that tests whether our "super-smart" AI butlers can actually wake up and notice when something is wrong in the house.

The Big Idea: The "Butler" vs. The "Detective"

For a long time, researchers have been teaching AI (Large Language Models, or LLMs) to be butlers. They are great at following orders: "Play jazz," "Order pizza," "Lock the door."

But a true smart home needs a detective too. It needs to look around and say, "Hey, wait a minute. The window is open, it's raining, and the AC is on. That doesn't make sense!"

The authors of this paper realized that while AI is great at following instructions, it's terrible at spotting these weird, dangerous, or wasteful situations. So, they built a test called SmartBench to see how good the AI really is at being a detective.

The Test: A "Find the Glitch" Game

To test the AI, the researchers built a massive dataset of 4,400 different home scenarios, structured like the levels of a video game. They split the test into two levels:

  1. The "Snapshot" Level (Context-Independent):
    Imagine taking a single photo of your entire house at 3:00 PM. The AI has to look at the photo and say, "Is anything weird here?"

    • Example: The photo shows the heater is ON and the AC is ON.
    • The AI's Job: Spot the conflict immediately.
  2. The "Movie" Level (Context-Dependent):
    Imagine watching a 10-minute video of the house. The AI has to watch the story unfold.

    • Example: The video shows you leaving the house at 8:00 AM. At 8:05 AM, the front door unlocks. At 8:10 AM, the kitchen faucet turns on and stays on.
    • The AI's Job: Connect the dots. "The house is empty, but the door opened and water is running. That's a problem!"
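To make the two levels concrete, here is a minimal sketch (not from the paper; the device names, rules, and event format are illustrative assumptions) of what a hand-written "detective" would check at each level: a snapshot rule looks for conflicts in a single frozen state, while a timeline rule has to carry context forward through a log of events.

```python
# Illustrative sketch only: these rules and device names are NOT from SmartBench;
# they just mirror the paper's two anomaly levels as simple Python checks.

def snapshot_anomalies(state):
    """Context-independent check: flag conflicts visible in one snapshot."""
    issues = []
    if state.get("heater") == "on" and state.get("ac") == "on":
        issues.append("heater and AC are both running")
    if state.get("window") == "open" and state.get("ac") == "on":
        issues.append("AC is on while a window is open")
    return issues

def timeline_anomalies(events):
    """Context-dependent check: flag events that conflict with earlier context."""
    occupied = True  # assume someone is home until a 'presence' event says otherwise
    issues = []
    for time, device, value in events:
        if device == "presence":
            occupied = (value == "home")
        elif not occupied and (device, value) in (("door", "unlocked"), ("faucet", "on")):
            issues.append(f"{time}: {device} -> {value} while the house is empty")
    return issues

snapshot = {"heater": "on", "ac": "on", "window": "closed"}
events = [
    ("08:00", "presence", "away"),
    ("08:05", "door", "unlocked"),
    ("08:10", "faucet", "on"),
]
print(snapshot_anomalies(snapshot))
print(timeline_anomalies(events))
```

The point of the contrast: the snapshot rule needs no memory at all, but the timeline rule only works because `occupied` carries the 8:00 AM "away" event forward — exactly the kind of long-range bookkeeping the paper finds LLMs struggling with.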

The Results: The AI is Still a Rookie

The researchers tested 13 of the smartest AI models available (including big names like GPT-5, Claude, and Gemini) on this test. The results were... not great.

  • The "Blind Spot": Most AIs failed to spot the anomalies. They were like a butler who sees the door open but thinks, "Oh, that's fine," even though you told him you were away.
  • The "Confused Detective": Even when an AI did spot a problem, it often couldn't explain why. It might say, "Something is wrong," but fail to say, "The faucet is running because the door opened while the house was empty."
  • The "False Alarm" Problem: Some AIs were so paranoid they thought everything was broken. They would scream "FIRE!" when it was just a sunny day. This is bad because if your smart home cries wolf too often, you'll stop listening to it.

The Verdict: The best AI only got about 66% to 79% of the answers right. In the real world, if your security system is wrong 30% of the time, it's useless.

Why is this so hard? (The "Lost in the Middle" Effect)

Think of the "Movie Level" test like reading a 500-page novel and being asked, "What was the twist in chapter 12?"

AI models are great at reading the first page and the last page, but they often get lost in the middle. When the story of the house gets long (with hundreds of events happening over time), the AI forgets the beginning. By the time it reads that the faucet turned on, it has forgotten that you left the house at 8:00 AM.

What Does This Mean for You?

This paper isn't saying AI is useless. It's saying that we aren't ready to let AI run the house alone yet.

Right now, if you buy a smart home system, the AI is a great assistant for doing things (turning on lights), but it's a terrible guardian for watching things (spotting leaks or break-ins).

The Takeaway:
We need to build better "detective" brains for our AI butlers. Until then, don't rely on your smart home to tell you if you left the stove on. You still need to check it yourself!

SmartBench is like a report card for the future of smart homes, and right now, the AI is getting a "C-". It has potential, but it needs to study harder before it can be trusted to keep our homes safe.