REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

This paper introduces REI-Bench, the first benchmark for evaluating robot task planning under vague referring expressions, revealing that such vagueness significantly degrades performance and demonstrating that a task-oriented context cognition approach effectively mitigates this issue to improve accessibility for non-expert users.

Chenxi Jiang, Chuhao Zhou, Jianfei Yang

Published Thu, 12 Ma

What follows explains the REI-Bench paper in simple, everyday language, with a few analogies along the way.

The Big Problem: The Robot's "Mind Reading" Struggle

Imagine you have a super-smart robot butler. You tell it, "Please move the heavy stuff outside."

If you are standing in a kitchen next to a giant pot you were just talking about, a human understands instantly: "Oh, they mean the pot." But for a robot, "heavy stuff" is a nightmare. Is it the pot? The bag of flour? The cast-iron pan?

The paper argues that while robots are getting great at following clear instructions (like "Move the pot"), they are terrible at following vague instructions (like "Move it" or "Move the heavy stuff"). This is a huge problem because real humans, especially the elderly, children, or people in a hurry, don't speak like robots. They use shortcuts, pronouns, and descriptions that rely on context.

The Solution: A New "Gym" for Robots (REI-Bench)

To measure the problem, the researchers built a new testing ground called REI-Bench. Think of this as a "gym" for robot brains, but instead of lifting weights, the robots have to solve puzzles involving vague language.

They created a dataset of 2,700 scenarios based on real-life conversations. They tested the robots in three different "difficulty modes":

  1. The "Clear" Mode: The human says, "Move the pot." (Easy peasy).
  2. The "Mixed" Mode: The human says, "Move the pot," but then later says, "Now move it." (The robot has to remember what "it" refers to).
  3. The "Vague & Distracting" Mode: The human says, "Move the heavy thing," while the conversation is full of noise, like mentioning a person named "Apple" (who isn't a fruit) or talking about a "heavy" book that isn't the target.
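To make the three modes concrete, here is a minimal Python sketch of how such scenarios could be stored and scored. Everything in it is an assumption made for illustration: the names `Mode`, `Scenario`, `success_rate_by_mode`, and `keyword_picker` are hypothetical, the three scenarios are made up from the examples above, and the scoring only checks whether the right object was chosen, which is a simplification of evaluating a full task plan.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Mode(Enum):
    CLEAR = "clear"
    MIXED = "mixed"
    VAGUE_DISTRACTING = "vague_distracting"


@dataclass(frozen=True)
class Scenario:
    mode: Mode
    dialogue: List[str]  # earlier conversation turns (the context)
    instruction: str     # the final request given to the robot
    target: str          # the object the human actually means


def success_rate_by_mode(
    scenarios: List[Scenario],
    pick_object: Callable[[List[str], str], str],
) -> Dict[Mode, float]:
    """Fraction of scenarios per mode where the picker chose the target."""
    totals: Dict[Mode, int] = {}
    hits: Dict[Mode, int] = {}
    for s in scenarios:
        totals[s.mode] = totals.get(s.mode, 0) + 1
        if pick_object(s.dialogue, s.instruction) == s.target:
            hits[s.mode] = hits.get(s.mode, 0) + 1
    return {mode: hits.get(mode, 0) / totals[mode] for mode in totals}


def keyword_picker(dialogue: List[str], instruction: str) -> str:
    """Toy baseline (not a real planner): ignores the dialogue and returns
    the first known object named literally in the instruction."""
    for obj in ("pot", "plate", "book"):
        if obj in instruction.lower():
            return obj
    return "unknown"


# Made-up scenarios, one per mode, for illustration only.
SCENARIOS = [
    Scenario(Mode.CLEAR, [], "Move the pot.", "pot"),
    Scenario(Mode.MIXED, ["Move the pot."], "Now move it.", "pot"),
    Scenario(
        Mode.VAGUE_DISTRACTING,
        ["Apple left a heavy book on the shelf.", "The pot is full of soup."],
        "Move the heavy thing.",
        "pot",
    ),
]

rates = success_rate_by_mode(SCENARIOS, keyword_picker)
for mode, rate in rates.items():
    print(f"{mode.value}: {rate:.0%}")
```

The toy picker only succeeds when the object is named outright, so it passes the clear scenario and misses the other two. That is the same pattern the benchmark is designed to expose, though real planners fail in subtler ways.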

The Result: When the instructions got vague, the robots' success rate crashed. Some failed 37% more often than when the instructions were clear. They started grabbing the wrong items, like picking up a plate instead of the pot because they couldn't figure out what "the heated one" meant.

Why Do Robots Fail? (The "Distraction" Analogy)

The researchers discovered that the robots aren't "dumb"; they just get distracted.

Imagine a student taking a math test.

  • Clear Instruction: "Solve for X." The student focuses on the math.
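  • Vague Instruction: "Solve for the thing that makes the answer happy." The student burns the clock working out what is being asked and has little attention left for the math.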

The robot's brain (the Large Language Model) tries to do two things at once:

  1. Understand the language (Figure out what "it" means).
  2. Plan the actions (Pick up, move, put down).

When the language is vague, the robot gets so stuck trying to figure out the meaning that it forgets how to plan the actions. It's like a driver trying to read a map while driving; they get confused and crash. The robot spends all its "brain power" guessing the word and runs out of power to actually move the object.

The Fix: "The Translator" (TOCC)

The paper proposes a clever, simple fix called TOCC (Task-Oriented Context Cognition).

Instead of asking the robot to "Guess the meaning AND plan the move" at the same time, TOCC splits the job into two steps, like a Translator and a Manager.

  1. Step 1: The Translator (Cognition): The robot first acts as a translator. It looks at the vague instruction ("Move the heavy stuff") and the conversation history, then rewrites it into a crystal-clear command: "Move the pot."
  2. Step 2: The Manager (Planning): Now, the robot takes this clear command and simply plans the moves. No guessing, no confusion.
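The two steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: in the paper a language model does both jobs, whereas here Step 1 is replaced by a simple "most recently mentioned object" rule and Step 2 by a fixed action template. The names `rewrite_instruction`, `plan_actions`, `tocc`, and `VAGUE_PHRASES` are all hypothetical.

```python
import re
from typing import List

# Assumed list of vague references this toy can resolve (longest first).
VAGUE_PHRASES = ("the heavy stuff", "the heavy thing", "it")


def _mentions(text: str, obj: str) -> bool:
    """True if obj appears as a whole word in text (case-insensitive)."""
    return re.search(rf"\b{re.escape(obj)}\b", text.lower()) is not None


def rewrite_instruction(
    dialogue: List[str], instruction: str, known_objects: List[str]
) -> str:
    """Step 1 (Translator): swap a vague reference for a concrete object.

    Stand-in heuristic: use the known object mentioned most recently in
    the dialogue. A real system would ask a language model instead.
    """
    referent = None
    for turn in reversed(dialogue):
        mentioned = [o for o in known_objects if _mentions(turn, o)]
        if mentioned:
            referent = mentioned[0]
            break
    if referent is None:
        return instruction  # nothing to resolve against
    for phrase in VAGUE_PHRASES:
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, instruction, flags=re.IGNORECASE):
            return re.sub(
                pattern, f"the {referent}", instruction, flags=re.IGNORECASE
            )
    return instruction  # already clear


def plan_actions(clear_instruction: str, known_objects: List[str]) -> List[str]:
    """Step 2 (Manager): plan from an instruction that names its object."""
    for obj in known_objects:
        if _mentions(clear_instruction, obj):
            return [f"pick up {obj}", f"move {obj}", f"put down {obj}"]
    raise ValueError(f"no known object in: {clear_instruction!r}")


def tocc(dialogue: List[str], instruction: str, known_objects: List[str]) -> List[str]:
    """Run the two steps in sequence: translate first, then plan."""
    clear = rewrite_instruction(dialogue, instruction, known_objects)
    return plan_actions(clear, known_objects)


dialogue = ["The pot is full of soup and weighs a ton."]
objects = ["pot", "plate"]
print(rewrite_instruction(dialogue, "Move the heavy stuff outside.", objects))
print(tocc(dialogue, "Move the heavy stuff outside.", objects))
```

The point of the split is that `plan_actions` never sees the vague wording. If it is handed an instruction with no recognizable object it fails loudly instead of guessing, which keeps reference errors from silently turning into wrong actions.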

The Analogy:
Think of it like a chef and a sous-chef.

  • Without TOCC: The chef tries to read a scribbled note from a customer ("Make the spicy red thing") while simultaneously chopping vegetables. They chop the wrong thing.
  • With TOCC: The sous-chef (Translator) reads the note, checks it against what the customer said earlier, and writes a clear ticket: "Make the Spicy Red Chili." The chef (Planner) then just follows the clear ticket perfectly.

The Takeaway

This paper teaches us that to make robots useful for real people (like grandma or a toddler), we can't just give them smarter brains. We have to teach them to translate human vagueness into clear instructions before they try to act.

By adding this "Translator" step, the researchers made the robots noticeably better at handling vague instructions, which suggests that separating "What do they mean?" from "What do I do?" is a practical way to help a robot understand what we really mean.