Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning

This paper presents CoCo-TAMP, a hierarchical state estimation framework that leverages large language models to incorporate common-sense knowledge about object locations and co-occurrence, significantly reducing planning and execution time for robots operating in partially observable environments.

Yoonwoo Kim, Raghav Arora, Roberto Martín-Martín, Peter Stone, Ben Abbatematteo, Yoonchang Sung

Published 2026-03-05

Imagine you are a robot tasked with finding a specific item, like a toaster, in a house you've never seen before. The catch? You can't see the whole house at once. You can only see what's directly in front of you, and many things are hidden behind doors, inside cabinets, or simply out of view.

This is the problem of Partially Observable Task and Motion Planning (PO-TAMP). It's like playing a game of hide-and-seek where you have to plan your entire route before you even know where everyone is hiding. If you guess wrong, you waste time walking to the wrong room, get stuck, and have to start your plan over.

The paper introduces a new system called CoCo-TAMP (Co-Location Task and Motion Planning) that solves this by giving the robot a "common sense" brain upgrade using a Large Language Model (LLM)—the same kind of AI that powers modern chatbots.

Here is how CoCo-TAMP works, explained through simple analogies:

1. The "Gut Feeling" (Initial Beliefs)

Without help, a robot might think a toaster is just as likely to be in the bathroom as it is in the kitchen. It starts with a blank slate.

CoCo-TAMP asks the LLM: "Where is a toaster most likely to be?"
The LLM uses its training on human language and culture to say, "Almost certainly the kitchen, maybe the dining room, but definitely not the bathroom."

  • The Analogy: Imagine you are looking for your lost keys. A robot without common sense would check every single drawer in the house with equal probability. CoCo-TAMP is like a friend who says, "You always leave your keys on the kitchen counter or by the front door. Let's check there first." This "gut feeling" helps the robot skip useless searches immediately.
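The "gut feeling" step can be sketched as turning an LLM's plausibility scores into a probability distribution the robot searches in order. This is a minimal illustrative sketch, not the paper's actual algorithm: the scores below are hard-coded stand-ins for what a language model might answer when asked where a toaster belongs.

```python
def normalize(scores):
    """Turn raw plausibility scores into a probability distribution."""
    total = sum(scores.values())
    return {loc: s / total for loc, s in scores.items()}

# Hypothetical scores an LLM might assign to "Where is a toaster most likely to be?"
# In a real system these would come from querying the model, not a literal dict.
llm_scores = {
    "kitchen": 0.90,
    "dining_room": 0.25,
    "garage": 0.05,
    "bathroom": 0.01,
}

belief = normalize(llm_scores)

# The robot searches locations in order of belief, most likely first,
# skipping near-zero locations like the bathroom entirely.
search_order = sorted(belief, key=belief.get, reverse=True)
```

The key design point is that the robot never starts from a uniform "blank slate": the prior immediately rules the bathroom out of the first few search steps.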

2. The "Social Circle" (Co-Location)

The system also uses a second type of common sense: co-occurrence.
If the robot finds a coffee mug on the kitchen counter, it can guess that a coffee pot is likely nearby. If it finds a screwdriver, it might guess a hammer is in the same toolbox. Conversely, if it finds a toothbrush, it knows a toaster is probably not in that same spot.

  • The Analogy: Think of it like a high school cafeteria. If you see a group of football players sitting at a table, you can guess that the other football players are likely at that same table, while the chess club members are probably at a different table. CoCo-TAMP uses the "social circle" of objects to update its map. Finding one item instantly updates the robot's belief about where similar items might be hiding.
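The "social circle" update can be sketched as a simple Bayesian reweighting: spotting a related object at a location boosts the belief that the target shares that location. This is an illustrative sketch under assumed numbers, not CoCo-TAMP's exact update rule; the co-occurrence factor here is a made-up strength.

```python
def update_with_cooccurrence(belief, observed_location, cooccurrence):
    """Reweight a belief over locations after spotting a related object.

    `cooccurrence` is how much more (or less) likely the target is to share
    a location with the observed object: >1 boosts it, <1 suppresses it
    (e.g. a toothbrush sighting would suppress toaster locations).
    """
    weighted = {
        loc: p * (cooccurrence if loc == observed_location else 1.0)
        for loc, p in belief.items()
    }
    total = sum(weighted.values())
    return {loc: w / total for loc, w in weighted.items()}

# Searching for a coffee pot; the prior is split between two surfaces.
belief = {"kitchen_counter": 0.5, "dining_table": 0.5}

# A coffee mug is spotted on the counter: mugs and pots co-occur strongly,
# so we assume (hypothetically) a 4x boost for the shared location.
belief = update_with_cooccurrence(belief, "kitchen_counter", cooccurrence=4.0)
# The counter jumps from 0.5 to 0.8 probability.
```

One sighting of one object instantly reshapes the whole map, which is exactly the cafeteria intuition above.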

3. The "Smart Detective" (Hierarchical State Estimation)

The robot doesn't just guess; it keeps a running score (a "belief") of where things are.

  • Step 1: It uses the LLM to make an educated guess about the room and surface (e.g., "Kitchen, Counter").

  • Step 2: It moves to look. If it sees the object, great!

  • Step 3: If it doesn't see the object, it doesn't just give up. It asks: "Did I look hard enough? Was the view blocked?"

  • Step 4: If it finds a different object (say, a coffee mug), it uses the "Social Circle" rule to update its guess about the object it is actually looking for (the toaster).

  • The Analogy: This is like a detective solving a mystery.

    • Bad Detective: "I didn't see the suspect in the kitchen, so he must be in the garage." (Gives up too easily).
    • CoCo-TAMP Detective: "I didn't see the suspect in the kitchen, but the kitchen was dark and I only looked at the counter. Also, I just found his favorite hat in the living room. Since he loves his hat, he's probably in the living room too. Let's go there."
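The "did I look hard enough?" step can be sketched as a soft negative update: failing to see the target discounts a location by the chance the robot would have detected the object there, rather than zeroing it out. Again, this is a hedged sketch with invented numbers, not the paper's exact estimator.

```python
def update_after_negative(belief, searched, detect_prob):
    """Soften the belief after failing to see the target at `searched`.

    Instead of eliminating the location, keep (1 - detect_prob) of its
    probability mass: a dark room or a blocked view may have hidden the object.
    """
    weighted = {
        loc: p * ((1.0 - detect_prob) if loc == searched else 1.0)
        for loc, p in belief.items()
    }
    total = sum(weighted.values())
    return {loc: w / total for loc, w in weighted.items()}

belief = {"kitchen": 0.7, "dining_room": 0.2, "living_room": 0.1}

# The kitchen was dark and only the counter was visible: assume (hypothetically)
# only a 50% chance of spotting the target even if it was there.
belief = update_after_negative(belief, "kitchen", detect_prob=0.5)
# The kitchen is discounted but remains the leading hypothesis,
# unlike the "Bad Detective" who rules it out completely.
```

This is what separates the two detectives above: a hard negative update gives up on the kitchen after one dark glance; the soft update merely lowers its rank.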

Why is this a big deal?

The researchers tested this on a real robot (a Toyota HSR) and in massive computer simulations.

  • Without CoCo-TAMP: The robot wanders around aimlessly, checking the wrong rooms, getting confused, and having to restart its plan many times. It's slow and frustrating.
  • With CoCo-TAMP: The robot acts like an experienced human. It knows where to look first and uses clues from one object to find another.

The Results:

  • In simulations, it completed its tasks 62% faster.
  • On the real robot, it was 72% faster.

The Bottom Line

CoCo-TAMP teaches robots to stop thinking like blind machines and start thinking like humans who understand the world. By combining a robot's ability to move with an AI's ability to "know" how the world works (where things belong and what goes with what), robots can solve complex tasks much faster and with fewer mistakes.

It's the difference between a robot that blindly searches every closet in a house, and a robot that walks straight to the kitchen because it knows that's where the toaster lives.