Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning

This paper presents CoCo-TAMP, a hierarchical state estimation framework that leverages large language models to incorporate common-sense knowledge about object locations and co-occurrence, significantly reducing planning and execution time for robots operating in partially observable environments.

Yoonwoo Kim, Raghav Arora, Roberto Martín-Martín, Peter Stone, Ben Abbatematteo, Yoonchang Sung

Published 2026-03-05

Imagine you are a robot tasked with finding a specific item, like a toaster, in a house you've never seen before. The catch? You can't see the whole house at once. You can only see what's directly in front of you, and many things are hidden behind doors, inside cabinets, or simply out of view.

This is the problem of Partially Observable Task and Motion Planning (PO-TAMP). It's like playing a game of hide-and-seek where you have to plan your entire route before you even know where everyone is hiding. If you guess wrong, you waste time walking to the wrong room, get stuck, and have to start your plan over.

The paper introduces a new system called CoCo-TAMP (Co-Location Task and Motion Planning) that solves this by giving the robot a "common sense" brain upgrade using a Large Language Model (LLM)—the same kind of AI that powers modern chatbots.

Here is how CoCo-TAMP works, explained through simple analogies:

1. The "Gut Feeling" (Initial Beliefs)

Without help, a robot might think a toaster is just as likely to be in the bathroom as it is in the kitchen. It starts with a blank slate.

CoCo-TAMP asks the LLM: "Where is a toaster most likely to be?"
The LLM uses its training on human language and culture to say, "Almost certainly the kitchen, maybe the dining room, but definitely not the bathroom."

  • The Analogy: Imagine you are looking for your lost keys. A robot without common sense would check every single drawer in the house with equal probability. CoCo-TAMP is like a friend who says, "You always leave your keys on the kitchen counter or by the front door. Let's check there first." This "gut feeling" helps the robot skip useless searches immediately.
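The "gut feeling" step can be sketched as turning an LLM's plausibility scores into a probability distribution the robot searches in order. This is a minimal illustrative sketch, not the paper's actual algorithm: the scores below are hard-coded stand-ins for what a language model might answer when asked where a toaster belongs.

```python
def normalize(scores):
    """Turn raw plausibility scores into a probability distribution."""
    total = sum(scores.values())
    return {loc: s / total for loc, s in scores.items()}

# Hypothetical scores an LLM might assign to "Where is a toaster most likely to be?"
# In a real system these would come from querying the model, not a literal dict.
llm_scores = {
    "kitchen": 0.90,
    "dining_room": 0.25,
    "garage": 0.05,
    "bathroom": 0.01,
}

belief = normalize(llm_scores)

# The robot searches locations in order of belief, most likely first,
# skipping near-zero locations like the bathroom entirely.
search_order = sorted(belief, key=belief.get, reverse=True)
```

The key design point is that the robot never starts from a uniform "blank slate": the prior immediately rules the bathroom out of the first few search steps.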

2. The "Social Circle" (Co-Location)

The system also uses a second type of common sense: co-occurrence.
If the robot finds a coffee mug on the kitchen counter, it can guess that a coffee pot is likely nearby. If it finds a screwdriver, it might guess a hammer is in the same toolbox. Conversely, if it finds a toothbrush, it knows a toaster is probably not in that same spot.

  • The Analogy: Think of it like a high school cafeteria. If you see a group of football players sitting at a table, you can guess that the other football players are likely at that same table, while the chess club members are probably at a different table. CoCo-TAMP uses the "social circle" of objects to update its map. Finding one item instantly updates the robot's belief about where similar items might be hiding.
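The "social circle" update can be sketched as a simple Bayesian reweighting: spotting a related object at a location boosts the belief that the target shares that location. This is an illustrative sketch under assumed numbers, not CoCo-TAMP's exact update rule; the co-occurrence factor here is a made-up strength.

```python
def update_with_cooccurrence(belief, observed_location, cooccurrence):
    """Reweight a belief over locations after spotting a related object.

    `cooccurrence` is how much more (or less) likely the target is to share
    a location with the observed object: >1 boosts it, <1 suppresses it
    (e.g. a toothbrush sighting would suppress toaster locations).
    """
    weighted = {
        loc: p * (cooccurrence if loc == observed_location else 1.0)
        for loc, p in belief.items()
    }
    total = sum(weighted.values())
    return {loc: w / total for loc, w in weighted.items()}

# Searching for a coffee pot; the prior is split between two surfaces.
belief = {"kitchen_counter": 0.5, "dining_table": 0.5}

# A coffee mug is spotted on the counter: mugs and pots co-occur strongly,
# so we assume (hypothetically) a 4x boost for the shared location.
belief = update_with_cooccurrence(belief, "kitchen_counter", cooccurrence=4.0)
# The counter jumps from 0.5 to 0.8 probability.
```

One sighting of one object instantly reshapes the whole map, which is exactly the cafeteria intuition above.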

3. The "Smart Detective" (Hierarchical State Estimation)

The robot doesn't just guess; it keeps a running score (a "belief") of where things are.

  • Step 1: It uses the LLM to make an educated guess about the room and surface (e.g., "Kitchen, Counter").

  • Step 2: It moves to look. If it sees the object, great!

  • Step 3: If it doesn't see the object, it doesn't just give up. It asks: "Did I look hard enough? Was the view blocked?"

  • Step 4: If it finds a different object (say, a coffee mug), it uses the "Social Circle" rule to update its guess about the object it is actually looking for (the toaster).

  • The Analogy: This is like a detective solving a mystery.

    • Bad Detective: "I didn't see the suspect in the kitchen, so he must be in the garage." (Gives up too easily).
    • CoCo-TAMP Detective: "I didn't see the suspect in the kitchen, but the kitchen was dark and I only looked at the counter. Also, I just found his favorite hat in the living room. Since he loves his hat, he's probably in the living room too. Let's go there."
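The "did I look hard enough?" step can be sketched as a soft negative update: failing to see the target discounts a location by the chance the robot would have detected the object there, rather than zeroing it out. Again, this is a hedged sketch with invented numbers, not the paper's exact estimator.

```python
def update_after_negative(belief, searched, detect_prob):
    """Soften the belief after failing to see the target at `searched`.

    Instead of eliminating the location, keep (1 - detect_prob) of its
    probability mass: a dark room or a blocked view may have hidden the object.
    """
    weighted = {
        loc: p * ((1.0 - detect_prob) if loc == searched else 1.0)
        for loc, p in belief.items()
    }
    total = sum(weighted.values())
    return {loc: w / total for loc, w in weighted.items()}

belief = {"kitchen": 0.7, "dining_room": 0.2, "living_room": 0.1}

# The kitchen was dark and only the counter was visible: assume (hypothetically)
# only a 50% chance of spotting the target even if it was there.
belief = update_after_negative(belief, "kitchen", detect_prob=0.5)
# The kitchen is discounted but remains the leading hypothesis,
# unlike the "Bad Detective" who rules it out completely.
```

This is what separates the two detectives above: a hard negative update gives up on the kitchen after one dark glance; the soft update merely lowers its rank.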

Why is this a big deal?

The researchers tested this on a real robot (a Toyota HSR) and in massive computer simulations.

  • Without CoCo-TAMP: The robot wanders around aimlessly, checking the wrong rooms, getting confused, and having to restart its plan many times. It's slow and frustrating.
  • With CoCo-TAMP: The robot acts like an experienced human. It knows where to look first and uses clues from one object to find another.

The Results:

  • In simulations, it completed its tasks 62% faster.
  • On the real robot, it was 72% faster.

The Bottom Line

CoCo-TAMP teaches robots to stop thinking like blind machines and start thinking like humans who understand the world. By combining a robot's ability to move with an AI's ability to "know" how the world works (where things belong and what goes with what), robots can solve complex tasks much faster and with fewer mistakes.

It's the difference between a robot that blindly searches every closet in a house, and a robot that walks straight to the kitchen because it knows that's where the toaster lives.