Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

Imagine you are teaching a robot to clean a messy house. In the past, if you told the robot, "Put the toy car in the green box," and it tried to shove a giant teddy bear in there first, the robot would get stuck. It would say, "Oh no, the bear is in the way!" and then just try the exact same mistake again and again, forever. It had no "memory" of what went wrong, only a rigid set of instructions.

This paper introduces a new way to teach robots called Reflective Test-Time Planning. Think of it as giving the robot a more human-like way of working: it can pause, think, learn from mistakes, and change its approach on the fly while it is still on the job.

Here is how it works, broken down into three simple concepts:

1. The "Mental Sandbox" (Reflection-in-Action)

Before the robot actually moves its arm, it doesn't just pick the first idea that pops into its head. Instead, it runs a mental simulation.

The Analogy: Imagine you are packing for a trip. Instead of just shoving your biggest suitcase into the car trunk, you pause and think: "If I put the big suitcase here, will I be able to fit the golf clubs later? Maybe I should put the golf clubs in first."
How the Robot Does It: The robot generates several different ideas (e.g., "Put the car in the green box," "Put the car in the orange box," "Put the car on the shelf"). It uses a "judge" inside its brain to score each idea. It asks, "If I do this, will it work?" It picks the highest-scoring idea before it ever touches anything. This prevents it from making obvious mistakes right out of the gate.

2. The "Post-Game Review" (Reflection-on-Action)

Once the robot tries an action, it doesn't just move on. It stops and asks, "How did that actually go?"

The Analogy: Think of a coach watching a soccer player miss a penalty kick. The coach doesn't just say, "Okay, next time." The coach says, "You kicked too hard, and you aimed at the wrong corner. Next time, aim for the bottom left."
How the Robot Does It: After the robot tries to put an object in a box, it gets a "score" and a verbal explanation of why it succeeded or failed. It stores this lesson. If it tried to put a toy car in a box that was too small, it learns: "Oh, that box is too small. I won't try that again."

3. The "Hindsight Lookback" (Retro-Reflection)

This is the magic part. Sometimes, a mistake doesn't show up immediately. You might do something that seems fine at first, but it causes a disaster three steps later.

The Analogy: Imagine you are playing a video game. You pick up a shiny sword early on because it looks cool. Three levels later, you realize that sword is so heavy you can't jump over a wall, and you're stuck. A normal player would just keep trying to jump and fail. A reflective player looks back and says, "Wait, if I hadn't picked up that heavy sword, I would have made it. I need to change my strategy."
How the Robot Does It: The robot periodically looks back at its recent history. It asks, "Looking at where I am now, was that decision I made five minutes ago actually a good idea?" If the answer is "No," it re-evaluates that old decision and updates its brain to avoid that specific mistake in the future.

The Big Result: Learning While Doing

Most robots are like frozen statues: they are trained once, and then they just act out what they learned, even if they fail. If they fail, they fail the same way every time.

This new method turns the robot into a fluid learner. It is like a student taking a test who is allowed to:

Think of three answers before writing one down.
Check their work immediately after writing.
Realize, halfway through the test, that they misunderstood the first question, and adjust their strategy for the rest of the exam.

In short: This paper teaches robots to stop repeating their mistakes. By giving them the ability to simulate, critique, and look back at their own actions while they are working, they can solve complex, messy real-world problems much better than before. They don't just "do"; they "learn how to do" as they go.

1. Problem Statement

Embodied Large Language Models (LLMs) possess high-level task reasoning capabilities but suffer from a critical limitation: they act as static oracles. Once deployed, they cannot learn from failures or adapt their decision-making processes in real-time.

The Issue: Current approaches treat deployment as a sequence of independent trials. When an agent makes a mistake, it repeats the same error in subsequent attempts because it lacks a mechanism to update its internal beliefs or action policies based on execution outcomes.
The Gap: Existing methods rely either on verbal reflection (storing text critiques that do not update model parameters, which makes them brittle under distribution shifts) or on internal world models (fixed dynamics models that may be inaccurate). Neither approach effectively combines pre-action simulation with post-execution learning to enable true "double-loop learning" (revising the underlying reasoning and assumptions, not just the next action).

2. Methodology: Reflective Test-Time Planning

The authors propose Reflective Test-Time Planning, a framework that unifies two modes of human-like reflection during the deployment phase: Reflection-in-Action and Reflection-on-Action. The system utilizes three interacting LLMs: an Action Generator ( $\pi_\theta$ ), an Internal Evaluator ( $V_{\phi_i}$ ), and an External Evaluator ( $V_{\phi_e}$ ).

A. Reflection-in-Action (Pre-Execution)

This mode addresses uncertainty by simulating potential outcomes before acting.

Process: Instead of greedily selecting the first plausible action, the agent samples $N$ diverse candidate actions using high-temperature sampling.
Internal Scoring: The Internal Evaluator ( $V_{\phi_i}$ ) generates a natural language critique and a numerical score (0–100) for each candidate based on an "internal simulation" of the environment.
Selection: The agent executes the candidate with the highest score. This allows the agent to "mentally try out" options and avoid obvious pitfalls before physical execution.
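The sample, score, and select steps above can be sketched as a best-of-N loop. This is a minimal illustration, not the authors' implementation: the names `reflect_in_action`, `toy_propose`, and `toy_evaluate`, the fixed action list, and the score table are all hypothetical stand-ins for the LLM-based Action Generator and Internal Evaluator.

```python
import random
from dataclasses import dataclass


@dataclass
class ScoredAction:
    action: str
    score: float    # 0-100, matching the score range described above
    critique: str   # natural-language rationale from the evaluator


def reflect_in_action(observation, propose, evaluate, n, rng):
    """Sample n candidates, score each one, and return the best plus all scores."""
    candidates = [propose(observation, rng) for _ in range(n)]
    scored = []
    for action in candidates:
        score, critique = evaluate(observation, action)
        scored.append(ScoredAction(action, score, critique))
    best = max(scored, key=lambda s: s.score)
    return best, scored


# --- Toy stand-ins (assumptions; a real system would query LLMs here) ---
ACTIONS = ["put car in green box", "put car in orange box", "put car on shelf"]
TOY_SCORES = {ACTIONS[0]: 35.0, ACTIONS[1]: 80.0, ACTIONS[2]: 55.0}


def toy_propose(observation, rng):
    # Stands in for high-temperature sampling from the action generator.
    return rng.choice(ACTIONS)


def toy_evaluate(observation, action):
    # Stands in for the internal evaluator's critique and score.
    score = TOY_SCORES[action]
    return score, f"predicted score {score:.0f} for '{action}'"


rng = random.Random(0)
best, scored = reflect_in_action(
    "toy car on floor", toy_propose, toy_evaluate, n=8, rng=rng
)
print(best.action, best.score)
```

Because selection is a plain argmax over evaluator scores, the quality of the chosen action depends entirely on how well the evaluator's predictions match reality, which is what the post-execution updates in the next section are meant to improve.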

B. Reflection-on-Action (Post-Execution)

This mode grounds the agent in reality by learning from actual execution outcomes.

External Reflection: After an action is executed, the External Evaluator ( $V_{\phi_e}$ ) analyzes the outcome (success/failure and visual feedback) to generate a critique and score.
Retrospective Reflection: To solve the temporal credit assignment problem (where an action seems successful initially but causes failure later), the system periodically re-evaluates past decisions with hindsight (e.g., at room transitions). This identifies if an early "good" move blocked future progress.
Test-Time Training: The verbal reflections and scores serve as self-supervised signals to update the models during deployment:
1. Internal Model Update: The Internal Evaluator ( $V_{\phi_i}$ ) is updated via Supervised Learning to align its pre-action predictions with the hindsight-corrected external reflections.
2. Policy Update: The Action Generator ( $\pi_\theta$ ) is updated via Policy Gradient (REINFORCE), using the retrospective scores as rewards to favor actions that lead to long-term success.
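A toy version of the two updates may help make the mechanics concrete. The sketch below is an illustration under strong assumptions, not the paper's training code: a three-way softmax over fixed actions stands in for the LLM policy, a single scalar stands in for the Internal Evaluator's prediction, scores are rescaled to [0, 1], and the baseline, learning rates, and the `hindsight` relabeling dictionary are all hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action_index, reward, baseline, lr):
    """One REINFORCE update for a softmax policy over a fixed action set.

    Uses d/d(logit_k) log pi(a) = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        logit + lr * advantage * ((1.0 if k == action_index else 0.0) - p)
        for k, (logit, p) in enumerate(zip(logits, probs))
    ]


def evaluator_step(predicted, target, lr):
    """One gradient step on 0.5 * (predicted - target)**2 (supervised stand-in)."""
    return predicted - lr * (predicted - target)


# Executed steps as (action_index, immediate external score in [0, 1]).
history = [(0, 0.9), (1, 0.6)]
# Retrospective reflection: step 0 looked fine at first but blocked later progress.
hindsight = {0: 0.1}  # step index -> hindsight-corrected score (assumed values)

logits = [0.0, 0.0, 0.0]
for step, (action_index, immediate_score) in enumerate(history):
    reward = hindsight.get(step, immediate_score)  # prefer the corrected score
    logits = reinforce_step(logits, action_index, reward, baseline=0.5, lr=1.0)

probs = softmax(logits)
print([round(p, 3) for p in probs])

# The evaluator's optimistic pre-action prediction is pulled toward hindsight.
new_prediction = evaluator_step(predicted=0.9, target=0.1, lr=0.5)
print(round(new_prediction, 3))
```

In this toy run the action that hindsight marked as harmful ends up with less than its initial one-third probability, while the action with an above-baseline score gains probability. The same direction of change is what the policy-gradient update is meant to produce in the actual LLM policy.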

3. Key Contributions

Unified Framework: Presented as the first framework to integrate reflection-in-action (simulation) and reflection-on-action (learning from execution) for embodied agents at test time.
Double-Loop Learning: Moves beyond single-loop learning (optimizing actions) to double-loop learning (updating the underlying reasoning process and predictive assumptions).
Retrospective Reflection: Introduces a mechanism to re-evaluate past decisions with hindsight, enabling the agent to correct long-horizon planning errors that immediate feedback misses.
Self-Supervised Test-Time Training: Demonstrates that agents can update their own parameters (via LoRA or full weights) using verbal reflections generated by the agent itself, without human intervention or new labeled data.

4. Experimental Results

The framework was evaluated on two newly designed benchmarks:

Long-Horizon Household Tasks (BEHAVIOR-1K): Complex multi-room tasks involving fitting, selection, and preparation.
MuJoCo Cupboard Fitting: A controlled geometric task requiring precise placement of objects into compartments.

Key Findings:

Performance Gains: The full model achieved a 44.7% success rate on Fitting tasks, significantly outperforming baselines like PPO (0%), DreamerV3 (10.6%), and verbal reflection methods (ReflectVLM at 2.1%).
Ablation Studies:
- Removing either Reflection-in-Action (RIA) or Reflection-on-Action (ROA) caused significant performance drops, indicating that the two modes complement each other rather than substitute for one another.
- Updating both the action policy and the internal reflection model is crucial; updating only one leads to suboptimal results.
Generalization: The method successfully generalized to the Habitat-Matterport 3D (HM3D) dataset (real-world photorealistic scenes) despite being trained on synthetic data, achieving a 19.5% success rate compared to 0% for several baselines.
Real-Robot Validation: Qualitative trials on a physical Franka Panda robot demonstrated the ability to recover from execution failures and correct earlier decisions.
Efficiency: The method incurs a ~3x computational overhead due to candidate sampling and training. However, a "compute-matched" experiment showed that simply giving baselines more time (3x steps) did not improve their performance, which suggests that the gains come from learning rather than from additional rollouts alone.

5. Significance

This paper argues for moving embodied AI deployment from static inference to adaptive learning.

Robustness: It enables robots to recover from mistakes and adapt to novel environments without retraining from scratch.
Interpretability: By using verbal reflections as the learning signal, the decision-making process remains interpretable, allowing humans to understand why an agent changed its strategy.
Scalability: The use of LoRA for test-time training makes the approach parameter-efficient, suggesting a viable path for deploying self-improving robots in unstructured, real-world environments where failure is inevitable.

In summary, the authors demonstrate that by mimicking human reflective practices (simulating before acting and learning from hindsight), embodied LLMs can turn a sequence of repeated failures into cumulative experience, significantly enhancing their ability to solve complex, long-horizon tasks.
