Act-Observe-Rewrite: Multimodal Coding Agents as In-Context Policy Learners for Robot Manipulation

This paper introduces Act-Observe-Rewrite (AOR), a framework enabling multimodal language models to iteratively improve robot manipulation policies by synthesizing and rewriting executable Python controller code based on visual feedback and failure analysis, achieving high success rates across tasks without demonstrations, reward engineering, or gradient updates.

Vaishak Kumar

Published 2026-03-06

Imagine you are teaching a robot to build a tower of blocks. In the old days, you had two main options:

  1. The "Drill Sergeant" Method: You show the robot a video of a human doing it perfectly 10,000 times. The robot memorizes the movements by rote. If you change the lighting or the table, the robot gets confused and fails.
  2. The "Trial and Error" Method: You let the robot try, fail, and try again, but you have to manually tweak its "brain" (its neural network) after every mistake. This takes a long time and requires a supercomputer.

This paper introduces a third way: The "Self-Taught Programmer."

The authors call this Act–Observe–Rewrite (AOR). Here is how it works, explained with a simple story and some analogies.

The Story: The Robot and the "Ghost Writer"

Imagine a robot arm (the Actor) trying to pick up a red cube and put it on a table.

  1. Act: The robot tries to do the task. It moves its arm, grabs the cube, and places it.
  2. Observe: The robot fails. Maybe it missed the cube, or it dropped it. But instead of just saying "I failed," a special AI assistant (a Multimodal LLM) watches the video of the failure. It looks at the robot's code (the instructions telling the robot how to move) and the video of the mistake side-by-side.
  3. Rewrite: The AI assistant doesn't just say, "Try harder." It acts like a Ghost Writer for the robot's brain. It opens the robot's source code, finds the exact line causing the error, and rewrites it.

Then the robot runs this new code and tries again. No human touched the code. No massive training data was used. The robot literally wrote a better version of itself after every failure.
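The three-step story above is, at its core, a simple loop. Here is a minimal sketch of that loop in Python. The paper doesn't publish this exact interface, so `run_episode` (execute the controller on the robot, return success plus a failure video) and `rewrite` (query the multimodal LLM with the code and the video) are stand-in placeholders you'd wire up yourself:

```python
from typing import Callable, Tuple

def act_observe_rewrite(
    initial_code: str,
    run_episode: Callable[[str], Tuple[bool, object]],
    rewrite: Callable[[str, object], str],
    max_attempts: int = 10,
) -> str:
    """Run a controller; on failure, let an editor rewrite it and retry.

    run_episode and rewrite are injected stand-ins for the robot
    executor and the multimodal LLM described in the paper.
    """
    code = initial_code
    for _ in range(max_attempts):
        success, video = run_episode(code)  # Act: execute the current controller
        if success:
            return code                     # Observe: it worked, stop editing
        code = rewrite(code, video)         # Rewrite: patch the code from the video
    return code                             # best effort once the budget runs out
```

Note that nothing in the loop is a gradient update or a reward signal; the only thing that changes between attempts is the source code itself.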

The Magic Analogy: The "Code vs. The Dial"

To understand why this is special, imagine a radio.

  • Old Methods (Parameter Tuning): Imagine the robot is a radio with a dial. If the signal is bad, you just turn the dial slightly left or right. You are guessing. If the radio is broken because the antenna is in the wrong place, turning the dial won't help.
  • This Paper's Method (Code Rewriting): Imagine the robot is a radio, but the AI assistant is an engineer. If the signal is bad, the engineer doesn't just turn a dial. They open the radio, look at the circuit board, and realize, "Ah, the antenna is wired backward!" They solder a new wire (rewrite the code) to fix the root cause.

Why is this a big deal?
Most robot learning methods only "turn the dial." They tweak numbers. But this paper shows that if you let the AI rewrite the actual instructions, it can fix deep, structural problems that numbers can't solve.

The Three Challenges They Solved

The researchers tested this on three tasks, and the "Ghost Writer" solved them in very human-like ways:

  1. The "Wrong Map" Problem (Lift Task):

    • The Mistake: The robot kept hovering 8 inches above the cube. It thought the cube was floating in the air.
    • The Old Way: You'd have to guess, "Maybe the camera is too sensitive?"
    • The AOR Way: The AI looked at the code and the video. It said, "The camera uses a different coordinate system (like a map where North is Down). The code is reading the map upside down." It rewrote the math to flip the image. Success.
  2. The "Colorblind" Problem (Pick & Place Can):

    • The Mistake: The robot was looking for a "silver" soda can, but in the camera's view, the can looked red. The robot couldn't find it.
    • The AOR Way: The AI saw the robot staring at empty space. It looked at the code: search for silver. It realized, "Wait, the lighting makes it look red!" It changed the code to search for red. Success.
  3. The "Clumsy Hand" Problem (Stack Task):

    • The Mistake: The robot could pick up the first cube, but when it tried to stack a second one, its fingers kept bumping the bottom cube and knocking it over.
    • The AOR Way: The AI saw the bump in the video. It rewrote the approach path to be more careful. It got the robot to 91% success.
    • The Limit: The robot eventually got stuck. It knew why it was failing (it was bumping the block), but it couldn't figure out how to move its fingers differently to avoid it. It hit a wall. This shows the system isn't perfect yet, but it's honest about its limits.
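To make the "Wrong Map" fix concrete, here is a hypothetical before/after of the kind of one-line structural rewrite described above. The function names, the camera model, and the intrinsics dictionary are illustrative assumptions, not the paper's actual controller code; the point is that the fix is a sign flip in the math, not a tuned parameter:

```python
def cube_position_buggy(pixel_uv, depth, intrinsics):
    # Buggy version: treats the image's y axis as pointing up,
    # so the estimated cube height is flipped and the arm hovers
    # above where the cube really is.
    x = (pixel_uv[0] - intrinsics["cx"]) * depth / intrinsics["fx"]
    y = (pixel_uv[1] - intrinsics["cy"]) * depth / intrinsics["fy"]
    return (x, y, depth)

def cube_position_fixed(pixel_uv, depth, intrinsics):
    # Rewritten version: image coordinates grow downward, so the
    # y term must be negated. This is the kind of root-cause fix
    # that "turning the dial" on a parameter could never express.
    x = (pixel_uv[0] - intrinsics["cx"]) * depth / intrinsics["fx"]
    y = -(pixel_uv[1] - intrinsics["cy"]) * depth / intrinsics["fy"]
    return (x, y, depth)
```

The "Colorblind" fix has the same flavor: a rewrite might change a literal like `target_color = "silver"` to `target_color = "red"` after seeing the lighting in the video, something no amount of numeric tuning would discover.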

The "No-Go" Zone (What Makes This Unique)

Usually, to get a robot to learn, you need:

  • Thousands of videos of humans doing the task.
  • Complex math to reward the robot for good moves.
  • Supercomputers to train the brain.

AOR needs none of that.

  • No Videos: It learns from its own mistakes.
  • No Rewards: It doesn't need a score; it just needs to see the code didn't work.
  • No Training: It doesn't "learn" in the background; it literally rewrites its own software between attempts.

The Bottom Line

This paper proposes a new way to build robots: Don't train the robot to be smart; give it a smart editor that rewrites its own manual.

It's like giving a robot a "Ctrl+Z" (Undo) button, but instead of just undoing the move, it rewrites the instructions so the mistake never happens again. It turns robot learning from a "black box" (where we don't know why it works) into a transparent process where we can read the code and say, "Ah, that's why it failed, and here is the fix."

While it's not perfect yet (it sometimes gets stuck on very tricky physical problems), it proves that robots can learn to fix their own code just by watching themselves fail and thinking about it.