Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

Imagine you are teaching a robot to play a complex game of physical chess, but the rules of the game (friction, weight, balance) change every time you sit down at the table.

This paper introduces PhysMem, a new way for robots to learn these changing rules while they are playing, without needing to be reprogrammed by a human engineer.

Here is the breakdown using simple analogies:

1. The Problem: The Robot with a "Textbook" Brain

Current robot brains (called Vision-Language Models) are like brilliant students who have read every physics textbook in the library. They know the theory of friction and gravity.

The Issue: If you ask them, "How will this specific bouncy ball roll on this specific dusty table?" they often guess wrong. They know the general rule, but they lack the "street smarts" of how this specific object behaves in this specific moment.
The Result: They try a plan, fail, and then try the exact same wrong plan again because they don't know they failed. They are stuck in a loop of "textbook theory" vs. "messy reality."

2. The Solution: The "Scientific Lab" Robot

The authors created a system called PhysMem. Think of this robot not just as a worker, but as a scientist running a tiny lab inside its own head.

Instead of just memorizing every single move it ever made (which is like trying to remember every single step you took in your life), PhysMem uses a three-step scientific process:

Step A: The "Surprise" Detector (Experience Collection)

The robot tries to do a task.

If it succeeds exactly as predicted, it says, "Okay, my theory holds."
If it fails or something weird happens (a "surprise"), it flags it: "Wait, I thought the ball would stop here, but it rolled past the obstacle! My theory is wrong."

Step B: The "Hypothesis" Phase (Working Memory)

Instead of just storing the failure, the robot groups similar surprises together and asks its internal "brain" (a large language model): "What rule explains this?"

It generates a hypothesis (a guess).
Example Hypothesis: "Maybe I shouldn't push the ball fast when it's near the purple block."
This hypothesis is put in a "Testing Zone" (Working Memory). It's not a fact yet; it's just a theory being tested.

Step C: The "Verification" Phase (The Crucial Step)

This is the most important part. Before the robot accepts the rule as truth, it tests it.

It tries the new strategy on the next few attempts.
If it works: The hypothesis is promoted to a Verified Principle (Long-Term Memory). It becomes a permanent rule the robot follows.
If it fails: The hypothesis is thrown out. The robot learns, "Okay, that theory was wrong," and tries a new one.

3. The Magic Trick: "Folding" the Memory

Imagine you are learning to ride a bike. You fall 50 times. Do you need to remember the exact angle of your foot and the wind speed for all 50 falls? No.

PhysMem does "Memory Folding": It takes those 50 messy experiences and compresses them into one simple rule: "Lean left when turning right."
This keeps the robot's brain from getting cluttered with useless details, allowing it to remember the lesson without remembering the mess.

4. Real-World Results: The Robot Gets Smarter

The team tested this on three real-world tasks:

Packing Puzzle: Fitting weirdly shaped blocks into a box.
Ball Navigation: Pushing a soccer ball through an obstacle course.
Stone Stacking: Building a tower with irregular, slippery stones.

The Outcome:

Without PhysMem: The robot kept making the same mistakes, like a student failing the same math problem over and over.
With PhysMem: The robot started with a few failures, but within about 10 tries, it had "learned" the physics of the specific objects. It started making fewer mistakes and solving the puzzles much faster.
The "Aha!" Moment: In one experiment, the robot learned that pushing a ball too fast near a specific obstacle would make it get stuck. It turned this into a permanent rule: "Always push slowly near the purple block."

The Big Picture

This paper is about giving robots the ability to learn from their own mistakes in real-time.

Instead of being a rigid machine that follows a pre-written script, PhysMem turns the robot into a curious learner. It treats every failure as a scientific experiment, tests a new theory, and if the theory works, it adds it to its permanent rulebook.

In short: It's the difference between a robot that says, "I read in a book that balls roll," and a robot that says, "I tried it, fell down, figured out the trick, and now I know exactly how to roll this ball."

Here is a detailed technical summary of the paper "Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory" (PhysMem).

1. Problem Statement

Vision-Language Models (VLMs) possess strong declarative knowledge about physical concepts (e.g., friction, stability, momentum) but often fail to apply these principles to specific, novel physical scenarios during robot planning.

The Gap: A VLM may understand "friction" abstractly but cannot predict exactly how far a specific ball will roll on a specific surface or which irregular stone will provide a stable foundation without direct experience.
The Limitation: Existing approaches often rely on episodic retrieval (retrieving past experiences directly). However, embodied situations rarely repeat exactly. Retrieving raw experiences without verification leads to rigid behavior; a small change in object shape or friction can turn a previously successful heuristic into a repeated error.
The Goal: To enable VLM robot planners to acquire useful, task-specific physical understanding during deployment (test-time) through interaction, without updating the underlying model parameters.

2. Methodology: PhysMem Framework

PhysMem is a memory framework that implements a scientific memory loop to transform raw interaction data into verified, human-readable physical principles. It separates high-level planning (VLM) from low-level control (execution) to isolate reasoning improvements.

A. Three-Tier Memory Architecture

Episodic Memory: Stores raw experiences $(o, \omega, r, c, s)$ including observations, actions, outcomes, and symbolic states.
Working Memory: Holds hypotheses generated from clustered experiences. These are candidate rules currently under testing.
Long-Term Memory: Stores verified principles that guide future decisions. These are human-readable rules (e.g., "AVOID X when Y").
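
The three tiers can be pictured as plain data containers. The sketch below is illustrative only and is not taken from the paper's code: every class and field name (`Experience`, `Hypothesis`, `Principle`, `PhysMemStore`, `symbolic_state`, and so on) is a hypothetical choice, and the fields follow the components named in the text rather than the exact tuple symbols.

```python
from dataclasses import dataclass, field
from enum import Enum


class RuleKind(Enum):
    """Typed rule formats named in the summary."""
    AVOID = "AVOID"        # "Don't do X when Y"
    PREFER = "PREFER"      # "Do X when Y"
    SEQUENCE = "SEQUENCE"  # "Do X before Y"


@dataclass
class Experience:
    """One raw interaction record (episodic memory). Field names are assumed."""
    observation: str
    action: str
    success: bool
    symbolic_state: frozenset


@dataclass
class Hypothesis:
    """A candidate rule still under test (working memory)."""
    kind: RuleKind
    text: str
    supporting: int = 0
    contradicting: int = 0


@dataclass
class Principle:
    """A verified, human-readable rule (long-term memory)."""
    kind: RuleKind
    text: str


@dataclass
class PhysMemStore:
    """Hypothetical container holding the three tiers side by side."""
    episodic: list = field(default_factory=list)
    working: list = field(default_factory=list)
    long_term: list = field(default_factory=list)


store = PhysMemStore()
store.episodic.append(
    Experience("ball near purple block", "push_fast", False,
               frozenset({"near_obstacle", "high_speed"}))
)
store.working.append(
    Hypothesis(RuleKind.AVOID, "AVOID pushing fast near the purple block")
)
```

Keeping hypotheses and principles in separate tiers makes the promotion step explicit: a rule guides planning only after it has moved from `working` to `long_term`.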

B. The Scientific Memory Loop

The system operates through four distinct phases:

Experience Collection & Resonance Checking:
- The system records successes and failures.
- It calculates a Resonance Score ($\rho$): how well the current outcome aligns with active principles.
- Surprise Detection: If $\rho < 1$ (the outcome violates current principles), the experience is flagged for consolidation. If $\rho = 1$, it reinforces existing principles.
Hypothesis Generation:
- Experiences are clustered by symbolic similarity.
- A reflection model (VLM/LLM) analyzes clusters to generate candidate hypotheses in typed formats:
  - AVOID: "Don't do X when Y" (from failures).
  - PREFER: "Do X when Y" (from successes).
  - SEQUENCE: "Do X before Y" (temporal constraints).
Action-Level Attribution:
- Hypotheses are evaluated based on specific action outcomes, not just episode success. This isolates the effect of a specific decision from execution noise.
- Confidence scores are updated based on the ratio of supporting vs. contradicting evidence.
Verification & Promotion (The Core Innovation):
- Verification Before Application: Hypotheses are not applied immediately. They are tested through targeted interactions.
- Promotion: Only hypotheses with high confidence ($\ge 0.8$) and sufficient supporting evidence ($\ge 3$ episodes) are promoted to Long-Term Memory as principles.
- Refutation: Low-confidence hypotheses with contradicting evidence are discarded.
- Memory Folding: Once a principle is promoted, the supporting raw experiences are "folded" (compressed) into the principle to save memory and reduce context size.
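
A minimal, self-contained sketch of the loop's bookkeeping follows. It uses the promotion thresholds stated above (confidence of at least 0.8 and at least 3 supporting episodes). Everything else is an assumption made for illustration: the confidence formula (supporting divided by total evidence), the resonance definition (fraction of active principles the outcome agrees with), the refutation threshold `REFUTE_CONFIDENCE`, and all function and field names.

```python
from dataclasses import dataclass, field

PROMOTE_CONFIDENCE = 0.8   # stated in the text
PROMOTE_MIN_SUPPORT = 3    # stated in the text
REFUTE_CONFIDENCE = 0.3    # assumption: the text gives no refutation threshold


@dataclass
class Hypothesis:
    text: str
    supporting: list = field(default_factory=list)  # ids of supporting experiences
    contradicting: int = 0                          # count of contradicting outcomes

    @property
    def confidence(self) -> float:
        # Assumed formula: share of the evidence that supports the hypothesis.
        total = len(self.supporting) + self.contradicting
        return len(self.supporting) / total if total else 0.0


def resonance(consistent: list) -> float:
    """Assumed definition: fraction of active principles the outcome agrees with.

    Expects a non-empty list of booleans, one per active principle.
    """
    return sum(consistent) / len(consistent)


def is_surprise(rho: float) -> bool:
    # An outcome that conflicts with any active principle is flagged.
    return rho < 1.0


def review(hyp: Hypothesis, episodic: dict, long_term: list) -> str:
    """Promote, refute, or keep testing a single hypothesis."""
    if (hyp.confidence >= PROMOTE_CONFIDENCE
            and len(hyp.supporting) >= PROMOTE_MIN_SUPPORT):
        long_term.append(hyp.text)
        # Memory folding: drop the raw experiences the principle now summarizes.
        for exp_id in hyp.supporting:
            episodic.pop(exp_id, None)
        return "promoted"
    if hyp.contradicting > 0 and hyp.confidence < REFUTE_CONFIDENCE:
        return "refuted"
    return "testing"


# Usage: four supporting experiences and one contradicting outcome.
episodic = {1: "exp-1", 2: "exp-2", 3: "exp-3", 4: "exp-4", 5: "unrelated"}
long_term = []
hyp = Hypothesis("AVOID pushing fast near the purple block",
                 supporting=[1, 2, 3, 4], contradicting=1)
status = review(hyp, episodic, long_term)  # confidence is 4/5 = 0.8
```

In this run the hypothesis is promoted and the four supporting records are folded away, leaving only the unrelated experience in episodic memory. A hypothesis with two supporting episodes and no contradictions would stay in the "testing" state, because it has not yet reached the evidence minimum.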

3. Key Contributions

Test-Time Principle Learning: A novel framework that allows VLMs to learn physical principles from interaction without fine-tuning model weights.
Verification-First Design: Unlike retrieval-augmented methods that blindly apply past experience, PhysMem verifies hypotheses against new observations before promoting them, preventing "dogmatic" reliance on outdated rules.
Human-Readable Knowledge: The system produces interpretable, text-based principles (e.g., "Use low speed after passing the archway") that can be inspected, edited, or transferred, unlike opaque policy updates.
Scientific Memory Loop: A structured process of clustering, hypothesis generation, attribution, and verification that mimics the scientific method.

4. Experimental Results

The system was evaluated on three real-world manipulation tasks and a simulation benchmark across four VLM backbones (including Gemini-3-Flash, GPT-5.1, Qwen3-VL).

A. Real-World Tasks

Parts Organization: Packing irregular 3D shapes onto a grid.
- Result: PhysMem raised the task score from -1 to 9.7 over 30 minutes, while the no-memory baseline remained near 0.
Ball Navigation: Pushing a ball through obstacles.
- Result: PhysMem scored 14.7 versus 0.7 for the no-memory baseline. The system learned specific dynamics, such as speed adjustments after passing archways.
Balanced Stacking: Stacking irregular stones.
- Result: Consistent improvement in stability and tower height.

B. Simulation Benchmarks (Brick Insertion)

Principle Abstraction vs. Direct Retrieval:
- Direct Retrieval: 23% success rate (fails due to rigid reliance on exact matches).
- PhysMem (Principled Abstraction): 76% success rate.
Scaling: Performance stabilizes after learning 16–64 principles.
VLM Capability: Test-time learning amplifies the capabilities of stronger models (e.g., Gemini-3-Flash gained 23%), suggesting that effective hypothesis generation requires strong base reasoning.

C. Out-of-Distribution (OOD) Transfer

Similar Physics: When physics are similar (e.g., new stones with similar friction), prior principles transfer well (80% success).
Different Physics: When dynamics change significantly (e.g., new ball types with different elasticity), prior principles alone perform no better than zero-shot planning. However, test-time adaptation allows the system to learn the new dynamics, improving success from 10% to 40%.

5. Significance and Impact

Bridging the Grounding Gap: PhysMem connects declarative knowledge (what the VLM knows) to physical grounding (how the robot acts), enabling robots to adapt to novel physical environments in real time.
Interpretability: By outputting human-readable rules, the system allows for debugging and trust, addressing the "black box" nature of many RL or fine-tuning approaches.
Efficiency: The "Memory Folding" mechanism ensures that the system remains computationally tractable over long deployments by compressing redundant experiences.
Future Direction: This work suggests a path toward "lifelong learning" for robots where they grow wiser through experience, accumulating a library of verified physical laws rather than just memorizing specific trajectories.

In conclusion, PhysMem demonstrates that structured, verification-based memory systems can significantly enhance the physical reasoning capabilities of VLMs, allowing them to learn complex, task-specific physical principles through interaction without requiring model retraining.