PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

Imagine you are a director trying to film a movie with a very talented, but slightly naive, special effects artist. This artist (the AI video generator) is amazing at making things look beautiful and colorful. However, they don't quite understand how the real world works. If you ask them to "pour wine into a glass," they might make the wine flow beautifully, but the liquid level in the glass might never rise, or the bottle might float away. They are following your words, but ignoring the laws of physics.

PhyPrompt is like hiring a brilliant, physics-savvy scriptwriter who stands between you and the artist. Their job is to take your simple idea and rewrite it into a detailed script that forces the artist to get the physics right, without changing your original vision.

Here is how PhyPrompt works, broken down into simple steps:

1. The Problem: The "Naive Artist"

Current AI video generators are like that special effects artist. They are great at making things look pretty (high "visual quality"), but they often break the laws of physics.

The Issue: If you say, "A ball falls," the AI might make it fall, but it might bounce up forever or pass through the floor.
The Old Fix: Humans had to manually rewrite the prompts to be super specific (e.g., "The ball falls due to gravity and stops when it hits the floor"). This works, but it's slow, boring, and requires you to be a physics expert.

2. The Solution: The "Smart Scriptwriter" (PhyPrompt)

PhyPrompt is an AI system that acts as that smart scriptwriter. It doesn't generate the video itself; it just rewrites your prompt before the video is made. It uses a two-step training process to learn how to do this well.

Step 1: The "Textbook" Phase (Supervised Fine-Tuning)

First, the scriptwriter (a Large Language Model) reads a special textbook. This textbook contains thousands of examples where a simple prompt is turned into a physics-perfect prompt, complete with a "thought process" explaining why the change was made.

Analogy: It's like a student reading a guidebook that says, "When you ask for a falling apple, you must mention gravity and the ground, otherwise the apple will float."

Step 2: The "Practice Game" Phase (Reinforcement Learning)

Now, the scriptwriter starts practicing. It takes your prompt, rewrites it, and sends it to the video generator. Then, a "Judge" (an automated evaluator) watches the resulting video and gives it two scores:

Did it look like what you asked for? (Semantic Adherence)
Did it obey the laws of physics? (Physical Commonsense)

Here is the clever part: The scriptwriter plays a dynamic game.

Early in training: The game tells the scriptwriter, "Don't worry about physics yet! Just make sure you describe the scene correctly." (Focus on what is happening).
Later in training: Once the scriptwriter knows how to describe the scene, the game shifts gears: "Now, make sure the scene makes physical sense!" (Focus on how it happens).

Why this matters: Describing the scene faithfully and getting the physics right tend to pull against each other, so pushing on both at once can leave the scriptwriter worse at one of them. Teaching one first and then the other lets it improve on both. It's like learning to drive: first you learn to steer and stop (the basics), and only once you master that do you learn to parallel park (the finer skill).

3. The Results: Magic Without the Magic

The paper shows that PhyPrompt is incredibly effective:

As Good as Human Experts: It can rewrite prompts as well as a human expert, but it does it instantly and doesn't get tired.
Better than Giant Models: It beats massive AI models such as GPT-4o and DeepSeek-V3 (which is roughly 100 times bigger), even though PhyPrompt is much smaller. This suggests that specialized training can matter more than just making the AI bigger.
Works Across Generators: The best part? You train it with one video generator, and it still improves results on other generators without needing to be retrained. It's like a universal translator that understands physics, no matter which "camera" you use.

The Big Takeaway

PhyPrompt tackles the problem of "beautiful but impossible" videos. It gets video generators to respect the laws of physics (gravity, collisions, fluid dynamics) by acting as a smart middleman.

In short: It takes your simple wish ("Pour the wine"), adds the necessary physics details ("The liquid rises steadily"), and hands it to the video maker. The result? A video that stays closer to what you imagined and behaves more like the real world, with fewer floating wine bottles and teleporting balls.

1. Problem Statement

State-of-the-art Text-to-Video (T2V) generators produce visually high-quality clips but frequently violate basic physical laws (e.g., objects teleporting, ignoring gravity, or passing through one another).

Root Cause: The paper argues this is not a limitation of the video generation models themselves, but rather a prompt deficiency. T2V models are trained on detailed captions but often receive brief, underspecified user prompts at inference.
The Bottleneck: Manually rewriting prompts to include explicit physical details (e.g., "the liquid level rises steadily") successfully yields physically plausible videos, but this requires domain expertise, is time-consuming, and does not scale.
Limitations of Existing Solutions:
- Promptist: Optimizes for aesthetics, not physics.
- GPT-4o/LLMs: Can improve physics but often degrade semantic fidelity (SA) and lack systematic optimization for video-level commonsense.
- PhyT2V: Uses iterative self-refinement but is inefficient due to complex step-back mechanisms and multiple prompting rounds.

2. Methodology: PhyPrompt Framework

PhyPrompt is a two-stage reinforcement learning (RL) framework that uses a Large Language Model (LLM) to automatically refine user prompts into physics-aware descriptions. The video generator remains frozen; only the prompt rewriter is trained.

A. Stage 1: Supervised Fine-Tuning (SFT)

Dataset Construction: The authors created a Chain-of-Thought (CoT) dataset based on PhyGenBench. It consists of triplets: (Original Prompt, Reasoning Chain, Enhanced Prompt).
Process: An LLM (Qwen2.5) is fine-tuned on this dataset to learn how to reason about physical laws (e.g., force, motion, fluid dynamics) and translate them into descriptive text while preserving the user's original intent.
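
To make the triplet structure concrete, here is a minimal sketch of how one (Original Prompt, Reasoning Chain, Enhanced Prompt) record might be turned into a supervised training pair. The class name, the instruction text, the `<think>`/`<answer>` template, and the sample record are all assumptions made for illustration; the summary does not describe the authors' actual data format.

```python
from dataclasses import dataclass


@dataclass
class CoTExample:
    """One SFT triplet: original prompt, reasoning chain, enhanced prompt."""
    original_prompt: str
    reasoning_chain: str
    enhanced_prompt: str


def to_sft_pair(ex: CoTExample) -> dict:
    """Turn a triplet into an (input, target) pair for supervised fine-tuning.

    The instruction wording and the <think>/<answer> tags are hypothetical.
    """
    source = (
        "Rewrite this video prompt so it is physically plausible:\n"
        f"{ex.original_prompt}"
    )
    target = (
        f"<think>{ex.reasoning_chain}</think>\n"
        f"<answer>{ex.enhanced_prompt}</answer>"
    )
    return {"input": source, "target": target}


# Illustrative record, not taken from the dataset.
example = CoTExample(
    original_prompt="Pour wine into a glass.",
    reasoning_chain="Poured liquid accumulates, so the level in the glass should rise.",
    enhanced_prompt="Wine is poured from a bottle into a glass; the liquid level rises steadily.",
)
pair = to_sft_pair(example)
```

Keeping the reasoning chain in the target means the model is trained to produce its physical reasoning before the final rewrite, which matches the Chain-of-Thought framing above.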

B. Stage 2: Reinforcement Learning via GRPO

Algorithm: The authors employ Group Relative Policy Optimization (GRPO), which samples multiple candidate prompts per query to estimate advantages without needing a separate value network.
Pipeline:
1. User prompt $x$ is rewritten by the LLM into enhanced prompt $y$.
2. $y$ is fed to a frozen T2V generator (e.g., CogVideoX-2B) to produce video $v$.
3. An automated evaluator (VideoPhy2-Auto) scores $v$ on Semantic Adherence (SA) and Physical Commonsense (PC).
4. The LLM policy is updated based on these scores.
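
The four steps above can be sketched as a single rollout for one user prompt. This is a rough illustration, not the authors' code: the rewriter, generator, and evaluator are hypothetical callables with toy stand-ins, and the advantage uses the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation). The actual policy-gradient update in step 4 is omitted.

```python
import random
import statistics
from typing import Callable, List, Tuple

# Hypothetical component signatures, not real library APIs.
Rewriter = Callable[[str], str]                   # LLM policy: x -> y
Generator = Callable[[str], str]                  # frozen T2V model: y -> v
Evaluator = Callable[[str], Tuple[float, float]]  # v -> (SA, PC)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rollout_group(
    x: str,
    rewrite: Rewriter,
    generate: Generator,
    evaluate: Evaluator,
    reward_fn: Callable[[float, float], float],
    group_size: int = 4,
) -> Tuple[List[str], List[float]]:
    """Sample several rewrites of x, score each video, return advantages."""
    candidates = [rewrite(x) for _ in range(group_size)]
    rewards = []
    for y in candidates:
        v = generate(y)         # step 2: frozen generator
        sa, pc = evaluate(v)    # step 3: automated evaluator
        rewards.append(reward_fn(sa, pc))
    return candidates, group_relative_advantages(rewards)


# Toy stand-ins so the sketch runs end to end.
rng = random.Random(0)
toy_rewrite = lambda x: f"{x} (variant {rng.randint(0, 999)})"
toy_generate = lambda y: f"video<{y}>"
toy_evaluate = lambda v: (rng.uniform(1, 5), rng.uniform(1, 5))
equal_reward = lambda sa, pc: 0.5 * sa + 0.5 * pc  # placeholder reward mix

cands, advs = rollout_group(
    "A ball falls", toy_rewrite, toy_generate, toy_evaluate, equal_reward
)
```

Because advantages are computed relative to the other candidates for the same prompt, no separate value network is needed to provide a baseline.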

C. Key Innovation: Dynamic Reward Curriculum

The core challenge is the inherent conflict between Semantic Adherence (SA) and Physical Commonsense (PC); optimizing one often degrades the other (negative transfer). PhyPrompt solves this with a time-dependent reward curriculum:

Early Training: The reward heavily weights Semantic Adherence ($w_{sa} \approx 1$). This establishes a "semantic scaffold" (correct objects, relationships, scene structure).
Late Training: The reward progressively shifts weight to Physical Commonsense ($w_{pc} \to 1$). This refines the scaffold with specific physical details (forces, dynamics, causality).
Mechanism: The SA weight decays exponentially over training steps $t$, and the PC weight takes up the remainder. Here $\alpha$ sets the decay rate and $T$ normalizes the step count:
$w_{sa}(t) = \exp(-\alpha t/T), \quad w_{pc}(t) = 1 - w_{sa}(t)$
Result: This staged approach allows the model to discover compositional prompt structures that exceed the performance limits of single-objective optimization.
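
The weight schedule is simple enough to compute directly. The sketch below implements the two formulas as stated; combining the scores as a weighted sum, the value `alpha=3.0`, and the 1000-step horizon are assumptions chosen for illustration, not values reported in the summary.

```python
import math
from typing import Tuple


def curriculum_weights(t: int, total_steps: int, alpha: float) -> Tuple[float, float]:
    """Return (w_sa, w_pc) with w_sa = exp(-alpha * t / T) and w_pc = 1 - w_sa."""
    w_sa = math.exp(-alpha * t / total_steps)
    return w_sa, 1.0 - w_sa


def curriculum_reward(
    sa: float, pc: float, t: int, total_steps: int, alpha: float = 3.0
) -> float:
    """Blend the two scores; the linear combination is an assumption."""
    w_sa, w_pc = curriculum_weights(t, total_steps, alpha)
    return w_sa * sa + w_pc * pc


# Print the schedule at a few points of a hypothetical 1000-step run.
for t in (0, 250, 500, 1000):
    w_sa, w_pc = curriculum_weights(t, total_steps=1000, alpha=3.0)
    print(f"t={t:4d}  w_sa={w_sa:.3f}  w_pc={w_pc:.3f}")
```

With these assumed settings, $w_{sa}$ starts at 1 and falls to about 0.05 by the final step, so $w_{pc}$ approaches but never exactly reaches 1. A larger $\alpha$ hands control to the physics term earlier in training.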

3. Key Contributions

Demonstration of Capability: Showed that current T2V models can generate physically plausible videos if provided with physics-aware prompts, identifying the prompt as the bottleneck.
Novel Framework (PhyPrompt): Introduced a two-stage training pipeline (SFT + GRPO) that automates physics-aware prompt engineering without human expertise.
Dynamic Reward Curriculum: Proposed a novel RL mechanism that sequentially optimizes for semantic fidelity and then physical realism, achieving synergistic improvements that surpass static multi-objective trade-offs.
Zero-Shot Transferability: The trained rewriter transfers effectively across diverse, unseen T2V architectures (LaVie, VideoCrafter2, CogVideoX) without re-tuning.

4. Experimental Results

The framework was evaluated on the VideoPhy2 benchmark using four T2V generators (LaVie, VideoCrafter2, CogVideoX-2B, CogVideoX-5B).

Performance on CogVideoX-2B (PhyPrompt-7B):
- Semantic Adherence (SA): 47.8% (vs. 43.4% baseline).
- Physical Commonsense (PC): 66.8% (vs. 55.8% baseline).
- Joint Success (SA $\ge$ 4 AND PC $\ge$ 4): 40.8% (an 8.6 percentage point gain over the baseline).
- Comparison: Outperforms GPT-4o (+3.8% joint) and DeepSeek-V3 (671B parameters, roughly 100x larger) despite using only 7B parameters.
Zero-Shot Transfer:
- When applied to VideoCrafter2, PhyPrompt improved joint success by 16.8%.
- It consistently outperformed baselines (Promptist, PhyT2V, GPT-4o) across all tested generators.
Ablation Studies:
- Single-Objective Failure: Training solely on SA or PC resulted in severe negative transfer (e.g., SA-only training dropped PC by 3.0%).
- Curriculum Superiority: The dynamic curriculum achieved higher scores on both metrics than the best single-objective models, indicating that it reaches prompt regions that static optimization does not.
- Qualitative Example: In a "hammer and nail" scenario, static weighting failed to include the nail (semantic failure), while PhyPrompt's dynamic curriculum correctly described the nail and the force dynamics, resulting in a physically coherent video.
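
The joint success figures in this section count a video only when both scores clear the threshold (SA $\ge$ 4 AND PC $\ge$ 4). A minimal sketch of that computation follows, assuming per-video integer scores; the function name and the sample scores are made up for illustration.

```python
from typing import Iterable, Tuple


def joint_success_rate(scores: Iterable[Tuple[int, int]], threshold: int = 4) -> float:
    """Fraction of videos with SA >= threshold AND PC >= threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one scored video")
    hits = sum(1 for sa, pc in scores if sa >= threshold and pc >= threshold)
    return hits / len(scores)


# Made-up (SA, PC) scores for five videos, for illustration only.
toy_scores = [(5, 4), (4, 3), (3, 5), (4, 4), (2, 2)]
rate = joint_success_rate(toy_scores)  # 2 of the 5 videos pass both thresholds
```

Because a video must pass both thresholds, the joint rate can never exceed the lower of the individual SA and PC pass rates, which is why it is a stricter measure than either score alone.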

5. Significance

Efficiency over Scaling: The paper demonstrates that domain-specialized training with direct task feedback is more effective than simply scaling up model parameters (e.g., PhyPrompt-7B beats DeepSeek-V3 671B).
Solving Multi-Objective Conflict: It provides a blueprint for resolving conflicting objectives in generative AI (semantics vs. physics) through compositional curricula, showing that sequential optimization can yield superadditive results.
Practical Deployment: By keeping the video generator frozen and training only a lightweight, model-agnostic rewriter, PhyPrompt offers a scalable, parameter-efficient solution for making T2V systems suitable for robotics, scientific visualization, and education where physical realism is critical.

