PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

PhyPrompt introduces a two-stage reinforcement learning framework that automatically refines text-to-video prompts through physics-focused fine-tuning and a dynamic reward curriculum, significantly enhancing physical plausibility and semantic adherence across diverse models while outperforming much larger general-purpose LLMs.

Shang Wu, Chenwei Xu, Zhuofan Xia, Weijian Li, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

Published 2026-03-05

Imagine you are a director trying to film a movie with a very talented, but slightly naive, special effects artist. This artist (the AI video generator) is amazing at making things look beautiful and colorful. However, they don't quite understand how the real world works. If you ask them to "pour wine into a glass," they might make the wine flow beautifully, but the liquid level in the glass might never rise, or the bottle might float away. They are following your words, but ignoring the laws of physics.

PhyPrompt is like hiring a brilliant, physics-savvy scriptwriter who stands between you and the artist. Their job is to take your simple idea and rewrite it into a detailed script that forces the artist to get the physics right, without changing your original vision.

Here is how PhyPrompt works, broken down into simple steps:

1. The Problem: The "Naive Artist"

Current AI video generators are like that special effects artist. They are great at making things look pretty (high "visual quality"), but they often break the laws of physics.

  • The Issue: If you say, "A ball falls," the AI might make it fall, but it might bounce up forever or pass through the floor.
  • The Old Fix: Humans had to manually rewrite the prompts to be super specific (e.g., "The ball falls due to gravity and stops when it hits the floor"). This works, but it's slow, boring, and requires you to be a physics expert.

2. The Solution: The "Smart Scriptwriter" (PhyPrompt)

PhyPrompt is an AI system that acts as that smart scriptwriter. It doesn't generate the video itself; it just rewrites your prompt before the video is made. It learns to do this well through a two-stage training process.
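
As a rough sketch of where the scriptwriter sits in the pipeline, the Python below wires a prompt rewriter in front of a video generator. Both `toy_rewriter` and `toy_generator` are hypothetical stand-ins, not PhyPrompt's actual interface; a real setup would call a fine-tuned language model and a text-to-video model.

```python
from typing import Callable, Dict


def refine_then_generate(
    user_prompt: str,
    rewrite_prompt: Callable[[str], str],
    generate_video: Callable[[str], str],
) -> Dict[str, str]:
    """Rewrite the prompt first, then pass only the rewritten text to the generator."""
    refined = rewrite_prompt(user_prompt)
    video = generate_video(refined)
    return {"user_prompt": user_prompt, "refined_prompt": refined, "video": video}


# Hypothetical stand-ins so the sketch runs on its own.
def toy_rewriter(prompt: str) -> str:
    # A real rewriter would be a trained language model, not a fixed suffix.
    return prompt + " The liquid level in the glass rises steadily as the wine is poured."


def toy_generator(prompt: str) -> str:
    # A real generator would return video frames; here we return a placeholder label.
    return f"<video for: {prompt}>"


result = refine_then_generate("Pour wine into a glass.", toy_rewriter, toy_generator)
print(result["refined_prompt"])
```

The video generator is untouched in this arrangement; only the text it receives changes.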

Step 1: The "Textbook" Phase (Supervised Fine-Tuning)

First, the scriptwriter (a Large Language Model) reads a special textbook. This textbook contains thousands of examples where a simple prompt is turned into a physics-perfect prompt, complete with a "thought process" explaining why the change was made.

  • Analogy: It's like a student reading a guidebook that says, "When you ask for a falling apple, you must mention gravity and the ground, otherwise the apple will float."
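
To make the textbook idea concrete, here is one way a single training example could be laid out. The field names, the `<think>` tags, the instruction wording, and the sample text are all illustrative assumptions; the paper's actual data format is not described here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RefinementExample:
    """One hypothetical supervised example: simple prompt, thought process, refined prompt."""
    simple_prompt: str
    reasoning: str
    refined_prompt: str


def to_training_pair(example: RefinementExample) -> Tuple[str, str]:
    """Format the example as (model input, target output) text."""
    model_input = (
        "Rewrite this video prompt so it is physically plausible: "
        + example.simple_prompt
    )
    # The target places the thought process before the final rewritten prompt.
    target = f"<think>{example.reasoning}</think>\n{example.refined_prompt}"
    return model_input, target


# Made-up sample in the spirit of the falling-apple analogy.
example = RefinementExample(
    simple_prompt="An apple falls from a tree.",
    reasoning="The prompt does not mention gravity or the ground, so the apple might float.",
    refined_prompt="An apple detaches from a branch, accelerates downward under gravity, "
                   "and comes to rest on the grass.",
)

model_input, target = to_training_pair(example)
print(model_input)
print(target)
```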

Step 2: The "Practice Game" Phase (Reinforcement Learning)

Now, the scriptwriter starts practicing. It takes your prompt, rewrites it, and sends it to the video generator. Then, a "Judge" (an automated evaluator) watches the resulting video and gives it two scores:

  1. Did it look like what you asked for? (Semantic Adherence)
  2. Did it obey the laws of physics? (Physical Commonsense)

Here is the clever part: The scriptwriter plays a dynamic game.

  • Early in training: The game tells the scriptwriter, "Don't worry about physics yet! Just make sure you describe the scene correctly." (Focus on what is happening).
  • Later in training: Once the scriptwriter knows how to describe the scene, the game shifts gears: "Now, make sure the scene makes physical sense!" (Focus on how it happens).

Why this matters: If the scriptwriter tries to learn both goals at once, it gets confused and does poorly on both. By learning one thing first, then the other, it improves on both. It's like learning to drive: first, you learn to steer and stop (the basics); once you master that, you learn to parallel park (the advanced skill).
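
The shifting emphasis can be sketched as a weighted sum of the two judge scores, with weights that change as training progresses. The linear schedule, the start and end weights, and the function names below are assumptions chosen for illustration, not the paper's actual reward curriculum.

```python
def curriculum_weights(step, total_steps, physics_start=0.25, physics_end=0.75):
    """Return (w_semantic, w_physics) for the given training step.

    Hypothetical linear schedule: the physics weight ramps from
    physics_start to physics_end, and the semantic weight is the remainder.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well-defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_physics = physics_start + (physics_end - physics_start) * progress
    return 1.0 - w_physics, w_physics


def combined_reward(semantic_score, physics_score, step, total_steps):
    """Blend the two judge scores using the current curriculum weights."""
    w_semantic, w_physics = curriculum_weights(step, total_steps)
    return w_semantic * semantic_score + w_physics * physics_score


# The same video scores, judged early and late in training.
early = combined_reward(semantic_score=1.0, physics_score=0.0, step=0, total_steps=1000)
late = combined_reward(semantic_score=1.0, physics_score=0.0, step=1000, total_steps=1000)
print(early, late)
```

Under this schedule, a video that matches the description but ignores physics earns a higher reward early in training than late, which is the pressure that moves the rewriter toward physically grounded prompts.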

3. The Results: Magic Without the Magic

The paper reports that PhyPrompt is highly effective:

  • As Good as a Human Expert: It can rewrite prompts as well as a human expert, but it does so instantly and doesn't get tired.
  • Better than Giant Models: It beats massive general-purpose AI models (like GPT-4o or DeepSeek-V3) that are 100 times bigger. This suggests that specialized training can matter more than simply making the AI bigger.
  • Works Across Generators: The best part? You train it on one type of video generator, and it still works well on other generators without needing to be retrained. It's like a universal translator that understands physics, no matter which "camera" you use.

The Big Takeaway

PhyPrompt solves the problem of "beautiful but impossible" videos. It teaches AI to respect the laws of physics (gravity, collisions, fluid dynamics) by acting as a smart middleman.

In short: It takes your simple wish ("Pour the wine"), adds the necessary physics details ("The liquid rises steadily"), and hands it to the video maker. The result? A video that still looks like what you imagined, but behaves much more like the real world. Fewer floating wine bottles and teleporting balls!