OSCBench: Benchmarking Object State Change in Text-to-Video Generation

This paper introduces OSCBench, a novel benchmark derived from instructional cooking data that evaluates the ability of text-to-video models to generate accurate and temporally consistent object state changes, revealing that current models struggle significantly with this capability despite their progress in other areas.

Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen

Published 2026-03-13

Imagine you are teaching a robot chef how to cook. You give it a simple instruction: "Peel and slice this potato."

If the robot chef is really good, it will show you a video where the potato starts whole, gets peeled, and ends up as neat slices. But what if the robot chef is just "pretending"? It might show a potato that looks like it's being peeled, but the skin never actually comes off, or the potato magically turns into a slice without ever being cut. Or worse, it might peel a carrot instead of a potato because it's confused.

This is exactly the problem the paper OSCBench is trying to solve.

The Big Idea: The "Magic Trick" Test

Current AI video generators (Text-to-Video models) are amazing at making things look pretty. If you ask for a video of a chef in a kitchen, the lighting, the chef's apron, and the kitchen background usually look perfect.

However, these AIs are terrible at Object State Change (OSC).

  • State Change means an object actually transforming from one thing to another (e.g., Whole Lemon → Sliced Lemon).
  • The Problem: The AIs are great at the "setup" (the chef, the knife, the kitchen) but fail at the "magic trick" (the actual cutting). They often produce videos where the object stays whole, disappears, or changes into something weird.

The authors built a new test called OSCBench to specifically check if these AIs can handle the "magic trick" of changing an object's state.

How They Built the Test (The Recipe Book)

To test this, the researchers didn't just make up random ideas. They looked at thousands of real cooking videos (like "HowTo100M") because cooking is full of clear state changes: chopping, peeling, frying, and mixing.

They organized their test into three levels of difficulty, like a video game:

  1. Regular Level (The Basics):
    • Example: "Slice a lemon."
    • Why: This is common. The AI has probably seen this a million times in its training data. It's like asking a human to tie their shoes.
  2. Novel Level (The Twist):
    • Example: "Peel a berry."
    • Why: You don't usually peel berries! The AI can't just "memorize" the answer. It has to understand what "peeling" means and apply it to a new object. This is like asking a human to tie a knot they've never seen before.
  3. Compositional Level (The Combo):
    • Example: "Peel and then slice a pear."
    • Why: This requires doing two things in a row, where the second step depends on the first one being done right. It's like asking a human to peel a potato, then slice it, without dropping it or forgetting the knife.
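The three tiers can be pictured as a small prompt table. This is an illustrative sketch, not the official OSCBench data format; the tier names and example prompts are taken from the article, and the helper function is hypothetical.

```python
# Hypothetical sketch of OSCBench's three difficulty tiers as a prompt table.
# The structure and helper are illustrative, not the benchmark's actual schema.
OSC_PROMPTS = {
    "regular": [        # common state changes the model has likely seen in training
        "Slice a lemon.",
    ],
    "novel": [          # a familiar action applied to an unusual object
        "Peel a berry.",
    ],
    "compositional": [  # two state changes in sequence, the second depending on the first
        "Peel and then slice a pear.",
    ],
}

def prompts_for(level: str) -> list[str]:
    """Return the text prompts for one difficulty tier."""
    return OSC_PROMPTS[level]
```

The key design choice is that the "novel" tier blocks memorization: the model must transfer what "peeling" means to an object it has rarely, if ever, seen peeled.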

The Results: The "Uncanny Valley" of Cooking

The researchers tested six top-tier AI video models (some free and open, some paid and secret). Here is what they found:

  • The Good News: The AIs are fantastic at the "decor." They get the chef, the kitchen, and the lighting right.
  • The Bad News: They are terrible at the "cooking."
    • In many videos, the knife passes through the fruit without cutting it.
    • Sometimes the fruit turns into slices instantly without the action happening.
    • In the "Combo" level, the AI often forgets the first step (peeling) and just does the second (slicing), or vice versa.

The Analogy: Imagine a movie special effects team. They can build a perfect set and dress the actors perfectly. But when the actor is supposed to eat an apple, the AI makes the apple disappear or turn into a rock. The movie looks great, but the physics are broken.

How Did They Grade the AIs?

They used two methods:

  1. Human Judges: Real people watched the videos and gave them scores.
  2. AI Judges (MLLMs): They used a super-smart AI (like GPT-5) to watch the videos. But they didn't just ask the AI "Is this good?" They gave it a Checklist (Chain-of-Thought).
    • Step 1: "Look at the lemon. Is it whole?"
    • Step 2: "Look at the next frame. Is it sliced?"
    • Step 3: "Did the transition happen smoothly, or did it jump?"
    • Step 4: "Give a score."
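The checklist above can be sketched as a prompt builder for the MLLM judge. The wording and function below are a hypothetical reconstruction of the idea, not the paper's exact checklist or scoring rubric.

```python
def build_checklist_prompt(obj: str, start_state: str, end_state: str) -> str:
    """Assemble a step-by-step (chain-of-thought) judging prompt for an
    MLLM video judge. Hypothetical wording, illustrating the checklist idea."""
    steps = [
        f"1. Look at the first frames: is the {obj} {start_state}?",
        f"2. Look at the final frames: is the {obj} {end_state}?",
        "3. Did the transition happen gradually on screen, or did it jump?",
        "4. Based on steps 1-3, give an overall score.",
    ]
    return "\n".join(steps)
```

Decomposing the judgment into explicit steps is what makes the AI judge reliable: instead of one fuzzy "is this good?" question, each step checks a concrete, verifiable fact about the video.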

They found that the AI judges were surprisingly good at spotting these errors, almost as good as the humans. This is huge because it means we can automatically test future AIs without hiring hundreds of people.

The Takeaway

The paper concludes that while AI video generation is getting visually stunning, it still lacks a deep understanding of cause and effect. It knows what a "knife" looks like, but it doesn't fully understand that a knife cuts things.

OSCBench is like a new driving test for AI. Before, we just checked if the car looked nice. Now, we are checking if the car can actually stop at a red light and turn left without crashing. Until AIs pass this test, we can't fully trust them to simulate real-world actions, like helping robots in factories or creating realistic instructional videos.

In short: The AI can draw a beautiful picture of a chef, but it can't yet make the chef actually cook the meal.