Vision Language Models Cannot Reason About Physical Transformation

The Big Idea: The "Magic Trick" Failure

Imagine you have a magic trick where you pour a glass of water into a tall, skinny vase. The water level shoots up high, looking like there is way more water than before. A human knows instantly: "It's the same amount of water, just in a different shape."

This paper asks a simple question: Do modern AI models (Vision Language Models) understand this trick?

The answer is a resounding no. The researchers found that even the smartest AI models today are terrible at understanding that physical things (like water, coins, or playdough) stay the same amount even when they change shape or position. They are like a magician who gets confused by their own trick.

The Experiment: "Conservation Bench"

The researchers built a test called Conservation Bench. Think of it as a playground for AI, but instead of swings and slides, it has videos of physical changes.

They created 4 types of puzzles:

Number: Spreading out a row of coins so they look like there are more.
Length: Laying a straw flat vs. standing it up.
Volume: Pouring water from a short cup into a tall glass.
Size: Squishing a ball of playdough into a flat pancake.

For every puzzle, they had two versions:

The "Conserving" Version: The amount stays the same (e.g., water is poured, but none is lost).
The "Non-Conserving" Version: The amount actually changes (e.g., water is poured, but some is left behind in the old cup).

They tested 112 different AI models on over 23,000 questions.

The Results: The "Guessing Game"

The results were surprising and a bit scary for the future of AI.

1. The AI is just guessing (mostly)
Most models got the questions right only about 33% of the time. That's the same as flipping a coin or guessing randomly. They aren't actually "seeing" the physics; they are just guessing.

2. The "Text Bias" Trap
Here is the weirdest part. The researchers found that the AI models have a strong habit in their "brain" (their training data) that says: "If you ask me if something changed, the answer is usually 'No, it stayed the same'."

The Test: They showed the models the questions but with blank white screens (no images).
The Result: The models got the "Conserving" questions right 85% of the time!
The Twist: When they showed the models the real videos, their performance dropped.

The Analogy: Imagine a student taking a math test.

If the teacher asks, "Is 2+2 still 4 if I write it in red ink?" the student says "Yes!" because they know the rule.
But if the teacher shows a video of someone actually erasing a number, the student panics and says, "Wait, maybe it changed!"
The AI is so reliant on its "textbook rules" that the actual visual evidence confuses it and makes it fail.

3. More Frames Don't Help
The researchers tried giving the AI more video frames (showing the whole movie instead of just a snapshot) and tried different ways of asking the questions (like "Think step-by-step").

Result: It didn't matter. The AI still couldn't track the object through time. It's like giving a person a slow-motion video of a ball being thrown, but they still can't tell if the ball is moving forward or backward.

Why Does This Matter?

You might think, "So what? It's just a game with coins." But this is a huge problem for the future of AI.

Robots: If you want a robot to pour coffee without spilling, or a self-driving car to understand that a puddle is just water and not a hole in the road, it needs to understand physical conservation.
The "Black Box" Problem: The AI isn't failing because it's "dumb." It's failing because it doesn't have a mental model of how the world works. It's memorizing patterns, not understanding reality.

The Conclusion

Current AI models are like parrots. They can repeat the phrase "The amount of water stays the same" because they've heard it a million times in books. But if you show them a real glass of water being poured, they don't see the physics; they just get confused by the pixels.

Until AI can truly understand that changing the shape of something doesn't change what it is, we can't trust them to operate safely in the real, physical world. They are brilliant at reading, but terrible at doing.

1. Problem Statement

While Vision Language Models (VLMs) have demonstrated significant progress in perception and static reasoning, their ability to understand physical transformations in dynamic environments remains unverified. Specifically, it is unclear whether VLMs possess the cognitive capacity to reason about conservation—the principle that certain physical quantities (e.g., volume, number, mass) remain invariant despite changes in appearance or spatial arrangement.

Current benchmarks often focus on static scenes or video generation, failing to test if models can track objects over time, handle occlusions, or distinguish between superficial visual changes and underlying physical invariance. This gap limits the deployment of VLMs in embodied AI tasks (e.g., robotics) that require robust physical reasoning.

2. Methodology: ConservationBench

The authors introduce ConservationBench, a cognitively grounded benchmark designed to evaluate whether VLMs can reason about physical transformations.

Task Design: The benchmark consists of 384 video-based tasks across four quantitative properties:
1. Number: Spreading coins apart (invariant count).
2. Length: Repositioning straws (invariant length).
3. Volume: Pouring liquid into different-shaped containers (invariant volume).
4. Size: Reshaping playdough (invariant mass/size).
Conserving vs. Non-Conserving Controls:
- Conserving Tasks: The physical quantity remains constant despite visual transformation.
- Non-Conserving Controls: The quantity changes (e.g., liquid is not fully poured, coins are added/removed) while irrelevant features remain constant. This controls for models simply defaulting to "invariance" due to linguistic priors.
Experimental Variables:
- Temporal Resolution: Tested with 3, 5, 7, 9, and 16 frames to determine if higher frame rates aid reasoning.
- Sampling Strategies: Compared Uniform sampling, Human-selected frames, and Model-based (SEVILA) selection.
- Prompting: Evaluated Direct, Sequential, Chain-of-Thought (CoT), and Continuous prompts.
Scale: The study evaluated 112 VLMs (ranging from 1B to 76B parameters, including commercial and open-source models), generating 23,040 total trials across 60 experimental conditions.

3. Key Results

A. Systematic Failure of VLMs

Performance: Across 112 models, accuracy on conservation tasks ranged from 20% to 69%, with most models performing only marginally above the 33.3% chance level.
Human Baseline: Human participants achieved 98.35% accuracy, highlighting a massive gap between current VLMs and human intuitive physics.
Strict Evaluation: When requiring models to answer both a conserving and a non-conserving pair correctly (to prove genuine reasoning rather than bias), 73.2% of models performed below chance (<10% strict accuracy). Only three top-tier models (Gemini-2.5-Pro, Doubao-Seed-1.6-Vision, Claude-Sonnet-4-5) exceeded chance levels.

B. The "Textual Prior" vs. Visual Interference

Control experiments revealed a critical mechanism behind the failures:

Textual Bias: When visual content was removed (Empty Image or Text-only controls), models overwhelmingly answered "Conserve" (invariance) regardless of the prompt. This indicates a strong textual prior favoring quantity invariance.
Visual Interference: Paradoxically, when real visual content was provided, model performance dropped compared to the text-only condition.
- Interpretation: Models possess a correct "default" bias toward invariance (likely learned from text), but their visual processing is flawed. The visual input actively interferes with this correct prior, causing models to incorrectly reject invariance when they see shape changes. They fail to extract transformation-relevant information from sequential visual evidence.

C. Ineffectiveness of Mitigation Strategies

The study tested several hypotheses to improve performance, all of which failed:

Temporal Resolution: Increasing frame counts (from 3 to 16) did not improve performance. Models cannot integrate sequential evidence to track continuous changes.
Prompting: Chain-of-Thought (CoT) prompting actually degraded performance on transformation-helpful tasks, likely by amplifying reliance on brittle heuristics. "Continuous" prompts offered minor benefits but were insufficient for mandatory tasks (Volume/Size).
Sampling: Curated frame selection (Human or Model-based) did not help; in fact, for mandatory tasks, uniform sampling often outperformed curated methods, suggesting models cannot utilize task-relevant biases even when frames are "optimal."

D. Scaling Laws Do Not Apply

Unlike many other capabilities where performance scales with parameter count, conservation reasoning does not emerge with scale.
There was virtually no correlation between model size (1B–76B) and conservation accuracy ( $R^2 = 0.019$ ).
Larger models showed slightly better performance on non-conserving controls (detecting changes), but this did not translate to better conservation reasoning.

4. Key Contributions

ConservationBench: A rigorous, cognitively grounded benchmark that moves beyond static counting to test dynamic physical reasoning and invariance.
Diagnostic of Failure Modes: The paper identifies that VLMs fail not due to a lack of knowledge, but due to an inability to form transformation-invariant representations. They rely on brittle heuristics and suffer from visual interference that overrides correct textual priors.
Negative Evidence on Scaling: Demonstrates that simply scaling up model parameters or adding more frames does not solve the fundamental deficit in physical reasoning.
Methodological Framework: Provides a comprehensive evaluation protocol including non-conserving controls, temporal resolution variations, and bias dissociation experiments.

5. Significance and Implications

Embodied AI Limitations: The findings suggest that current VLMs are fundamentally unsuited for real-world embodied tasks (robotics, navigation) where understanding physical dynamics is critical. Without the ability to reason about conservation, agents cannot reliably predict the outcomes of physical interactions.
Architectural Deficits: The results point to a mechanistic limitation in how VLMs encode visual-spatial information. Current coarse-grained visual encodings appear incapable of supporting the structured, sequential reasoning required for physical inference.
Future Directions: The paper calls for new architectures that can explicitly model temporal continuity and physical constraints, rather than relying on statistical correlations in text or static image patterns. ConservationBench serves as a "sanity check" for future foundation models aiming to achieve robust physical intelligence.

In summary, the paper concludes that current Vision Language Models cannot genuinely reason about physical transformations. They fail to maintain invariant representations of physical properties across dynamic scenes, relying instead on superficial heuristics and textual biases that break down when confronted with actual visual evidence.