Imagine you are teaching a robot to understand the world. You show it a video of a cat chasing a mouse. The robot can easily tell you, "That's a cat, that's a mouse, and the cat is running." This is Semantic Perception—it's good at naming things and describing what it sees.
But can the robot tell you why the mouse runs? Can it predict that if the cat jumps, the mouse will dodge? Can it spot a video where the cat suddenly turns into a toaster, or where the mouse floats upward like a helium balloon?
This is the problem the paper HOCA-Bench tries to solve. It argues that current AI video models are like amazing actors who can memorize lines but don't understand the plot. They can describe the scene perfectly, but they don't have a "physics engine" in their brain to understand how the world actually works.
Here is a simple breakdown of their solution:
1. The "Hegelian" Lens: Two Types of Glitches
The authors borrow a distinction from the philosopher Hegel to split physical mistakes into two buckets. Think of it like checking a video game for bugs:
Bucket A: The "Identity" Glitch (Ontological Anomalies)
- The Metaphor: Imagine a character in a video game who suddenly has three heads, or a tree that turns into a sandwich.
- The Problem: The object itself is broken. It violates its own definition. "A cat is a cat; it shouldn't have a beak."
- The AI's Performance: Current AI is actually pretty good at spotting these. If you show them a three-headed sheep, they say, "Hey, that's weird!"
Bucket B: The "Relationship" Glitch (Causal Anomalies)
- The Metaphor: Imagine a video where you drop a rock, but instead of falling down, it floats up. Or you push a car, and it doesn't move because the friction is missing.
- The Problem: The objects are fine, but the rules of how they interact are broken. Gravity is ignored, or momentum doesn't exist.
- The AI's Performance: This is where the AI fails miserably. They often miss these completely. They see the rock floating and might just think, "Oh, it's a magic rock," rather than realizing the laws of physics are broken.
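The two buckets above can be sketched as a tiny labeling scheme. This is a hypothetical illustration of the taxonomy, not the paper's actual annotation code; the example clips are the ones described above:

```python
from enum import Enum


class AnomalyType(Enum):
    """The two buckets of physical glitches in the HOCA-Bench framing."""
    ONTOLOGICAL = "identity glitch"   # the object violates its own definition
    CAUSAL = "relationship glitch"    # the interaction violates physical law


# Hypothetical example labels matching the clips described above.
examples = {
    "sheep with three heads": AnomalyType.ONTOLOGICAL,
    "tree turns into a sandwich": AnomalyType.ONTOLOGICAL,
    "dropped rock floats upward": AnomalyType.CAUSAL,
    "pushed car does not move": AnomalyType.CAUSAL,
}

for clip, label in examples.items():
    print(f"{clip!r} -> {label.name}")
```

The split matters because, per the paper, models score very differently on the two labels: they catch most ONTOLOGICAL clips but miss most CAUSAL ones.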
2. The "Adversarial Simulator": Breaking the World on Purpose
Real-life videos (a cat playing with a ball, say) follow the rules of physics. Nature will never hand you footage of a floating rock. So how do you test whether an AI knows physics?
The authors used Generative AI (the same tech that makes fake videos) as a "chaos machine." They asked these AI generators to create videos that look real but contain impossible physics.
- They asked the AI: "Make a video of coffee pouring into a cup, but make the liquid level stay the same."
- They asked: "Make a video of a bird that is as big as a house."
These "fake" videos became the test questions. If the AI model can spot the coffee level not rising, it understands physics. If it says, "Looks normal," it's just guessing.
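The "chaos machine" recipe boils down to pairing a plausible scene with one deliberate physics violation. A minimal sketch of the idea, using the example prompts above; the wording and the notion of a single combined prompt are assumptions, not the paper's exact pipeline:

```python
# Each test item pairs a realistic scene with a deliberate physics violation.
# In the real pipeline these prompts would be fed to a text-to-video model.
scenes = [
    ("coffee pouring into a cup", "the liquid level never rises"),
    ("a rock dropped from a hand", "the rock drifts upward instead of falling"),
    ("a bird perched on a roof", "the bird is as big as the house"),
]


def build_prompt(scene: str, violation: str) -> str:
    """Combine a plausible scene with an impossible-physics twist."""
    return f"A realistic video of {scene}, but {violation}."


prompts = [build_prompt(scene, violation) for scene, violation in scenes]
for p in prompts:
    print(p)
```

The resulting clips are the exam questions: a model that understands physics should flag each one, while a model that only pattern-matches on surface appearance will say "looks normal."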
3. The "Thinking" Mode: Does Slowing Down Help?
The researchers tested 17 different AI models. Some were "fast thinkers" (System 1), and some were "slow thinkers" (System 2) that were forced to "think" step-by-step before answering.
- The Result: The "slow thinkers" did better, but not by much.
- The Analogy: It's like asking a student to solve a math problem. If they just guess, they get it wrong. If they write out the steps ("Thinking Mode"), they sometimes get it right. But if they never understood gravity in the first place, writing down the steps won't save them. The AI is still better at recognizing patterns than at understanding cause and effect.
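The fast-vs-slow setup amounts to two prompting styles for the same question. A minimal sketch, with hypothetical wording (the benchmark's actual prompts are not shown in this summary):

```python
QUESTION = "Does this video violate any law of physics? Answer yes or no."


def system1_prompt(question: str) -> str:
    """Fast thinker: asked to answer immediately, no intermediate steps."""
    return question


def system2_prompt(question: str) -> str:
    """Slow thinker: forced to reason step by step before answering."""
    return (
        "First, list the objects in the video and how they interact. "
        "Then check each interaction against gravity, momentum, and friction. "
        "Finally, answer the question.\n" + question
    )


print(system1_prompt(QUESTION))
print("---")
print(system2_prompt(QUESTION))
```

The paper's finding is that the System-2 style lifts scores only modestly: step-by-step scaffolding helps a model organize what it perceives, but it cannot supply the causal knowledge the model lacks.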
4. The Big Takeaway: The "Cognitive Lag"
The paper concludes that we have a Cognitive Lag.
- Perception: AI is a super-photographer. It can see every detail.
- Prediction: AI is a terrible physicist. It cannot predict what happens next because it doesn't truly understand why things happen.
In a nutshell:
Current Video AIs are like tourists with a camera. They can take a beautiful picture of a waterfall and describe the water, the rocks, and the mist. But if you ask them, "If I throw a stone here, where will it land?" they might guess wrong because they don't actually understand how water and gravity work.
HOCA-Bench is the new test that forces these tourists to stop taking pictures and start doing physics homework. It shows us that while AI is getting smarter at describing the world, it still has a long way to go before it can truly understand it.