MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

The paper introduces MLLM-4D, a framework that enhances multimodal large language models' 4D spatial-temporal reasoning from 2D RGB inputs by curating specialized datasets and employing a post-training strategy combining supervised fine-tuning with GRPO-based reinforcement learning.

Xingyilang Yin, Chengzhengxu Li, Jiahao Chang, Chi-Man Pun, Xiaodong Cun

Published 2026-03-03

Imagine you are watching a movie on a flat TV screen. You can see a skateboarder zooming past, but the screen is just a 2D picture. A regular computer (or a standard AI) sees only pixels moving left and right. It doesn't truly "know" that the skateboarder is actually moving toward the camera, or how far away they are in real 3D space.

MLLM-4D is like giving that computer a pair of "3D glasses" and a "time machine" simultaneously. It teaches the AI to stop just looking at the picture and start imagining the physics of the scene.

Here is the paper broken down into simple concepts and analogies:

1. The Problem: The "Flat World" AI

Current AI models are like people who have only ever lived in a 2D comic book. They are great at recognizing that a dog is in a picture, but if the dog runs toward the camera, the AI might just think the dog is getting "bigger" on the page. It struggles to understand:

  • Depth: How far away is the object?
  • Time: How did the object get there?
  • Motion: Is the object moving, or is the camera moving?

Humans do this naturally. We look at a video and instantly know, "That car is 10 meters away and closing in fast." This paper calls that ability 4D Intelligence (3D Space + Time).

2. The Solution: MLLM-4D

The researchers built a new training system called MLLM-4D. Think of it as a "Gym for AI Brains" designed specifically to teach them how to navigate a 3D world over time.

They didn't just build a new brain; they built a new gym with three specific areas:

A. The Data Factory (The "Stereo Video" Machine)

To teach an AI to see in 3D, you need 3D data. But labeling 3D data by hand is like trying to paint a masterpiece with a toothbrush—it's slow and expensive.

  • The Analogy: Imagine you have a pile of old, flat comic books (monocular videos). The researchers built a machine that takes these flat comics and uses special "stereo" lenses (like 3D glasses) to reconstruct the 3D world behind them.
  • What they did: They took existing video datasets and automatically calculated the exact 3D coordinates of every object and the camera for every single frame. This created a massive library of 2 million practice problems (the MLLM4D-2M dataset) where the AI can learn the rules of physics.
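The paper's reconstruction pipeline isn't spelled out here, but the core geometric idea behind auto-labeling 3D coordinates from flat video is the standard pinhole camera model: a pixel plus an estimated depth can be "lifted" into a 3D point. A minimal sketch (the function name and camera parameters are illustrative, not from the paper):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with an estimated depth (meters) into a 3D
    point in camera coordinates, using the pinhole camera model.
    fx/fy are focal lengths in pixels; (cx, cy) is the principal point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical example: a pixel at the image center, 2.4 m away
point = backproject(u=960, v=540, depth=2.4,
                    fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(point)  # → [0.  0.  2.4]
```

Run this for every labeled pixel in every frame and you get exactly the kind of per-frame 3D coordinate library the paper describes, without any hand annotation.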

B. The "Thinking" Coach (ST-CoT)

Before the AI can solve a hard math problem, it needs to learn how to show its work.

  • The Analogy: Instead of just asking the AI "How far is the skateboarder?", the researchers taught it to write a step-by-step diary before answering.
  • The Method: They use a technique called Spatiotemporal Chain of Thought (ST-CoT). The AI is forced to say:
    1. Where was the camera at the start? (Coordinates)
    2. Where was the skateboarder at the start? (Coordinates)
    3. What happened in between? (Did the skateboarder get bigger? Did the background shift?)
    4. Where are they now?
    5. Therefore, the distance is X.
    This forces the AI to act like a visual physics engine rather than a guesser.
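The final arithmetic step of that "diary" is simple once the coordinates exist: the distance is just the Euclidean distance between the camera's and the object's tracked 3D positions in the last frame. A tiny sketch with made-up coordinates (the numbers are illustrative, not from the paper):

```python
import math

def distance_3d(camera_pos, object_pos):
    """Last step of the ST-CoT 'diary': Euclidean distance between the
    camera's and the object's 3D positions (meters) in the final frame."""
    return math.dist(camera_pos, object_pos)

# Hypothetical tracked positions in the final frame
camera = (0.0, 1.5, 0.0)         # camera 1.5 m above the ground
skateboarder = (0.8, 1.5, 2.26)  # slightly right, 2.26 m ahead
print(round(distance_3d(camera, skateboarder), 1))  # → 2.4
```

The point of ST-CoT is that the model must produce the intermediate coordinates before this last step, so the answer is derived rather than guessed.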

C. The Referee (ST-Reward)

In normal AI training, if the AI gets the right answer, it gets a gold star. But in 4D, the AI might get the right answer by luck (guessing "2 meters" when the answer is "2 meters") but have the wrong reasoning.

  • The Analogy: Imagine a referee in a sports game. If a player scores a goal but tripped the referee, the goal doesn't count.
  • The Method: The researchers created a special Spatiotemporal Reward. The AI gets points not just for the right answer, but for correctly calculating the 3D coordinates in its "diary." If the AI hallucinates (makes up) a movement that violates physics, it gets penalized. This ensures the AI learns the truth about how space and time work.
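The paper's exact reward formula isn't reproduced here, but the idea can be sketched as a score that credits both the final answer and the intermediate coordinates in the reasoning trace, so a lucky guess with fabricated geometry earns little (all names, tolerances, and weights below are illustrative assumptions):

```python
def st_reward(pred_answer, true_answer, pred_coords, true_coords,
              answer_tol=0.1, coord_tol=0.25):
    """Simplified sketch of a spatiotemporal reward: half the credit for
    the final answer, half for the intermediate 3D coordinates in the
    model's reasoning trace matching ground truth."""
    answer_score = 1.0 if abs(pred_answer - true_answer) <= answer_tol else 0.0
    # Fraction of reasoning-trace coordinates within tolerance of ground truth
    hits = sum(
        1 for p, t in zip(pred_coords, true_coords)
        if all(abs(pi - ti) <= coord_tol for pi, ti in zip(p, t))
    )
    coord_score = hits / max(len(true_coords), 1)
    return 0.5 * answer_score + 0.5 * coord_score

# Lucky guess: right answer ("2 meters"), hallucinated coordinates
print(st_reward(2.0, 2.0, [(9, 9, 9)], [(0, 0, 2)]))  # → 0.5
# Honest reasoning: right answer AND right coordinates
print(st_reward(2.0, 2.0, [(0, 0, 2)], [(0, 0, 2)]))  # → 1.0
```

In a GRPO-style setup this scalar would replace the usual answer-only reward, which is what pushes the model to get the physics right rather than just the number.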

3. The Results: From "Guessing" to "Knowing"

When they tested this new AI:

  • Old AI: Looked at a video of a skateboarder and guessed the distance based on how "big" the skateboarder looked. It often got it wrong.
  • MLLM-4D: Looked at the video, calculated the camera's movement, tracked the skateboarder's 3D path, and gave a precise distance (e.g., "2.4 meters").

It outperformed even the most expensive, "closed-source" AI models (like the ones from Google or OpenAI) on these specific 3D-time tasks.

Summary Analogy

Think of the old AI as a tourist looking at a map. They can see the lines and the names, but they don't know how far it is to walk or how long it takes.

MLLM-4D is like giving that tourist a GPS, a pedometer, and a stopwatch all at once. It doesn't just see the map; it understands the journey. It can tell you, "The skateboarder is 2.4 meters away because I tracked their movement frame-by-frame and calculated the physics of their path."

This is a huge step forward for robots, self-driving cars, and VR, because these systems need to understand not just what is in front of them, but how it is moving through space and time.