Joint Optimization for 4D Human-Scene Reconstruction in the Wild

Imagine you are watching a home video of a friend walking through a busy city park. They sit on a bench, jump over a puddle, and high-five a stranger.

Now, imagine trying to turn that flat, 2D video into a 3D movie where you can walk around the characters, see the trees from behind, and understand exactly how their feet touch the ground. This is incredibly hard for computers because the camera is moving, the people are moving, and the background appears to shift in every frame as a result. It's like trying to solve three different jigsaw puzzles at the same time while someone keeps shaking the table.

This paper introduces JOSH (Joint Optimization of Scene Geometry and Human Motion), a new "super-solver" that solves all three puzzles at once.

Here is how it works, broken down with simple analogies:

1. The Old Way: The "Assembly Line" Mistake

Before JOSH, computers tried to solve this problem in steps, like an assembly line:

Step 1: Guess where the camera is.
Step 2: Guess where the person is.
Step 3: Guess what the background looks like.

The Problem: If you make a tiny mistake in Step 1, it ruins Step 2 and Step 3.

Analogy: Imagine trying to build a house by first guessing the foundation, then guessing the walls, then guessing the roof. If your foundation is slightly off, the walls lean, and the roof falls off. The result is a wobbly, unrealistic house where the person's feet might float in mid-air or sink through the floor.

2. The JOSH Way: The "Group Hug"

JOSH changes the game. Instead of doing things one by one, it looks at the entire picture and adjusts everything simultaneously.

The Secret Sauce: The "Handshake" (Contact)
The biggest clue JOSH uses is touch. When a person sits on a bench or steps on a sidewalk, their body and the world are physically touching.

Analogy: Think of the person and the scene as two people holding hands. If one person moves, the other must move with them to keep holding hands.
JOSH uses this "handshake" as a rule. If the computer thinks the person's foot is floating, JOSH says, "Wait, the ground is right there! Pull the foot down." If the computer thinks the ground is too far away, JOSH says, "No, the person is touching it! Pull the ground closer."

By constantly checking these "handshakes" (contacts), JOSH forces the camera, the person, and the background to agree with each other. They refine each other until the whole scene makes physical sense.

3. What JOSH Actually Does

JOSH takes a regular video from the internet (like a YouTube clip) and outputs three things at once:

The Camera: It figures out exactly how the camera moved through the scene.
The Person: It creates a 3D digital twin of the person, showing exactly how they moved in the real world (not just on the screen).
The World: It builds a detailed 3D map of the background (buildings, trees, sidewalks).

The Result: You get a "4D" reconstruction (3D space + time) where the physics feel real. The person doesn't slide across the floor like a ghost; they grip the ground. They don't walk through walls; they bump into them.

4. Why This Matters: The "Teacher" Analogy

The paper also shows something amazing about JOSH3R, a faster, neural-network version of JOSH.

The Problem: We don't have enough "perfect" 3D videos to teach AI how to do this. Most real-world videos don't have a "correct answer" sheet.
The Solution: JOSH is so good at solving the puzzle that it can act as a super-teacher. It watches thousands of messy, real-world videos and writes its own "answer keys" (labels).
The Payoff: The authors trained a new, fast AI (JOSH3R) using these self-made answer keys. Surprisingly, this AI learned better from the messy web videos than it did from small, perfect lab datasets.

Analogy: Imagine a student who only learns from perfect textbooks. They might fail in the real world. But if you have a genius tutor (JOSH) who can look at real-life chaos and explain the rules, the student learns much faster and becomes an expert in the real world.

Summary

The Goal: Turn flat videos into realistic 3D worlds with moving people.
The Innovation: Instead of solving the camera, person, and background separately, JOSH solves them all together, using physical touches (like feet on the ground) to keep everything consistent.
The Impact: It creates much more realistic 3D reconstructions and can teach new AI models to understand human movement in the real world without needing expensive, perfect data.

In short, JOSH is the tool that finally lets computers understand that people and places are connected, and you can't understand one without the other.

1. Problem Definition

The paper addresses the challenge of 4D human-scene reconstruction from monocular "in-the-wild" web videos. The goal is to simultaneously recover:

Global Human Motion: The 4D trajectory (3D position + time) of one or multiple humans in world coordinates, represented by SMPL parameters.
Dense Scene Reconstruction: The 3D geometry of the surrounding environment as a dense point cloud.
Camera Poses: The extrinsic parameters (position and orientation) and intrinsics (focal length) of the moving camera.

Key Challenges:

Entanglement: In monocular videos, camera motion and human motion are entangled: the same on-screen movement can come from the person, the camera, or both, so they cannot be separated without strong constraints.
Lack of Ground Truth: Web videos lack metric-scale ground truth for scene geometry or human motion.
Inconsistency in Prior Methods: Existing approaches typically perform sequential optimization (e.g., reconstructing the scene first, then fitting the human, or optimizing them separately). This leads to physically implausible results, such as foot sliding, foot penetration into the ground, or inconsistent scales between the human and the scene.

2. Methodology: JOSH Framework

The authors propose JOSH (Joint Optimization of Scene Geometry and Human Motion), a general framework that optimizes all parameters jointly in a single stage.

2.1 Core Insight

The central insight is that human-scene contact (e.g., feet touching the ground, hands touching walls) provides strong geometric constraints. These contacts bridge the gap between the human mesh, the scene point cloud, and the camera pose, allowing them to mutually refine each other.

2.2 Pipeline Overview

Initialization:
- Scene: Uses off-the-shelf dense reconstruction models (e.g., MASt3R, MonST3R, DROID-SLAM) to generate initial point maps and correspondences.
- Human: Uses human mesh recovery models (e.g., VIMO, WHAM, HMR2.0) to estimate local SMPL parameters.
- Contact: Uses a contact prediction model (BSTRO) to identify vertices on the human mesh likely to be in contact with the scene.
- Masking: A video segmentation model (DEVA) is used to mask out moving humans from the scene reconstruction to prevent noise in the background geometry.
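The masking step can be pictured as boolean indexing over a per-pixel point map. The following is a minimal sketch, assuming the point map is an (H, W, 3) NumPy array and the segmentation mask is an (H, W) boolean array. `static_scene_points` is a hypothetical helper for illustration, not a function of DEVA or MASt3R.

```python
import numpy as np

def static_scene_points(point_map: np.ndarray, human_mask: np.ndarray) -> np.ndarray:
    """Keep only 3D points whose pixels are NOT covered by a human mask.

    point_map:  (H, W, 3) array, one 3D point per pixel (assumed layout).
    human_mask: (H, W) boolean array, True where a person was segmented.
    Returns an (N, 3) array of background points.
    """
    if point_map.shape[:2] != human_mask.shape:
        raise ValueError("point_map and human_mask must share H and W")
    return point_map[~human_mask]

# Toy 2x2 frame in which the top-left pixel belongs to a person.
point_map = np.arange(12, dtype=float).reshape(2, 2, 3)
human_mask = np.array([[True, False], [False, False]])
background = static_scene_points(point_map, human_mask)  # three points remain
```

Dropping the masked pixels before scene alignment keeps moving bodies from being treated as static geometry.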
Joint Optimization:
Instead of sequential steps, JOSH minimizes a unified loss function $\mathcal{L}$ over all parameters ( $\{K_t, P_t, \sigma_t, Z_t, \Theta^t_c\}$ ) simultaneously using a gradient-based optimizer (Adam).
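To illustrate why optimizing everything together helps, here is a deliberately tiny sketch with only two unknowns: a scene scale `sigma` and a human foot depth `z`. The loss, the variable names, and the numbers are all assumptions made for illustration, and plain gradient descent stands in for Adam to keep the example dependency-free. It is not the paper's actual objective.

```python
def joint_fit(d0=2.0, z_prior=5.0, lr=0.05, steps=500):
    """Jointly fit scene scale `sigma` and human foot depth `z`.

    Toy loss: (z - z_prior)**2 + (sigma * d0 - z)**2
      - first term: a human prior on metric depth,
      - second term: contact, the scaled scene depth must meet the foot.
    `d0` is the up-to-scale scene depth at the contact pixel (hypothetical).
    """
    sigma, z = 1.0, 3.0  # deliberately poor initial guesses
    for _ in range(steps):
        gap = sigma * d0 - z                       # contact residual
        grad_sigma = 2.0 * gap * d0                # d(loss)/d(sigma)
        grad_z = 2.0 * (z - z_prior) - 2.0 * gap   # d(loss)/d(z)
        # Both variables are updated in the same step (joint optimization).
        sigma -= lr * grad_sigma
        z -= lr * grad_z
    return sigma, z

sigma, z = joint_fit()
```

In this toy setup the contact term is the only thing that ties the scene scale to the human: `z` settles at the prior depth of 5 and `sigma` is pulled to 2.5 so that the scaled scene meets the foot.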

The total loss is composed of:
- $\mathcal{L}_{scene}$ : Standard 3D correspondence and 2D reprojection losses for the static background.
- $\mathcal{L}_{human}$ : Includes temporal smoothness, SMPL priors, and 2D keypoint reprojection losses.
- $\mathcal{L}_{contact}$ (Key Contribution): Two specific losses leveraging contact labels:
  - Contact Scene Loss ( $\mathcal{L}_{c1}$ ): Forces the 3D position of a predicted contact vertex on the human mesh to be spatially close to the corresponding point in the dense scene point cloud. This anchors the human to the scene metric scale.
  - Contact Static Loss ( $\mathcal{L}_{c2}$ ): Enforces that contact points remain static relative to the scene across frames (e.g., a foot planted on the ground should not slide), reducing foot sliding artifacts.
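The two contact terms can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the "corresponding" scene point is approximated by the nearest neighbour in the point cloud, distances are plain Euclidean norms, and the function names are hypothetical. The paper's exact correspondence rule and weighting are not reproduced here.

```python
import numpy as np

def contact_scene_loss(contact_verts: np.ndarray, scene_points: np.ndarray) -> float:
    """Mean distance from each contact vertex to its nearest scene point.

    contact_verts: (V, 3) world-space vertices predicted to be in contact.
    scene_points:  (P, 3) world-space scene point cloud.
    """
    diffs = contact_verts[:, None, :] - scene_points[None, :, :]  # (V, P, 3)
    nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)          # (V,)
    return float(nearest.mean())

def contact_static_loss(verts_prev, verts_next, in_contact) -> float:
    """Mean displacement between two frames of vertices flagged as in contact.

    in_contact: (V,) boolean array, True if the vertex touches the scene
    in both frames (assumed to be supplied by a contact predictor).
    """
    if not in_contact.any():
        return 0.0
    slide = np.linalg.norm(verts_next - verts_prev, axis=-1)
    return float(slide[in_contact].mean())

# Flat ground sampled on a 3x3 grid at height z = 0.
xs, ys = np.meshgrid(np.arange(3.0), np.arange(3.0))
ground = np.stack([xs.ravel(), ys.ravel(), np.zeros(9)], axis=1)

# Two foot vertices floating 0.1 above the ground, then sliding 0.03 along x.
feet_t0 = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.1]])
feet_t1 = feet_t0 + np.array([0.03, 0.0, 0.0])
both_in_contact = np.array([True, True])

floating = contact_scene_loss(feet_t0, ground)                    # penalizes the 0.1 gap
sliding = contact_static_loss(feet_t0, feet_t1, both_in_contact)  # penalizes the 0.03 slide
```

Minimizing the first term pulls the foot and the ground together; minimizing the second keeps a planted foot from drifting between frames.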
Focal Length Optimization:
Unlike prior works that assume a fixed focal length, JOSH jointly optimizes the camera focal length ( $f$ ) alongside the local root depth. This ensures consistency between the depth estimation and the camera intrinsics, which is critical for metric accuracy in uncalibrated web videos.
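The coupling between focal length and depth follows from the pinhole camera model: the apparent size of an object depends only on the ratio of focal length to depth. The sketch below uses a standard pinhole projection with made-up numbers (a 1.7 m person, focal lengths in pixels) to show why a wrong focal length silently becomes a wrong depth. The function names are hypothetical.

```python
def image_height_px(focal_px: float, height_m: float, depth_m: float) -> float:
    """Pinhole projection: apparent height in pixels of an upright object."""
    return focal_px * height_m / depth_m

def depth_from_height(focal_px: float, height_m: float, height_px: float) -> float:
    """Invert the projection: depth implied by an observed pixel height."""
    return focal_px * height_m / height_px

# A 1.7 m person seen with the true focal length at the true depth.
observed = image_height_px(focal_px=1000.0, height_m=1.7, depth_m=5.0)

# Doubling both focal length and depth gives the same image height,
# so the image alone cannot tell the two configurations apart.
ambiguous = image_height_px(focal_px=2000.0, height_m=1.7, depth_m=10.0)

# Assuming a wrong focal length (1500 px) places the person too far away.
wrong_depth = depth_from_height(focal_px=1500.0, height_m=1.7, height_px=observed)
```

Here a 50% error in focal length yields a 50% error in the recovered depth (7.5 m instead of 5 m), which is why treating the focal length as a free variable matters for uncalibrated footage.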

2.3 JOSH3R: End-to-End Prediction

To enable real-time inference and scalable training, the authors introduce JOSH3R, an end-to-end neural network.

Training Data: JOSH is used to generate high-quality pseudo-labels (global motion and scene geometry) from ~20 hours of diverse web videos.
Architecture: Built upon MASt3R (a geometric foundation model), JOSH3R adds a lightweight "human trajectory head" to predict relative human transformations ( $\Delta T$ ) between frames.
Inference: It predicts global human motion and camera poses directly in a feed-forward manner, without the need for iterative optimization, achieving real-time speeds (15.4 FPS).
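Per-frame relative transformations become a global trajectory once they are chained together. The sketch below shows one way to do that with 4x4 homogeneous matrices. It assumes each $\Delta T$ is expressed in the previous frame's coordinates, so that `T_next = T_prev @ dT`; this convention and the helper names are assumptions for illustration, not details confirmed by the source.

```python
import numpy as np

def make_transform(yaw_rad: float, translation) -> np.ndarray:
    """Build a 4x4 rigid transform from a yaw rotation (about z) and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def accumulate(relative_transforms, start=None):
    """Chain relative transforms into a list of global poses (start pose included)."""
    T = np.eye(4) if start is None else np.array(start, dtype=float)
    trajectory = [T.copy()]
    for dT in relative_transforms:
        T = T @ dT  # assumed convention: dT lives in the previous frame
        trajectory.append(T.copy())
    return trajectory

# Walk one metre forward and turn left by 90 degrees, four times: a square path.
step = make_transform(np.pi / 2, [1.0, 0.0, 0.0])
poses = accumulate([step] * 4)
positions = np.array([T[:3, 3] for T in poses])
```

Because each pose is built on the previous one, small per-step errors accumulate along the trajectory, which is one reason the quality of the training labels matters for a chained predictor.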

3. Key Contributions

Unified Optimization Framework: The authors present JOSH as the first framework to jointly optimize camera poses, global human motion (multi-person), and dense scene geometry in a single stage, utilizing human-scene contact as a primary constraint.
Contact-Based Constraints: The introduction of Contact Scene Loss and Contact Static Loss significantly improves physical plausibility, substantially reducing common artifacts like foot sliding and scene penetration.
Scalable Training Pipeline: By using JOSH to generate pseudo-labels from unstructured web data, the authors demonstrate that end-to-end models (JOSH3R) trained on this data outperform models trained on smaller, curated ground-truth datasets.
Metric Scale Recovery: The framework successfully recovers metric-scale scene and motion without requiring pre-calibrated cameras or LiDAR, by leveraging the physical constraints of human contact.

4. Experimental Results

The method was evaluated on the SLOPER4D, EMDB, and RICH datasets.

4D Reconstruction Quality:
- JOSH significantly outperforms the baseline SynCHMR (which uses sequential optimization).
- Physics Plausibility: JOSH reduces Foot Sliding (FS) from 67.4mm to 28.2mm and Foot Floating Rate (FFR) from 9.0% to 2.9% compared to baselines.
- Scene Accuracy: On SLOPER4D, JOSH (initialized with MASt3R) reduced the Chamfer Distance by 70.1% compared to the baseline.
Global Human Motion Estimation:
- JOSH sets a new state of the art (SOTA) on the EMDB dataset.
- Using VIMO initialization, JOSH achieved a W-MPJPE of 174.7mm, surpassing previous SOTA methods like TRAM (222.4mm) and WHAM (335.3mm).
- It also showed superior performance in Root Translation Error (RTE).
Scalable Training (JOSH3R):
- A model trained on web videos pseudo-labeled by JOSH outperformed a model trained on the ground-truth EMDB dataset by 59.2% in WA-MPJPE.
- JOSH3R achieves 15.4 FPS, enabling real-time inference, whereas the iterative JOSH optimization runs at 0.8 FPS.
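As a quick sanity check, the relative improvements implied by the figures quoted in this section can be recomputed directly. The snippet only uses numbers stated above; the helper name is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

foot_sliding = relative_reduction(67.4, 28.2)   # FS in mm, roughly 58%
foot_floating = relative_reduction(9.0, 2.9)    # FFR in %, roughly 68%
vs_tram = relative_reduction(222.4, 174.7)      # W-MPJPE in mm, roughly 21%
vs_wham = relative_reduction(335.3, 174.7)      # W-MPJPE in mm, roughly 48%
speedup = 15.4 / 0.8                            # JOSH3R vs. JOSH, roughly 19x

for name, value in [
    ("foot sliding reduction (%)", foot_sliding),
    ("foot floating reduction (%)", foot_floating),
    ("W-MPJPE reduction vs TRAM (%)", vs_tram),
    ("W-MPJPE reduction vs WHAM (%)", vs_wham),
    ("speedup (x)", speedup),
]:
    print(f"{name}: {value:.1f}")
```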

5. Significance and Impact

Bridging the Gap: JOSH effectively bridges the gap between constrained, sensor-rich 3D reconstruction and the chaotic reality of "in-the-wild" web videos.
Data Efficiency: It demonstrates that high-quality 3D data can be synthesized from unstructured web videos, potentially solving the data scarcity problem for training large-scale 4D human-scene models.
Physical Realism: By enforcing physical contact constraints, the method produces reconstructions that are not just geometrically accurate but physically plausible, which is crucial for applications like autonomous driving simulation, urban planning, and AR/VR.
Generalizability: The framework is modular and compatible with various state-of-the-art initialization models (e.g., MASt3R, DROID-SLAM), suggesting it can improve as the underlying components improve.

In conclusion, JOSH replaces sequential optimization with joint optimization in 4D reconstruction, leveraging the physical reality of human-scene interaction to improve the accuracy and consistency of reconstructions from monocular video.