SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

Imagine you are directing a movie scene where an actor needs to walk across a room, sit on a couch, and pick up a cup, all based on a simple instruction like, "Go sit on the couch."

Doing this in the real world is easy for humans because we instinctively know where the walls are, how high the floor is, and how our bodies interact with furniture. But teaching a computer to do this has been a nightmare. Previous methods tried to build a massive, hyper-detailed 3D digital twin of the entire room (like a giant voxel grid or a cloud of millions of points) just to figure out where the actor's feet should go. It's like trying to navigate a city by studying a microscopic map of every single brick in every building—it's incredibly heavy, slow, and computationally expensive.

Enter SceMoS (Scene-Aware Motion Synthesis).

The researchers behind this paper asked a simple question: "Do we really need to see every single brick to know how to walk?"

Their answer is a resounding no. Instead of building a heavy 3D model, they built a "smart, two-step thinking process" that uses lightweight 2D pictures to guide the actor.

Here is how it works, broken down into everyday analogies:

1. The Two-Step Brain: The Architect and the Builder

SceMoS splits the job into two distinct roles, just like a construction project:

The Architect (Global Planner):
- The Job: This part looks at the big picture. It answers: "Where is the couch? Where is the door? What is the general layout?"
- The Tool: Instead of a 3D model, it looks at a Bird's-Eye View (BEV) image. Imagine a drone hovering high up in the corner of the room, taking a photo of the floor plan.
- The Magic: It uses a super-smart AI (called DINOv2) that can "read" this photo. It understands that the brown blob is a couch and the open space is a hallway. It doesn't need to know the texture of the fabric; it just needs to know where things are. This allows it to plan the route efficiently.

The Builder (Local Execution):
- The Job: This part handles the nitty-gritty physics. It answers: "Is the floor flat here? Is there a step? How do I bend my knees to sit without falling through the chair?"
- The Tool: It uses a 2D Heightmap. Imagine a topographic map (like you see on hiking trails) that only shows the ground directly under the actor's feet. It's a simple grid showing "high" (furniture) and "low" (floor).
- The Magic: This acts as a "physics cheat sheet." It tells the actor's legs exactly how to move to stay on the ground or interact with the surface right in front of them.

2. The "Vocabulary" of Movement

One of the coolest tricks in this paper is how they teach the computer to move.

Instead of calculating every muscle movement from scratch (which is slow), they created a dictionary of movement "tokens" (like words in a sentence).

Old Way: "Calculate the angle of the knee, the velocity of the hip, the friction of the shoe..." (Too much math!).
SceMoS Way: They trained a system to learn that a specific "word" (token) means "Bend knees to sit on a surface that is 45cm high."

Because this dictionary is trained while looking at the 2D heightmap, the "words" themselves are geometry-grounded. The computer doesn't just learn "sit"; it learns "sit on this specific type of surface." This ensures the actor never walks through a wall or floats in mid-air.

3. Why This is a Game-Changer

Think of the previous methods as trying to drive a car by looking at a 3D scan of every pebble on the road. It works, but the engine (the computer) overheats, and the car moves slowly.

SceMoS is like driving with a GPS and a road map:

Efficiency: It uses lightweight 2D images instead of massive 3D point clouds. This cuts the number of trainable scene-encoding parameters by over 50% (from roughly 50M to about 4M).
Speed: It plans the route and executes the steps separately, making the whole process much faster and smoother.
Realism: Because the "Builder" checks the local heightmap constantly, the actor's feet stay planted on the ground, and they don't clip through furniture.

The Bottom Line

SceMoS proves that you don't need a supercomputer to simulate realistic human movement in a room. By using a drone's eye view for the big plan and a hiker's topographic map for the footwork, the system creates lifelike, collision-free animations that are smart, fast, and surprisingly simple.

It's the difference between trying to memorize the entire Library of Congress to find one book, versus just using a card catalog and a map to get there.

1. Problem Statement

The core challenge addressed is text-driven 3D human motion synthesis within realistic 3D scenes. Current methods face a fundamental trade-off between semantic intent (e.g., "walk to the couch") and physical feasibility (e.g., avoiding collisions, maintaining contact with the ground).

Limitations of Existing Approaches:
- Scene Representation: Most state-of-the-art (SOTA) methods rely on computationally expensive 3D representations like voxel grids, point clouds, or signed distance fields (SDFs). These require heavy 3D backbones (e.g., volumetric CNNs, Transformers) and introduce redundant spatial detail for tasks dominated by near-surface interactions.
- Entangled Learning: Current frameworks often attempt to learn high-level planning and low-level contact reasoning simultaneously within a single, entangled process. This makes training difficult, reduces generalization, and increases the number of trainable parameters significantly (often ~50M+ for scene encoding).
- Data Scarcity: Realistic scenes often have noisy, unlabeled assets, making dense 3D supervision difficult.

The authors ask: Can structured 2D scene representations provide sufficient cues for physically grounded motion synthesis without the cost of full 3D volumetric reasoning?

2. Methodology: SceMoS Framework

SceMoS proposes a two-stage, disentangled framework that separates global motion planning from local physical execution. It relies entirely on lightweight 2D scene cues rather than dense 3D data.

A. Scene Representation (2D Factorization)

Instead of 3D volumes, the scene is represented by two complementary 2D modalities:

Global Layout (Bird's-Eye-View - BEV): A single BEV RGB image rendered from an elevated corner of the scene. This is processed by a DINOv2 vision foundation model to extract semantic features ( $F_{dino}$ ). This captures spatial layout, walkable areas, and object locations (e.g., "couch," "table").
Local Geometry (Heightmap): A 2D heightmap ( $H$ ) centered on the character's root joint, representing the local surface topology. This is used to enforce fine-grained physical constraints (contact, penetration).
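
A minimal sketch of how such a root-centered heightmap could be rasterized, assuming the scene is available as a z-up array of surface points. The function name `local_heightmap`, the 1.6 m window, the 32x32 resolution, and the axis-aligned (rather than heading-aligned) crop are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def local_heightmap(points, root_xy, size=1.6, res=32, floor=0.0):
    """Rasterize scene points into a res x res max-height grid centered on root_xy.

    points  : (N, 3) array of scene surface points, z-up (assumption).
    root_xy : (2,) ground-plane position of the character's root joint.
    size    : side length of the square window in meters (hypothetical value).
    """
    half = size / 2.0
    rel = points[:, :2] - np.asarray(root_xy)        # shift into the root-centered frame
    inside = np.all(np.abs(rel) < half, axis=1)      # keep only points inside the window
    rel, z = rel[inside], points[inside, 2]
    idx = np.floor((rel + half) / size * res).astype(int)
    idx = np.clip(idx, 0, res - 1)
    hmap = np.full((res, res), floor, dtype=float)   # empty cells default to floor height
    np.maximum.at(hmap, (idx[:, 1], idx[:, 0]), z)   # per-cell maximum height
    return hmap

# Toy scene: two samples on a 45 cm seat near the character, one far-away point.
scene = np.array([[0.12, 0.12, 0.45], [0.12, 0.12, 0.20], [5.0, 5.0, 1.0]])
hmap = local_heightmap(scene, root_xy=(0.0, 0.0))
print(hmap.shape, hmap.max())
```

Taking the per-cell maximum keeps the top surface (the seat rather than the floor beneath it), which is the quantity a contact-aware decoder would plausibly care about.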

B. Stage 1: Global Motion Planner

Architecture: A causal autoregressive Transformer.
Input: Text embeddings ( $F_{text}$ from T5) + Global BEV features ( $F_{dino}$ ).
Function: Predicts a sequence of discrete motion tokens ( $\{z_i\}$ ).
Mechanism: It operates in a discrete latent space, planning the high-level trajectory and semantic intent (e.g., "move towards the table") without worrying about immediate foot placement details. It uses Classifier-Free Guidance (CFG) for robust conditioning.
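
As an illustration of classifier-free guidance for token prediction, the sketch below blends conditional and unconditional next-token logits. Applying the guidance at the logit level, the helper name `cfg_logits`, and the guidance scale are assumptions; the summary only states that CFG is used.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Push the unconditional logits toward the conditional ones by `scale`."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

# Toy 3-token vocabulary: conditioning (text + BEV) favors token 0.
cond = np.array([2.0, 0.0, 0.0])
uncond = np.array([1.0, 0.0, 0.0])

guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
print(guided, probs)
```

With `scale=1.0` the conditional logits are recovered unchanged; larger values sharpen the distribution toward tokens the conditioning prefers.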

C. Stage 2: Geometry-Grounded Motion Tokenizer

Architecture: A conditional VQ-VAE (Vector Quantized Variational Autoencoder).
Codebook: Learns a vocabulary of motion primitives ( $K=1024$ codes).
Key Innovation: Unlike standard motion tokenizers, the decoder is explicitly conditioned on the local 2D heightmap ( $H$ ) corresponding to the previous pose.
- Reconstruction: $\hat{X} = D(Z_q, H)$
- Effect: The discrete tokens in the codebook are forced to encode not just kinematic patterns but also geometry-specific behaviors (e.g., "bend knees to touch a surface at height $h$ "). This embeds surface physics directly into the token vocabulary.
Training: Uses a composite loss including motion reconstruction (MPJPE, velocities, contacts) and commitment loss.
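
The quantization and heightmap-conditioned decoding described above can be sketched in NumPy as follows. Only $K=1024$ comes from the text; the latent width `D`, pose dimension `P`, heightmap resolution `RES`, the weight `beta`, and the toy linear decoder are hypothetical stand-ins for the real networks. NumPy has no autograd, so the stop-gradient normally applied in the commitment term is only noted in a comment.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook entry (L2 distance)."""
    d = (z_e ** 2).sum(1, keepdims=True) - 2 * z_e @ codebook.T + (codebook ** 2).sum(1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def commitment_loss(z_e, z_q, beta=0.25):
    """beta * ||z_e - sg[z_q]||^2; sg (stop-gradient) is implicit without autograd."""
    return beta * np.mean((z_e - z_q) ** 2)

def decode(z_q, hmap, W_z, W_h):
    """Toy linear decoder D(Z_q, H): token embedding plus a heightmap-dependent term."""
    return z_q @ W_z + hmap.ravel() @ W_h

rng = np.random.default_rng(0)
K, D, P, RES = 1024, 64, 66, 8            # D, P, RES are hypothetical sizes
codebook = rng.normal(size=(K, D))
z_e = codebook[[3, 7]] + 1e-3 * rng.normal(size=(2, D))   # encoder outputs near codes 3 and 7

idx, z_q = quantize(z_e, codebook)
hmap = np.zeros((RES, RES))
W_z, W_h = rng.normal(size=(D, P)), rng.normal(size=(RES * RES, P))
x_hat = decode(z_q, hmap, W_z, W_h)
print(idx, x_hat.shape, commitment_loss(z_e, z_q))
```

Because the heightmap enters the decoder rather than the encoder, the same token can decode to different poses over different surfaces, which is the sense in which the vocabulary is geometry-grounded.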

D. Inference Loop & Trajectory Refinement

Autoregressive Generation: The planner generates tokens based on text and BEV.
Decoding: Each token is decoded into a continuous motion segment using the geometry-conditioned decoder.
Recalculation: After decoding a segment, the system updates the character's position, recalculates the local heightmap, and re-renders the BEV snapshot. This allows the planner to extend long-horizon trajectories seamlessly.
Refinement: A lightweight trajectory refinement module predicts smoothed root velocities to eliminate foot-sliding artifacts caused by minor trajectory estimation errors.
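
The generate, decode, and recalculate loop above can be outlined as a short driver. All four callables (`plan_next_token`, `decode_segment`, `render_bev`, `crop_heightmap`) are hypothetical placeholders for the trained planner, decoder, and renderers. The toy decoder returns per-frame root displacements only, and the refinement module is omitted.

```python
import numpy as np

def synthesize(plan_next_token, decode_segment, render_bev, crop_heightmap,
               text_feat, root_xy, n_segments):
    """Autoregressively plan tokens and decode them against refreshed scene cues."""
    root_xy = np.asarray(root_xy, dtype=float)
    bev_feat = render_bev(root_xy)          # global layout cue
    hmap = crop_heightmap(root_xy)          # local geometry cue
    tokens, segments = [], []
    for _ in range(n_segments):
        token = plan_next_token(text_feat, bev_feat, tokens)   # Stage 1: plan
        segment = decode_segment(token, hmap)                  # Stage 2: decode
        root_xy = root_xy + segment.sum(axis=0)                # update character position
        hmap = crop_heightmap(root_xy)                         # recalculate local heightmap
        bev_feat = render_bev(root_xy)                         # re-render BEV snapshot
        tokens.append(token)
        segments.append(segment)
    return tokens, np.concatenate(segments), root_xy

# Toy stand-ins: each segment is 4 frames moving 5 cm per frame along x.
tokens, motion, final_xy = synthesize(
    plan_next_token=lambda text, bev, history: len(history) % 4,
    decode_segment=lambda token, hmap: np.tile([0.05, 0.0], (4, 1)),
    render_bev=lambda xy: np.zeros(8),
    crop_heightmap=lambda xy: np.zeros((32, 32)),
    text_feat=np.zeros(16), root_xy=(0.0, 0.0), n_segments=3)
print(tokens, motion.shape, final_xy)
```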

3. Key Contributions

Disentangled 2D Framework: A novel architecture that separates semantic planning (via BEV) from physical execution (via local heightmaps), eliminating the need for dense 3D volumetric inputs.
Geometry-Grounded Tokenization: A conditional VQ-VAE that learns a motion vocabulary where discrete tokens inherently encode physical interactions with local surface geometry, bridging the gap between linguistic intent and physical reality.
Efficiency-Fidelity Trade-off: Demonstrates that 2D projections (BEV + Heightmaps) capture sufficient affordance and geometry for high-quality synthesis, reducing trainable scene parameters by >50% (from ~50M to ~4M) compared to voxel/point-cloud baselines.
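
A quick arithmetic check of the parameter savings, using only the approximate counts quoted in this summary (in millions). The label "generic 3D encoder" for the ~50M figure is a placeholder name.

```python
def reduction_percent(before_m, after_m):
    """Percentage of parameters removed when going from before_m to after_m."""
    return 100.0 * (before_m - after_m) / before_m

SCEMOS_M = 4   # ~4M trainable scene parameters
baselines_m = {"generic 3D encoder": 50, "TRUMANS": 86, "Humanise/SceneDiffuser": 55}

savings = {name: round(reduction_percent(m, SCEMOS_M), 1) for name, m in baselines_m.items()}
print(savings)
# {'generic 3D encoder': 92.0, 'TRUMANS': 95.3, 'Humanise/SceneDiffuser': 92.7}
```

On these figures the ">50%" claim is conservative: every comparison is a reduction of more than 90%.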

4. Experimental Results

Evaluated on the TRUMANS dataset (100 indoor scenes, 15 hours of motion data).

Quantitative Performance:
- Motion Realism: Achieved the lowest Fréchet Inception Distance (FID = 0.31), outperforming 3D-representation baselines such as TRUMANS (0.34) and SceneDiffuser (0.75).
- Contact Accuracy: Achieved the highest physical foot contact score (0.98) and lowest penetration rates, matching or exceeding SOTA methods.
- Efficiency: Requires only ~4M trainable scene parameters, compared to ~86M for TRUMANS and ~55M for Humanise/SceneDiffuser.

Ablation Studies:
- Removing the two-stage design (A5) significantly degraded fidelity and contact.
- Replacing DINOv2 with CLIP (A6) resulted in poor motion fidelity, highlighting the importance of DINOv2's spatial layout understanding.
- Using 3D voxel grids (A3) instead of 2D heightmaps offered no significant benefit and increased complexity.
- The trajectory refinement module was crucial for reducing foot-sliding.

Qualitative Results: Visualizations show SceMoS generates semantically consistent motions (e.g., sitting correctly on a chair) with stable contact, whereas baselines often exhibit surface penetrations or misalignment.
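
For reference, the Fréchet distance behind the FID score compares Gaussian fits to real and generated feature sets. The standard definition is reproduced below. The feature extractor used for motion is not stated in this summary, so $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ simply denote the mean and covariance of the real and generated features.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower is better: two identical distributions give a distance of 0.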

5. Significance and Impact

Paradigm Shift: Challenges the prevailing assumption that high-fidelity 3D human-scene interaction requires heavy 3D scene representations. The results indicate that structured 2D cues can be sufficient for grounding 3D motion in the indoor scenes evaluated.
Scalability: By reducing the computational burden of scene encoding by an order of magnitude, SceMoS makes scene-aware motion synthesis more scalable and accessible for applications in robotics, VR, and animation.
Generalization: The method demonstrates that separating planning from execution allows for better generalization to complex, noisy scenes without requiring explicit 3D labeling.

Limitations: The current system assumes static scenes and focuses on macro-scale interactions (walking, sitting). Fine-grained manipulation (e.g., grasping small objects) and outdoor/uneven terrain scenarios remain challenges for future work.