SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens

SceMoS is a scene-aware 3D human motion synthesis framework that achieves state-of-the-art realism and contact accuracy by disentangling global planning and local execution through efficient 2D scene representations (BEV images and heightmaps), thereby eliminating the need for computationally expensive 3D volumetric data while reducing trainable parameters by over 50%.

Anindita Ghosh, Vladislav Golyanik, Taku Komura, Philipp Slusallek, Christian Theobalt, Rishabh Dabral

Published 2026-02-25

Imagine you are directing a movie scene where an actor needs to walk across a room, sit on a couch, and pick up a cup, all based on a simple instruction like, "Go sit on the couch."

Doing this in the real world is easy for humans because we instinctively know where the walls are, how high the floor is, and how our bodies interact with furniture. But teaching a computer to do this has been a nightmare. Previous methods tried to build a massive, hyper-detailed 3D digital twin of the entire room (like a giant voxel grid or a cloud of millions of points) just to figure out where the actor's feet should go. It's like trying to navigate a city by studying a microscopic map of every single brick in every building—it's incredibly heavy, slow, and computationally expensive.

Enter SceMoS (Scene-Aware Motion Synthesis).

The researchers behind this paper asked a simple question: "Do we really need to see every single brick to know how to walk?"

Their answer is a resounding no. Instead of building a heavy 3D model, they built a "smart, two-step thinking process" that uses lightweight 2D pictures to guide the actor.

Here is how it works, broken down into everyday analogies:

1. The Two-Step Brain: The Architect and the Builder

SceMoS splits the job into two distinct roles, just like a construction project:

  • The Architect (Global Planner):

    • The Job: This part looks at the big picture. It answers: "Where is the couch? Where is the door? What is the general layout?"
    • The Tool: Instead of a 3D model, it looks at a Bird's-Eye View (BEV) image. Imagine a drone hovering directly above the room, taking a top-down photo of the floor plan.
    • The Magic: It uses a pretrained vision model (DINOv2) that can "read" this photo. It understands that the brown blob is a couch and the open space is a hallway. It doesn't need to know the texture of the fabric; it just needs to know where things are. This allows it to plan the route efficiently.
  • The Builder (Local Execution):

    • The Job: This part handles the nitty-gritty physics. It answers: "Is the floor flat here? Is there a step? How do I bend my knees to sit without falling through the chair?"
    • The Tool: It uses a 2D Heightmap. Imagine a topographic map (like you see on hiking trails) that only shows the ground directly under the actor's feet. It's a simple grid showing "high" (furniture) and "low" (floor).
    • The Magic: This acts as a "physics cheat sheet." It tells the actor's legs exactly how to move to stay on the ground or interact with the surface right in front of them.
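
To make the "topographic map under the actor's feet" idea concrete, here is a minimal sketch of how a local heightmap could be rasterized from scene points. This is an illustration, not the paper's pipeline: the function name `local_heightmap`, the 4x4 grid, the 0.5 m cell size, and the (x, y, z) point format with z pointing up are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # assumed format: (x, y, z) with z pointing up


def local_heightmap(points: List[Point], center: Tuple[float, float],
                    size: int = 4, cell: float = 0.5) -> List[List[float]]:
    """Keep the tallest point in each cell of a size x size grid around center.

    Cells that receive no points stay at 0.0, i.e. they are treated as bare floor.
    """
    half = size * cell / 2.0
    grid = [[0.0] * size for _ in range(size)]
    for x, y, z in points:
        # Shift into grid coordinates so the actor sits at the grid center.
        col = int((x - center[0] + half) // cell)
        row = int((y - center[1] + half) // cell)
        # Points outside the local window are simply ignored.
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = max(grid[row][col], z)
    return grid


# Two points on a chair seat near the actor, plus one far-away point.
scene = [(0.2, 0.2, 0.45), (0.3, 0.1, 0.40), (5.0, 5.0, 1.0)]
hm = local_heightmap(scene, center=(0.0, 0.0))
for row in hm:
    print(row)
```

Only the cell containing the chair seat ends up "high"; the distant point falls outside the window, which is the sense in which the Builder only looks at what is right around the actor.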

2. The "Vocabulary" of Movement

One of the coolest tricks in this paper is how they teach the computer to move.

Instead of calculating every muscle movement from scratch (which is slow), they created a dictionary of movement "tokens" (like words in a sentence).

  • Old Way: "Calculate the angle of the knee, the velocity of the hip, the friction of the shoe..." (Too much math!).
  • SceMoS Way: They trained a system to learn that a specific "word" (token) means "Bend knees to sit on a surface that is 45 cm high."

Because this dictionary is trained while looking at the 2D heightmap, the "words" themselves are geometry-grounded. The computer doesn't just learn "sit"; it learns "sit on this specific type of surface." This grounding is what helps keep the actor from walking through a wall or floating in mid-air.
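
One common way to build such a dictionary is vector quantization: each continuous motion feature is replaced by the index of its nearest entry in a learned codebook. The sketch below shows only that lookup step, with the local surface height appended to the pose feature so that geometry influences which token is chosen. It is a toy under stated assumptions: this summary does not spell out SceMoS's exact tokenizer, and the codebook values, the three-number feature layout, and the names `nearest_token` and `tokenize` are hypothetical.

```python
from typing import List, Sequence


def nearest_token(feature: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to feature (squared L2)."""
    def dist(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(feature, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Hypothetical features: (knee bend, hip drop, surface height in metres).
# A real codebook would be learned from data, not written by hand.
CODEBOOK = [
    (0.0, 0.0, 0.00),  # token 0: stand on flat floor
    (0.9, 0.5, 0.45),  # token 1: sit on a chair-height surface
    (0.4, 0.1, 0.20),  # token 2: step onto a low ledge
]


def tokenize(pose: Sequence[float], surface_height: float) -> int:
    # Appending the height lets the same pose map to different tokens
    # depending on the geometry under the actor.
    return nearest_token(list(pose) + [surface_height], CODEBOOK)


same_pose = (0.65, 0.3)
print(tokenize(same_pose, 0.45))  # chair-height surface
print(tokenize(same_pose, 0.20))  # low ledge
```

In this toy setup the same half-bent pose resolves to the "sit" token over a chair-height surface and to the "step" token over a low ledge, which is the intuition behind calling the tokens geometry-grounded.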

3. Why This is a Game-Changer

Think of the previous methods as trying to drive a car by looking at a 3D scan of every pebble on the road. It works, but the engine (the computer) overheats, and the car moves slowly.

SceMoS is like driving with a GPS and a road map:

  • Efficiency: It uses compact 2D images instead of massive 3D point clouds or voxel grids. According to the paper, this cuts the number of trainable parameters by over 50%.
  • Speed: It plans the route and executes the steps separately, so neither stage has to process a full 3D volume, which keeps the pipeline lightweight.
  • Realism: Because the "Builder" checks the local heightmap constantly, the actor's feet stay planted on the ground, and they don't clip through furniture.
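
For a rough sense of why a 2D input is lighter than a 3D one, the back-of-the-envelope sketch below compares the number of cells in a regular grid at the same per-axis resolution. The resolution of 64 is purely illustrative and not a value from the paper, and this input-size ratio is a separate quantity from the paper's reported reduction in trainable parameters.

```python
def grid_cells(resolution: int, dims: int) -> int:
    """Number of cells in a regular grid with `resolution` cells per axis."""
    return resolution ** dims


RES = 64  # illustrative resolution, not taken from the paper
voxels = grid_cells(RES, 3)  # dense 3D voxel grid
pixels = grid_cells(RES, 2)  # 2D heightmap or BEV image

print(f"3D voxel grid: {voxels} cells")
print(f"2D heightmap:  {pixels} cells")
print(f"ratio: {voxels // pixels}x")
```

At equal resolution the 3D grid has as many times more cells as there are cells along one axis, so the gap widens as the resolution grows.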

The Bottom Line

SceMoS shows that you don't need heavy 3D volumetric data to simulate realistic human movement in a room. By using a drone's eye view for the big plan and a hiker's topographic map for the footwork, the system creates lifelike animations with accurate contact, and it does so with a pipeline that is smart, fast, and surprisingly simple.

It's the difference between trying to memorize the entire Library of Congress to find one book and just using a card catalog and a map to get there.
