Imagine you want to teach a robot to dance based on a story you tell it. You say, "The robot walks to the door, opens it, and does a spin."
For a long time, AI researchers tried to solve this by giving the robot a single, giant "thought" for every split-second of the dance. It was like trying to describe a whole orchestra's performance by writing down just one number for the entire room. The AI had to guess which number meant the violin, which meant the drums, and how they moved together. This led to clumsy, jittery dances where the robot's feet might slide across the floor or its arms would twist unnaturally.
PRISM is a new system that changes the rules of the game. Instead of one giant thought, it gives the robot a personalized instruction card for every single joint in its body (shoulders, elbows, knees, etc.) for every moment in time.
Here is how PRISM works, broken down into three simple concepts:
1. The "Individual Seat" Analogy (Joint-Factorized Latent Space)
The Old Way: Imagine a crowded bus where everyone is squished into one giant pile. To get to your seat, you have to push through the whole mess. The AI had to do this with every movement, trying to untangle the left foot from the right hip in a messy pile of data.
The PRISM Way: PRISM builds a 2D grid of individual seats.
- Time is the row (Frame 1, Frame 2, Frame 3...).
- Joints are the columns (Head, Left Arm, Right Leg...).
- Every single joint gets its own dedicated "seat" with its own instruction card.
Because the AI doesn't have to guess which part of the data belongs to which joint, it can focus purely on making that specific joint move beautifully. It's like a conductor giving a specific note to every musician in an orchestra, rather than shouting "Play music!" at the whole group. This alone made the movements 18 times more accurate than previous methods.
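The "individual seat" idea can be sketched in a few lines of code. This is a toy illustration, not PRISM's actual implementation: the joint count, latent size, and joint index below are all made up for the example. The point is that a (time × joint × features) grid makes "which joint is this?" an index lookup instead of a guess.

```python
import numpy as np

NUM_FRAMES = 4   # rows of the grid: time steps
NUM_JOINTS = 22  # columns: head, left arm, right leg, ...
LATENT_DIM = 8   # size of each joint's "instruction card"

# Old way: one entangled vector per frame -- the model must
# guess which numbers belong to which joint.
entangled = np.random.randn(NUM_FRAMES, NUM_JOINTS * LATENT_DIM)

# Grid of seats: an explicit (time, joint, features) layout.
grid = entangled.reshape(NUM_FRAMES, NUM_JOINTS, LATENT_DIM)

# Now "the left elbow at frame 2" is just an index, not a guess
# (joint index 5 is a hypothetical example):
card = grid[2, 5]
print(card.shape)  # (8,)
```

The data is the same either way; only the layout changes. That is the sense in which PRISM wins by organizing rather than enlarging.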
2. The "Clean vs. Dirty" Analogy (Noise-Free Condition Injection)
The Problem: Usually, if you want an AI to continue a dance from a specific pose, or to switch from "walking" to "running," you need a different AI model or a complex patchwork of tools. It's like trying to change a movie's plot halfway through by splicing in a different film reel; the edges often look jagged.
The PRISM Solution: PRISM uses a clever trick called "Noise-Free Condition Injection."
Imagine the AI is an artist painting a picture.
- The "Dirty" parts: The parts of the picture the AI needs to invent (the future dance moves) are covered in fog (noise). The AI's job is to wipe the fog away to reveal the image.
- The "Clean" parts: The parts you already know (the starting pose or the text prompt) are left fog-free.
PRISM can look at the "clean" parts (the starting pose) and the "foggy" parts (the future moves) at the exact same time. It knows exactly which parts to keep and which parts to paint. This allows it to:
- Start a dance from a text description.
- Start a dance from a specific photo of a pose.
- Seamlessly chain them together: It can finish a dance, take the last few frames, and immediately use them as the "clean" starting point for the next dance, creating an infinite stream of motion without any jarring jumps.
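The fog analogy maps onto a simple masking trick. Here is a minimal sketch of the idea, with made-up shapes and a simplified noise schedule (not PRISM's real code): a boolean mask marks which frames are "clean" conditions, and before each denoising step those frames are overwritten with their noise-free values, so the model always sees them sharply.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FRAMES, DIM = 6, 4

motion = rng.standard_normal((NUM_FRAMES, DIM))  # ground-truth motion
is_clean = np.zeros(NUM_FRAMES, dtype=bool)
is_clean[:2] = True  # first two frames: the known starting pose

def add_noise(x, t):
    """Simplified diffusion forward step: mix signal with noise."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(1 - t) * x + np.sqrt(t) * noise

noisy = add_noise(motion, t=0.9)  # heavy "fog"

# The injection trick: re-insert the condition frames noise-free
# before every denoising step.
model_input = np.where(is_clean[:, None], motion, noisy)

# Condition frames reach the model untouched...
assert np.allclose(model_input[:2], motion[:2])
# ...while the frames to be generated are still foggy.
assert not np.allclose(model_input[2:], motion[2:])
```

Chaining falls out for free: after generating a segment, flip its last few frames to `is_clean = True` and they become the starting condition for the next segment.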
3. The "Rehearsal" Analogy (Self-Forcing)
The Problem: When you chain many dance segments together, small mistakes add up. If the AI makes a tiny error in the first 10 seconds, by minute 5, the robot might be walking backward or floating in the air. This is called "drift."
The PRISM Solution: The researchers taught the AI to practice with its own mistakes.
Usually during training, the AI is shown the perfect answer after every step (a setup known as "teacher forcing," like a teacher correcting a student instantly). But at generation time, the AI has to rely on its own previous moves.
PRISM uses a technique called Self-Forcing. During training, the AI generates a move, makes a mistake, and then has to continue the dance based on that mistaken move. It learns to correct itself and stay on track, just like a dancer who trips but recovers smoothly instead of falling over. This allows it to generate dances that are 10+ times longer than what it was originally trained on, without falling apart.
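The contrast between the two training styles can be shown with a toy rollout. This is purely illustrative (PRISM's real training loop is far more involved): the "model" here just copies its input and adds one unit of error, so we can watch whether errors compound.

```python
def model_step(prev_segment):
    # Pretend generator: copies its input and adds one unit of error.
    return [x + 1 for x in prev_segment]

ground_truth = [[0, 0, 0]] * 5  # the "perfect" dance: all zeros

# Teacher forcing: every step restarts from the perfect answer,
# so each output is off by exactly one unit -- error never compounds.
teacher_forced = [model_step(seg) for seg in ground_truth[:-1]]

# Self-forcing: each step continues from the model's OWN output,
# exactly as it must at generation time, so errors pile up.
segment = ground_truth[0]
self_forced = []
for _ in range(4):
    segment = model_step(segment)
    self_forced.append(segment)

print(teacher_forced[-1][0])  # 1  (constant error)
print(self_forced[-1][0])     # 4  (accumulated drift)
```

Self-forcing trains the model on exactly those drifted continuations, which is why it learns to pull itself back on track instead of compounding the error.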
The Big Picture
PRISM is a single, unified "Motion Foundation Model" that can:
- Turn text into dance.
- Turn a photo into a dance.
- Chain hundreds of actions together into a long story.
It achieves this not by making the AI "bigger" or "smarter" in a general sense, but by organizing the data better (giving every joint its own seat) and teaching it how to handle its own mistakes. The result is human motion that is smoother, more realistic, and capable of telling long, complex stories without breaking a sweat.