Controllable Text-to-Motion Generation via Modular Body-Part Phase Control

This paper proposes a plug-and-play framework called Modular Body-Part Phase Control that enables intuitive, fine-grained editing of specific body parts in text-to-motion generation. A compact, scalar-based phase interface decouples localized dynamics from the global motion backbone while preserving overall coherence.

Minyue Dai, Ke Fan, Anyi Rao, Jingbo Wang, Bo Dai

Published 2026-03-23

Imagine you are directing a movie with a digital actor. You give the actor a simple instruction: "Walk across the room and wave hello."

In the past, if you wanted to tweak that performance—say, make the wave bigger, or have the actor start waving a split-second earlier—you were stuck. You either had to re-record the whole scene from scratch, or you had to act like a robot, manually adjusting the coordinates of every single finger and joint for every single frame. It was like trying to fix a typo in a book by rewriting the entire novel.

This paper introduces a new, much smarter way to do this. The authors call it Modular Body-Part Phase Control.

Here is the simple breakdown of how it works, using some everyday analogies:

1. The Problem: The "All-or-Nothing" Approach

Current AI motion generators are great at creating a full scene based on text, but they are terrible at fine-tuning. If you ask for a "big wave," the AI might make the whole body swing wildly, or it might ignore you completely. Existing methods that try to fix this are like trying to steer a ship by pushing on individual rivets; it's too complicated and messy for a human to use easily.

2. The Solution: The "Musical Metronome"

The authors realized that human movement is rhythmic. When you walk, your legs swing back and forth like a pendulum. When you wave, your arm moves in a repeating loop.

In physics and music, we describe these loops using Phase. Think of a song:

  • Amplitude (A): How loud the music is (or how big the wave is).
  • Frequency (F): How fast the beat is (or how fast the leg steps).
  • Phase Shift (S): When the beat starts (or when the wave begins).
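The three knobs map directly onto a standard sinusoid. Here is a minimal sketch of that mapping (illustrative only; the paper's actual phase parameterization may differ in detail):

```python
import math

def phase_signal(t, amplitude, frequency, shift):
    """Value of one body part's phase signal at time t (seconds).

    amplitude -- how big the motion is (the "loudness")
    frequency -- cycles per second (the "tempo")
    shift     -- phase offset in radians (when the cycle starts)
    """
    return amplitude * math.sin(2 * math.pi * frequency * t + shift)

# A wave with "loudness 5, speed 3, starting at offset 2":
value = phase_signal(t=0.0, amplitude=5.0, frequency=3.0, shift=2.0)
```

The key point is that three scalars fully describe the rhythm of a body part, which is what makes them usable as sliders.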

The paper's big idea is: Instead of controlling the actor's joints directly, let's control the "music" of their body parts.

3. How It Works: The "Radio Tuner"

The system works in three simple steps:

  • Step 1: The Translator (The Phase Extractor)
    The AI looks at a reference motion (like a video of someone waving) and translates that movement into a simple "musical score" for each body part. It doesn't care about the specific angles of the elbow; it just says, "The right arm is waving with a loudness of 5, a speed of 3, and it starts at beat 2."

  • Step 2: The Editor (The User Interface)
    This is the magic part. You, the user, get a simple slider for each body part.

    • Want the wave bigger? Slide the Amplitude up.
    • Want the walk faster? Slide the Frequency up.
    • Want the hand to wave before the person speaks? Slide the Phase Shift back.

    It's like using a volume knob or a tempo slider on a music player. You aren't rewriting the song; you're just adjusting the knobs.

  • Step 3: The Conductor (The Phase ControlNet)
    The AI takes your new "knob settings" and injects them into the main generator. Think of the main generator as a talented orchestra playing a symphony. Your "Phase Control" is a conductor standing on a podium, gently tapping the violin section (the right arm) to play louder, while telling the cello section (the legs) to keep playing exactly as before. The rest of the body stays perfectly natural and coherent.

4. Why This is a Game-Changer

  • It's Plug-and-Play: You can attach this "conductor" to almost any existing motion AI (whether it uses diffusion or flow models) without breaking the original system.
  • It's Predictable: If you turn the "Speed" knob up by 10%, the motion gets exactly 10% faster. No surprises.
  • It's Localized: You can make the right hand wave wildly while the left hand stays perfectly still, and the legs keep walking normally. The AI understands that these are separate "instruments" in the body's orchestra.
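The "predictable" property is just linear arithmetic on the knobs. Assuming the linear response the authors describe, a 10% turn of the frequency slider yields exactly 10% more steps in the same time window:

```python
# Illustrative arithmetic, assuming the linear knob response the
# authors describe (not a measured result from the paper).
base_frequency = 2.0   # steps per second
duration = 5.0         # seconds of motion
knob = 1.10            # "+10%" on the frequency slider

base_steps = base_frequency * duration             # 10 steps
faster_steps = (knob * base_frequency) * duration  # 11 steps
speedup = faster_steps / base_steps - 1.0          # 0.10 -> 10% faster
```

Contrast this with prompt editing, where "walk a bit faster" might change the motion by 5%, 50%, or not at all.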

The Bottom Line

This paper gives us a remote control for human motion. Instead of being a puppet master pulling thousands of invisible strings, we can now just turn a few dials to make a digital character's arm wave bigger, their walk faster, or their gesture happen sooner—all while keeping the rest of their body moving naturally. It turns complex, scary math into simple, intuitive sliders.
