HybridMimic: Hybrid RL-Centroidal Control for Humanoid Motion Mimicking

Imagine you are teaching a robot to dance. You have two main ways to do this:

The "Gymnast" Approach (Standard RL): You throw the robot into a virtual gym and tell it, "Just figure out how to move your legs to match this video of a human dancer." The robot tries millions of times, learning by trial and error. It gets really good at the specific dance moves in the simulation. But, when you put it on the real floor, if the lighting is slightly different or the floor is a bit slippery, the robot might stumble. Why? Because it learned the moves but didn't really understand the physics of why those moves work. It's like a gymnast who memorized a routine but doesn't understand balance; if the beam wobbles, they fall.
The "Engineer" Approach (Model-Based Control): You give the robot a strict set of math rules: "When your left foot touches the ground, apply exactly this much force at this exact second." This is very stable and physically grounded. But it's rigid. If the robot needs to kick a ball, or trips over a rock, the pre-written rules break because nobody planned for that specific moment. It's like a dancer who can only perform if the music never changes tempo and the floor is never uneven.

Enter HybridMimic: The "Smart Dancer"

This paper introduces HybridMimic, a new way to teach robots that combines the best of both worlds. Think of it as a robot that has a Gymnast's intuition but is guided by an Engineer's brain.

Here is how it works, using simple analogies:

1. The Two-Part Brain

Instead of just one brain trying to do everything, HybridMimic splits the job:

The "Gymnast" (The AI Policy): This is the part that learns from watching humans. It looks at the dance video and says, "Okay, I need to lift my leg high and lean forward." It decides the goal.
The "Engineer" (The Centroidal Controller): This is the part that understands physics. It takes the Gymnast's goal and asks, "To lift that leg without falling over, exactly how much force do I need to push against the ground? And when exactly does my foot touch the floor?"

2. The Magic of "Guessing the Touch"

The biggest problem with the "Engineer" approach is that it usually needs a pre-written schedule: "Touch left foot at 1.0 seconds, right foot at 1.5 seconds." But in real life, you don't know exactly when you'll step on a rock or slip.

HybridMimic is special because the Gymnast learns to guess when the feet will touch the ground. It predicts, "I think my foot will hit the floor now," and tells the Engineer. The Engineer then instantly calculates the perfect physics-based force for that exact moment.

Analogy: Imagine a dancer who can feel the floor before they even step on it. They don't need a script telling them when to step; they just know and adjust their balance instantly.

3. The "Physics Check" (Rewards)

How do we teach the Gymnast to be a good partner to the Engineer? The paper uses special "rewards" (like points in a video game).

If the Gymnast tells the Engineer to push with a force that would break the robot's motors, the robot gets a "bad score."
If the Gymnast predicts the foot touch correctly, and the Engineer calculates a smooth force, they get a "good score."
This teaches the AI to stop guessing wildly and start making physically realistic guesses.

Why Does This Matter? (The Results)

The researchers tested this on a real humanoid robot, the Booster T1. They asked it to walk, side-step, step backward, and kick a ball.

The Old Way (Just the Gymnast): The robot could do the moves in the computer simulation, but when they tried it on the real robot, it was a bit shaky and missed its target position by a noticeable amount.
The Hybrid Way: The robot was much steadier. Its average error in tracking the target body position was 13% lower than with the old method.

The Big Takeaway:
HybridMimic is like giving a robot a "gut feeling" for movement (learned from humans) but backing it up with a "safety net" of physics math. This means the robot can learn complex, dynamic moves like kicking or dancing, and is less likely to fall over when the real world gets messy. It makes robots steadier and more accurate on real hardware, without needing a human to write a script for every single step.

1. Problem Statement

The paper addresses the challenge of motion mimicking for humanoid robots using Reinforcement Learning (RL). While standard RL frameworks (often using Proportional-Derivative or PD controllers) demonstrate impressive agility in simulation, they face two critical issues during real-world deployment:

Lack of Physical Reasoning: Standard RL policies often bypass explicit reasoning about robot dynamics. When the robot encounters out-of-distribution environments or disturbances, the policy may generate physically infeasible commands, widening the sim-to-real performance gap.
Limitations of Model-Based Control: Traditional model-based controllers (e.g., centroidal dynamics) offer physical grounding but typically rely on predefined, hand-crafted contact schedules (knowing exactly when feet touch the ground). This rigidity limits their ability to handle complex, non-periodic, or "in-the-wild" motions where contact timing is difficult to specify a priori.

The goal is to create a control architecture that combines the adaptability of RL with the physical feasibility of model-based control, without relying on predefined contact schedules.

2. Methodology: HybridMimic

The authors propose HybridMimic, a hybrid control framework that integrates an RL policy with a Centroidal-Model-Based Controller. The system operates at two levels:

A. Control Architecture

The total motor torque $u$ is the sum of a feedforward torque ( $u_{FF}$ ) and a PD torque ( $u_{PD}$ ):
$u = u_{FF} + u_{PD}$

PD Term ( $u_{PD}$ ): Generated by the policy network to track joint positions ( $q_{cmd}$ ). This handles feedback and local stabilization.
Feedforward Term ( $u_{FF}$ ): Generated by a Centroidal Controller based on a Single Rigid Body (SRB) model. This term explicitly accounts for ground reaction forces (GRF) and centroidal dynamics.
- The policy network outputs continuous contact states ( $w_i$ ) and commanded centroidal velocities ( $\dot{x}_{cmd}$ ).
- A Quadratic Programming (QP) solver computes the optimal ground reaction wrench ( $F^*$ ) that satisfies the centroidal dynamics equation ( $\ddot{x} = \hat{g} + AF$ ) while penalizing the difference between the torque implied by $F$ and a reference torque.
- The resulting $F^*$ is converted into the feedforward torque $u_{FF}$ using the robot's full-order dynamics.
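
The structure above can be sketched in a few lines of Python. This is a heavily simplified illustration, not the authors' controller: it keeps only the vertical component of the centroidal equation, replaces the QP with a closed-form regularized least-squares solve, and assumes the contact states $w_i$ enter as a per-foot force penalty. The function names, the 30 kg mass, the weights `eps` and `rho`, and the PD law with damping on measured joint velocity are all assumptions made for the example.

```python
def allocate_vertical_grf(w, z_acc_cmd, mass=30.0, g=9.81, eps=1e-6, rho=1.0):
    """Split the vertical ground force between feet using contact states w_i in [0, 1].

    Minimises (sum(f)/mass - g - z_acc_cmd)^2 + sum_i r_i * f_i^2 subject to f_i >= 0,
    with r_i = eps + rho * (1 - w_i), so a foot with low w_i is discouraged from pushing.
    All parameter values here are hypothetical.
    """
    b = z_acc_cmd + g                  # total force per unit mass the feet must supply
    if b <= 0.0:                       # feet can only push, so the best answer is zero
        return [0.0 for _ in w]
    a = 1.0 / mass
    r = [eps + rho * (1.0 - wi) for wi in w]
    s = sum(a * a / ri for ri in r)
    # Closed-form solution of the regularised least-squares problem (Sherman-Morrison).
    return [(a / ri) * b / (1.0 + s) for ri in r]


def total_torque(u_ff, q_cmd, q, qd, kp, kd):
    """u = u_FF + u_PD, with u_PD = kp * (q_cmd - q) - kd * qd for each joint (assumed form)."""
    return [ff + kp * (qc - qi) - kd * qdi
            for ff, qc, qi, qdi in zip(u_ff, q_cmd, q, qd)]


both_feet = allocate_vertical_grf([1.0, 1.0], z_acc_cmd=0.0)   # weight shared equally
left_only = allocate_vertical_grf([1.0, 0.0], z_acc_cmd=0.0)   # left foot carries the load
u = total_torque([1.0], [0.5], [0.0], [0.1], kp=20.0, kd=2.0)
```

Because $w_i$ is continuous, the load shifts gradually from one foot to the other as the policy changes its prediction, which is the property that removes the need for a fixed contact schedule.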

B. Key Innovations in Policy Design

Contact-Schedule-Free Formulation: Unlike previous hybrid methods, the policy predicts continuous contact states ( $w_i$ ) based on observations. This allows the controller to handle smooth, dynamic contact transitions (e.g., kicking, stumbling) without predefined timing.
Physics-Informed Rewards: To ensure the policy utilizes the centroidal controller correctly, the authors introduce specific reward terms:
- GRF Reward: Minimizes the error between the predicted ground reaction wrench and the simulated ground truth.
- Contact State Reward: Penalizes deviations between the predicted contact state and the simulator's true contact state.
- Torque Limit Reward: Encourages the policy to implicitly respect motor torque limits via the reference torque output.
- Centroidal Acceleration Reward: Ensures the simulated acceleration matches the commanded acceleration.
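
As a rough sketch of how the four reward terms listed above might be computed, the following Python maps each error to a value in (0, 1] with an exponential kernel. The kernel shape, the `sigma` scales, and every function and argument name are assumptions chosen for illustration; the summary does not state the exact functional form the authors use.

```python
import math


def exp_kernel(sq_err, sigma):
    """Map a squared error to (0, 1]; equals 1.0 when the error is zero."""
    return math.exp(-sq_err / (sigma * sigma))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def physics_rewards(F_pred, F_sim, w_pred, c_sim, tau_ref, tau_max, acc_cmd, acc_sim):
    """Return the four physics-informed terms as a dict (hypothetical scales)."""
    # Only the part of each reference torque that exceeds the limit is penalised.
    over = [max(0.0, abs(t) - tau_max) for t in tau_ref]
    return {
        "grf": exp_kernel(sq_dist(F_pred, F_sim), sigma=50.0),
        "contact": exp_kernel(sq_dist(w_pred, c_sim), sigma=0.5),
        "torque_limit": exp_kernel(sum(o * o for o in over), sigma=5.0),
        "centroidal_acc": exp_kernel(sq_dist(acc_cmd, acc_sim), sigma=2.0),
    }


perfect = physics_rewards([0.0, 0.0, 300.0], [0.0, 0.0, 300.0], [1.0, 0.0], [1.0, 0.0],
                          [10.0, -5.0], 40.0, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
wrong_contact = physics_rewards([0.0, 0.0, 300.0], [0.0, 0.0, 300.0], [0.0, 1.0], [1.0, 0.0],
                                [60.0, -5.0], 40.0, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

In this sketch a wrong contact prediction or an over-limit reference torque lowers only its own term, so each physical inconsistency is penalised separately.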

C. Training Setup

Algorithm: Proximal Policy Optimization (PPO) with an asymmetric actor-critic.
Observations: The actor uses baseline observations (joint states, velocities, reference motion), while the critic has access to privileged information (true base position, orientation).
Domain Randomization: Applied to mass, inertia, Jacobians, and sensor noise to improve robustness.
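
A minimal sketch of per-episode domain randomization is shown below. The quantities match those named above, but the ranges, the choice of uniform scaling, and all names are assumptions made for illustration rather than the paper's settings.

```python
import random


def sample_randomization(rng):
    """Draw one set of perturbations for an episode (hypothetical ranges)."""
    return {
        "mass_scale": rng.uniform(0.9, 1.1),
        "inertia_scale": rng.uniform(0.8, 1.2),
        "jacobian_scale": rng.uniform(0.95, 1.05),
        "sensor_noise_std": rng.uniform(0.0, 0.02),
    }


def noisy(reading, std, rng):
    """Add zero-mean Gaussian noise to each element of a sensor reading."""
    return [x + rng.gauss(0.0, std) for x in reading]


rng = random.Random(0)                      # seeded so a run can be reproduced
params = sample_randomization(rng)
obs = noisy([0.1, -0.2, 0.3], params["sensor_noise_std"], rng)
```

Resampling `params` at every episode reset exposes the policy to a family of slightly different robots, which is the usual motivation for this technique.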

3. Key Contributions

Contact-Schedule-Free Hybrid Control: The framework eliminates the need for hand-crafted contact schedules by learning continuous contact states directly from the policy, enabling robust mimicking of complex, non-periodic motions.
Physics-Informed Reward Design: Novel reward terms guide the policy to generate physically consistent ground reaction forces and contact states, ensuring the centroidal controller operates within feasible physical constraints.
Real-World Validation: The method was deployed on the Booster T1 humanoid robot, where it reduced average base position tracking error by 13% relative to the BeyondMimic baseline.

4. Results

The authors evaluated HybridMimic against a standard RL baseline (BeyondMimic) and ablated variants (fixed contact schedules, no reference torque cost) on the Booster T1 robot.

Sim-to-Real Performance:
- HybridMimic achieved a 13% reduction in average base position tracking error compared to the BeyondMimic baseline across various tasks (walking, side-stepping, backwards stepping, and kicking).
- The robot exhibited smoother motion trajectories and reduced jitter compared to the baseline.
Sim-to-Sim Analysis:
- HybridMimic outperformed variants with fixed contact schedules (HybridMimic+FCS) in complex tasks like running in circles, where contact timing is irregular.
- The ablation study confirmed that both the learned contact states and the reference torque cost are essential for robust performance.
Interpretability: The system allows for transparent tuning of physical parameters (e.g., velocity tracking gains $K_{vel}$ ), making the deployment process more deterministic and explainable than pure black-box RL.
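
For readers who want to reproduce the headline metric on their own logs, the sketch below shows one plausible way to compute an average base position tracking error and the relative reduction between two controllers. The metric definition (mean Euclidean distance per time step) and the function names are assumptions, and the trajectories are made-up numbers chosen only to exercise the code; they are not data from the paper.

```python
import math


def mean_base_position_error(actual, reference):
    """Mean Euclidean distance between actual and reference base positions."""
    errors = [math.dist(p, r) for p, r in zip(actual, reference)]
    return sum(errors) / len(errors)


def percent_reduction(baseline_err, new_err):
    """Relative improvement of new_err over baseline_err, in percent."""
    return 100.0 * (baseline_err - new_err) / baseline_err


# Illustrative trajectories only (metres); not measurements from the paper.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
baseline_run = [(0.0, 0.20, 0.0), (1.0, 0.20, 0.0)]
hybrid_run = [(0.0, 0.15, 0.0), (1.0, 0.15, 0.0)]

baseline_err = mean_base_position_error(baseline_run, reference)
hybrid_err = mean_base_position_error(hybrid_run, reference)
reduction = percent_reduction(baseline_err, hybrid_err)
```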

5. Significance

HybridMimic represents a significant step forward in embodied intelligence for humanoid robots. By bridging the gap between data-driven RL and physics-based model control, it offers a solution that is:

Physically Feasible: Generates commands that respect dynamics and torque limits, reducing the risk of hardware damage.
Adaptable: Capable of handling diverse, dynamic motions without manual retuning of contact schedules.
Robust: Demonstrates superior sim-to-real transfer, a critical bottleneck in deploying RL for real-world robotics.

The work suggests that integrating model-based principles (specifically centroidal dynamics) into RL policies via learned modulation is a viable path toward achieving agile, human-like locomotion in unstructured environments. Future work aims to integrate model-based swing-leg control to further enhance foot placement accuracy.