Preference-Conditioned Multi-Objective RL for Integrated Command Tracking and Force Compliance in Humanoid Locomotion

This paper proposes a preference-conditioned multi-objective reinforcement learning framework that lets a single humanoid locomotion policy dynamically balance accurate command tracking against compliant responses to external forces. The approach is validated through stable training and successful deployment in both simulation and real-world experiments.

Tingxuan Leng, Yushi Wang, Tinglong Zheng, Changsheng Luo, Mingguo Zhao

Published 2026-03-10

Imagine a humanoid robot as a very strong, very fast runner who has been trained to ignore everything around them. If you try to push them, they stiffen up like a brick wall to stay on their path. While this makes them great at running in a straight line, it makes them terrible at interacting with humans. If a person tries to gently guide them, the robot fights back, potentially knocking the person over or acting dangerously rigid.

This paper introduces a new way to train these robots so they can be both a determined runner and a gentle dance partner, depending on what you need at that moment.

Here is the breakdown of their solution, using simple analogies:

1. The Problem: The "Stubborn vs. Wobbly" Dilemma

Traditionally, robot trainers apply random pushes to the robot throughout training to make it "tough": the policy learns to treat every external force as a disturbance to reject.

  • The Result: The robot becomes a "stubborn mule." It follows its instructions perfectly and resists any push.
  • The Flaw: If a human tries to guide the robot (like holding its hand to walk around a corner), the robot fights back. It's too stiff to be safe or natural in a human environment.

2. The Solution: The "Volume Knob" for Behavior

The authors created a system where you don't need to train two different robots. Instead, you train one robot that has a "Volume Knob" (called a Preference Input).

  • Turn the knob to "Command": The robot acts like a race car. It ignores your gentle pushes and focuses entirely on following its GPS (velocity commands).
  • Turn the knob to "Compliance": The robot acts like a soft, yielding dance partner. If you push it, it goes with the flow. It lets you guide it easily.
  • Turn the knob to "Middle": The robot finds a balance. It tries to follow its path but will gently yield if you push hard enough.

You can slide this knob anywhere in between, and the robot instantly changes its personality without needing to be retrained.
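The knob idea can be sketched in a few lines. This is an illustrative simplification, not the paper's actual reward terms or network: the function names, the linear blend, and the observation layout are assumptions. The key points it shows are that (a) the two objectives are mixed by a scalar preference, and (b) that same scalar is fed to the policy as an extra observation, so one set of weights covers the whole trade-off curve.

```python
import numpy as np

def blended_reward(track_reward, comply_reward, preference):
    """Blend two objectives with a scalar preference in [0, 1].

    preference = 1.0 -> pure command tracking (the "race car")
    preference = 0.0 -> pure force compliance (the "dance partner")
    """
    return preference * track_reward + (1.0 - preference) * comply_reward

class PreferenceConditionedPolicy:
    """One policy for all preferences: the knob is part of the observation."""

    def observation(self, proprio, command, preference):
        # Concatenate the preference scalar onto the normal observation
        # vector, so the network can condition its behavior on it.
        return np.concatenate([proprio, command, [preference]])
```

Because the preference is just another input, sliding the knob at deployment time changes behavior instantly, with no retraining.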

3. How They Taught It: The "Translator" Trick

Here is the tricky part: Robots usually can't "feel" a human pushing them unless they have expensive, delicate sensors on their skin. But the robot needs to know it's being pushed to comply.

The researchers used a clever Teacher-Student trick:

  • The Teacher (in Simulation): The robot is trained in a video game world where the computer knows exactly how hard the human is pushing. The teacher learns the secret connection between "being pushed" and "how the robot's body moves."
  • The Student (on the Real Robot): The real robot only has cameras and internal sensors (like knowing its own joint angles). It doesn't know the push force directly.
  • The Magic: The system forces the "Student" to guess what the "Teacher" knows. By looking at how its body is moving, the robot learns to infer (guess) that it is being pushed, even without a force sensor. It's like learning to feel a breeze just by watching the leaves move, rather than feeling the wind on your face.
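The Teacher-Student trick above can be sketched as a regression problem. Everything below is a toy illustration with tiny linear "encoders" and hand-written gradient steps; the dimensions, the linear maps, and the training loop are assumptions for clarity (the paper uses neural networks). What it demonstrates is the core idea: the student, which never sees the push force, is trained to reproduce the teacher's privileged encoding from onboard observation history alone.

```python
import numpy as np

OBS_DIM, FORCE_DIM, HIST, LATENT = 6, 2, 3, 4
rng = np.random.default_rng(0)

# Teacher encoder: sees privileged input [observation, true push force],
# which only exists inside the simulator.
W_teacher = rng.normal(size=(LATENT, OBS_DIM + FORCE_DIM))

def teacher_latent(obs, force):
    return W_teacher @ np.concatenate([obs, force])

# Student encoder: sees only a short history of onboard observations
# (joint angles etc.), with no direct access to the force.
W_student = np.zeros((LATENT, OBS_DIM * HIST))

def student_latent(W, obs_history):
    return W @ obs_history

def distill_step(W, obs_history, target, lr=0.01):
    """One gradient step pulling the student latent toward the teacher's."""
    pred = student_latent(W, obs_history)
    grad = np.outer(pred - target, obs_history)  # d(MSE)/dW up to a constant
    return W - lr * grad

# Distill on one sample: the student learns to reproduce the teacher's
# privileged signal purely from body-motion observations.
obs = rng.normal(size=OBS_DIM)
push = np.array([5.0, 0.0])           # the "secret" force, teacher-only
obs_hist = rng.normal(size=OBS_DIM * HIST)
target = teacher_latent(obs, push)
W = W_student
for _ in range(500):
    W = distill_step(W, obs_hist, target)
```

After training, `student_latent(W, obs_hist)` matches the teacher's latent, which is the "learning to feel a breeze by watching the leaves" step: the force information has been squeezed into features the real robot can actually compute.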

4. The "Resistance" Analogy

To make the math work, the researchers treated a "pushing force" as equivalent to a change in the commanded "walking speed," so one piece of machinery could handle both.

  • Imagine walking through water. If you push against the water, you slow down.
  • They taught the robot: "If someone pushes you, it's like you are suddenly walking through thick mud. You should slow down or move in the direction of the push, just like you would if you were trying to walk through water."
  • This simple rule allowed the robot to understand that being pushed and being told to stop are mathematically similar problems.
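The force-to-speed equivalence can be written as a one-line admittance-style mapping. This is a minimal sketch of the idea, not the paper's actual formulation: the function name, the linear form, and the `gain` value are assumptions, and the real system works on estimated (not measured) forces.

```python
def compliant_velocity_command(base_command, estimated_force, preference,
                               gain=0.05):
    """Map an estimated push into a velocity offset (admittance-style sketch).

    base_command:    commanded velocity (m/s)
    estimated_force: inferred external force (N), positive along the
                     command axis
    preference:      0 = fully compliant, 1 = fully command-tracking
    gain:            assumed force-to-velocity scale (m/s per N)
    """
    # A fully compliant robot (preference 0) lets the push shift its
    # target velocity; a fully stubborn one (preference 1) ignores it.
    return base_command + (1.0 - preference) * gain * estimated_force
```

With this view, "someone is pushing you forward" and "your speed command just increased" become the same kind of signal, which is exactly why being pushed and being told to stop end up as mathematically similar problems.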

5. The Results: From Lab to Real Life

They tested this on a real, adult-sized robot named Booster T1.

  • The Test: Humans pulled the robot by the hand, shoulder, and neck.
  • The Old Robot: Fought back, required huge strength to move, and was jerky.
  • The New Robot: Moved smoothly with very little effort (about 10 Newtons of force, roughly the weight of a liter of water). It could walk across rough grass, soccer fields, and uneven ground while being gently guided by a human.
  • The Safety Check: They even hit the robot with a heavy ball (up to 5 kg). The robot didn't fall; it just took a step back and absorbed the blow, showing it was still tough enough to handle accidents, even while being "soft."

The Bottom Line

This paper solves the "Brick Wall vs. Jellyfish" problem. It gives us a single robot brain that can be tough as a rock when it needs to navigate a crowd, but soft as jelly when a human needs to guide it. It's a major step toward robots that can safely work and play alongside us in our homes and cities.