APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots

The paper presents APEX, a deep reinforcement learning framework that enables a 29-DoF Unitree G1 humanoid robot to autonomously traverse platforms up to 114% of its leg length by composing perceptive climbing, walking, and reconfiguration skills through a novel ratchet progress reward and robust sim-to-real perception strategies.

Yikai Wang, Tingxuan Leng, Changyi Lin, Shiqi Liu, Shir Simon, Bingqing Chen, Jonathan Francis, Ding Zhao

Published 2026-03-09

Imagine a humanoid robot as a clumsy toddler learning to walk. For a long time, these robots were great at walking on flat ground or stepping over small puddles. But if you put a high table in front of them (one taller than their legs), they would usually try to jump onto it.

The problem with jumping is that it's like a toddler trying to hop onto a kitchen counter: it requires a huge burst of energy, often results in a hard crash, and if the robot misses, it could break its joints or fall over. It's dangerous and inefficient.

APEX is a new system that teaches the robot to stop jumping and start climbing, just like a human would. Here is how it works, broken down into simple concepts:

1. The "Climbing" Mindset

Instead of treating the robot like a machine that only uses its feet, APEX teaches it to use its whole body.

  • The Analogy: Think of a rock climber scaling a wall. They don't just jump; they use their hands, knees, and torso to find holds, shift their weight, and pull themselves up.
  • The Robot's Skills: The robot learns six specific "moves":
    • Climb-up: Using hands and feet to pull itself onto a high platform.
    • Climb-down: Carefully lowering itself back down.
    • Stand-up & Lie-down: Changing its posture (from standing to lying on its stomach) to fit through tight spaces or reposition itself.
    • Walk & Crawl: Moving around once it's on the platform.
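
The six skills form a small repertoire the robot switches between based on what it sees ahead. As a toy illustration only (the real system *learns* this switching end-to-end from terrain observations; the function name, thresholds, and heuristic below are all invented for this sketch):

```python
from enum import Enum, auto

class Skill(Enum):
    CLIMB_UP = auto()
    CLIMB_DOWN = auto()
    STAND_UP = auto()
    LIE_DOWN = auto()
    WALK = auto()
    CRAWL = auto()

def select_skill(height_ahead_m: float, clearance_above_m: float,
                 leg_length_m: float = 0.7, is_lying: bool = False) -> Skill:
    """Toy hand-written selector; APEX learns when to switch instead."""
    if is_lying:
        # Prone with something low overhead: crawl; otherwise get back up.
        return Skill.CRAWL if clearance_above_m < 1.0 else Skill.STAND_UP
    if clearance_above_m < 1.2:
        return Skill.LIE_DOWN            # duck under a low overhang
    if height_ahead_m > 0.5 * leg_length_m:
        return Skill.CLIMB_UP            # ledge too high to just step onto
    if height_ahead_m < -0.5 * leg_length_m:
        return Skill.CLIMB_DOWN          # drop-off ahead
    return Skill.WALK                    # ordinary ground
```

For example, a 0.8 m ledge with plenty of headroom maps to `CLIMB_UP`, while flat ground maps to `WALK`.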

2. The "Ratchet" Reward (The Secret Sauce)

This is the most clever part of the paper. Usually, when you teach a robot with reinforcement learning, you give it a reward when it finishes the task. But for a long, multi-step task like climbing, waiting until the very end to say "Good job!" gives far too sparse a signal: the robot might get stuck halfway and never learn what to do next.

The authors invented a "Ratchet Progress Reward."

  • The Analogy: Imagine a ratchet wrench (the tool mechanics use). It only turns forward; it can't slip backward.
  • How it works: The robot keeps a mental note of its "best progress so far."
    • If the robot moves forward (even a tiny bit) or gets its hand closer to the edge, it gets a tiny reward.
    • If it moves backward or stays in the same spot, it gets a penalty.
    • Crucially, it doesn't care how fast the robot moves, only that it is making genuine progress.
  • Why this matters: This stops the robot from "cheating" by shaking back and forth or rushing into a dangerous jump. It forces the robot to be patient, find a stable handhold, and slowly pull itself up, just like a careful climber.
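
The ratchet idea can be sketched in a few lines. This is a minimal illustration of the concept, not the paper's actual reward code; the class name, scales, and exact penalty scheme are assumptions:

```python
class RatchetProgressReward:
    """High-water-mark progress reward: pays only for *new* progress,
    so oscillating back and forth earns nothing extra."""

    def __init__(self, progress_scale: float = 1.0,
                 stall_penalty: float = 0.05):
        self.best = 0.0                  # the ratchet: never decreases
        self.progress_scale = progress_scale
        self.stall_penalty = stall_penalty

    def __call__(self, progress: float) -> float:
        """`progress` is any task metric that should grow, e.g. the
        torso's advance toward (then onto) the platform edge."""
        gain = progress - self.best
        if gain > 0:
            self.best = progress         # the ratchet clicks forward
            return self.progress_scale * gain
        # Stalled or moved backward: small penalty, independent of speed.
        return -self.stall_penalty


r = RatchetProgressReward()
r(0.10)   # new best -> rewarded for the 0.10 gained
r(0.05)   # slipped back -> penalty; best stays at 0.10
r(0.10)   # merely matched the old best -> still penalized, no double pay
r(0.25)   # genuine new progress -> rewarded only for the 0.15 gained
```

Because only gains over the running best are paid out, shaking back and forth yields nothing, and speed never enters the reward at all.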

3. The "Teacher and Student" System

Learning all these complex moves at once is too hard for one brain. So, the researchers used a two-step process:

  • Step 1: The Teachers. They trained six separate "expert" robots (Teachers). One was an expert at climbing up, another at climbing down, another at standing up, etc. They learned these skills in a virtual world (simulation) where they could fail thousands of times without breaking anything.
  • Step 2: The Student. They created one "Student" robot and taught it to copy all six teachers.
    • The Analogy: Imagine a student taking notes from six different professors (one for math, one for history, one for art). The student learns to look at the situation (the terrain) and decide: "Oh, I'm at a high ledge? I'll use the Climbing Professor's notes. I'm on the ground? I'll use the Walking Professor's notes."
    • The Student learns to switch between these skills smoothly without falling over.
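
To make the idea concrete, here is a deliberately tiny behavior-cloning sketch: six frozen linear "teachers" distilled into one linear student that carries a separate weight block per skill. Everything here (shapes, learning rate, the linear policies themselves) is invented for illustration; the paper's policies are neural networks trained in simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, n_skills = 8, 4, 6

# Frozen "teachers": stand-ins for six trained expert policies.
teachers = [rng.normal(size=(act_dim, obs_dim)) for _ in range(n_skills)]

# Student: one weight block per skill, selected via a one-hot skill code.
W = np.zeros((act_dim, n_skills * obs_dim))

def student_action(obs, skill):
    onehot = np.zeros(n_skills)
    onehot[skill] = 1.0
    return W @ np.kron(onehot, obs)       # uses only that skill's block

def distill_step(obs, skill, lr=0.05):
    """One behavior-cloning step: nudge the student's action toward
    the active expert's action on the same observation."""
    global W
    target = teachers[skill] @ obs        # the expert supplies the label
    err = student_action(obs, skill) - target
    onehot = np.zeros(n_skills)
    onehot[skill] = 1.0
    W -= lr * np.outer(err, np.kron(onehot, obs))  # grad of 0.5*||err||^2
    return float(0.5 * err @ err)

# Cycle through skills on random observations; the loss shrinks to ~0.
for t in range(3000):
    distill_step(rng.normal(size=obs_dim), t % n_skills)
```

After training, the student reproduces each professor's "notes": its action matches whichever teacher the skill code points at.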

4. Seeing the World (The Eyes)

Robots often struggle to see the real world because their cameras get confused by shadows, dust, or their own limbs blocking the view.

  • The Analogy: Imagine trying to walk through a foggy room while wearing glasses that have smudges on them.
  • The Fix: The researchers taught the robot to expect "smudges" (noise and errors) while it was training. They also added a "clean-up" filter for the real world. This means when the robot sees a weird blob of data that looks like a wall but isn't, it knows to ignore it.
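
A minimal sketch of both halves of this trick, with invented noise parameters (the paper's actual noise model and filtering are more involved): corrupt depth images during training so the policy comes to expect smudges, then median-filter out-of-range speckles at deployment.

```python
import numpy as np

def corrupt_depth(depth, rng, noise_std=0.02, hole_p=0.05, blob_p=0.01):
    """Training-time augmentation: Gaussian sensor noise, dropout holes
    (reading = 0), and spurious near-range blobs (phantom obstacles)."""
    d = depth + rng.normal(0.0, noise_std, depth.shape)
    d[rng.random(depth.shape) < hole_p] = 0.0
    d[rng.random(depth.shape) < blob_p] = 0.1
    return d

def clean_depth(depth, d_min=0.2, d_max=3.0, k=3):
    """Deployment-time cleanup: mark out-of-range readings invalid,
    then take a NaN-ignoring k x k median to suppress lone speckles."""
    d = np.where((depth < d_min) | (depth > d_max), np.nan, depth)
    pad = k // 2
    dp = np.pad(d, pad, mode="edge")
    out = np.empty_like(d)
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            out[i, j] = np.nanmedian(dp[i:i + k, j:j + k])
    return out

rng = np.random.default_rng(0)
true_depth = np.full((16, 16), 1.0)        # flat surface 1 m away
noisy = corrupt_depth(true_depth, rng)     # the "smudged glasses" view
recovered = clean_depth(noisy)             # close to the true 1 m again
```

The dropout holes (0 m) and phantom blobs (0.1 m) both fall outside the valid range, so the filter discards them and the median fills each gap from valid neighbors.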

The Result: A Real-World Breakthrough

The team tested this on a Unitree G1, a real humanoid robot with 29 moving joints.

  • The Challenge: They placed a platform 0.8 meters high (about 31 inches). For this robot, that is 114% of its leg length. It's like a human trying to climb onto a table that is taller than their own legs.
  • The Outcome: The robot didn't jump. It didn't fall. It walked up to the edge, used its hands to pull itself up, stood up on the platform, walked across, lay down, stood up again, and climbed back down.
  • Zero-Shot Transfer: The most impressive part? They trained the robot in a computer simulation, and when they put it in the real world, it worked immediately without any extra tuning. It was like the robot woke up in the real world and just knew how to do it.

Summary

APEX is like teaching a robot to be a careful, patient rock climber instead of a reckless jumper. By using a special "progress tracker" (the ratchet) and a "teacher-student" learning method, they created a robot that can safely navigate high, difficult terrain that was previously impossible for machines to handle. It's a giant leap (pun intended) toward robots that can actually help us in messy, real-world environments.