SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning

This paper introduces ReST-RL, a hierarchical reinforcement learning framework that decouples humanoid locomotion from payload stabilization via a residual module, achieving robust, zero-shot sim-to-real object balancing on the Unitree G1 without compromising gait stability.

Anlun Huang, Zhenyu Wu, Soofiyan Atar, Yuheng Zhi, Michael Yip

Published Thu, 12 Ma

Imagine you are walking through a crowded, bumpy street while carrying a tray with a glass of wine and a plate of spaghetti. Your goal is to get to the other side without spilling a drop or dropping the plate.

Now, imagine that you are a robot. Not just any robot, but a two-legged humanoid robot that has to walk, turn, and balance all at once. Every time it takes a step, its body naturally jiggles and sways. If it tries to hold the tray perfectly still while walking, it might trip. If it focuses only on walking, the wine spills.

This is the problem the paper "SteadyTray" solves. Here is how they did it, explained simply:

1. The Problem: The "Jiggly" Walk

Humanoid robots are great at walking, but walking creates vibrations. Think of it like a car driving over a gravel road; the whole car shakes. If you put a cup of coffee on the dashboard of that car, it will spill.

Previous approaches tried to solve this by having one giant "brain" figure out how to walk and keep the tray steady at the same time. It was like asking a student to do advanced calculus while simultaneously juggling three balls. These single-brain controllers often failed, especially when someone bumped into the robot or the robot had to turn quickly.

2. The Solution: The "Coach and the Player" (ReST-RL)

The researchers came up with a clever two-step system called ReST-RL. Instead of one giant brain, they used a "Teacher" and a "Student" approach with a special twist.

  • The Base Policy (The Experienced Walker): First, they trained a robot to be a great walker. This robot knows how to walk, turn, and stay upright on two legs. It's like a professional dancer who knows the steps perfectly. This part is "frozen," meaning we don't change how it walks.
  • The Residual Module (The Stabilizing Coach): Then, they added a second, smaller "brain" on top. This brain doesn't tell the robot how to walk. Instead, it acts like a coach watching the dancer.
    • The coach sees the wine glass wobbling.
    • The coach whispers tiny, quick corrections to the dancer's arms: "Tilt left a little," "Move your hand forward," "Calm down."
    • The dancer (the base walker) keeps doing its dance steps, but the coach's tiny nudges cancel out the wobble.

This separation is key. The robot doesn't have to relearn how to walk; it just learns how to adjust its walk to keep the tray steady.
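The "coach and player" split above can be sketched in a few lines: a frozen base policy produces all joint targets, and a small residual network adds bounded corrections on top of a subset of them. This is a minimal illustration of the residual-action idea, not the paper's implementation; the function names, the joint indices, and the 0.1 residual scale are all assumptions made for the example.

```python
import numpy as np

NUM_JOINTS = 12
ARM_JOINTS = slice(8, 12)  # hypothetical indices of the arm joints

def base_policy(obs):
    # Frozen walking policy: maps observations to joint targets.
    # Stubbed here; in practice this is a trained network whose
    # weights are never updated during residual training.
    return np.tanh(obs[:NUM_JOINTS])

def residual_net(obs):
    # Small trainable head: outputs tiny corrections ("whispers")
    # for the arm joints only. The 0.1 bound is illustrative.
    return 0.1 * np.tanh(obs[:4])

def combined_action(obs):
    action = base_policy(obs)                 # the dance steps stay fixed...
    action[ARM_JOINTS] += residual_net(obs)   # ...only the arms get nudged
    return np.clip(action, -1.0, 1.0)         # keep within actuator limits

obs = np.random.default_rng(0).standard_normal(16)
action = combined_action(obs)
```

Because the residual is small and bounded, the stabilizer can never override the gait; it can only cancel out the wobble.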

3. The Secret Sauce: "Training with a Delay"

One of the smartest tricks in the paper is how they trained the robot. In the real world, cameras and sensors are slow. By the time the robot "sees" the wine glass tipping, a fraction of a second has already passed.

To prepare for this, the researchers intentionally slowed down the robot's vision during training. They made the robot practice while looking at "old" data.

  • The Analogy: Imagine learning to ride a bike while wearing glasses that show you where you were 0.5 seconds ago. It's hard at first, but once you get used to it, you become incredibly good at predicting where you will be.
  • The Result: When they took the glasses off (deployed the robot in the real world), the robot was so good at predicting the future that it could stabilize the tray even when the sensors were slow or when someone pushed it.
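The delay trick can be sketched as a simple FIFO buffer: during training, the policy is always handed the observation from a few control steps ago, so it learns to act on stale data. This is a generic sketch of observation-delay injection, assuming a fixed delay; the class name and the 3-step delay are illustrative, not values from the paper.

```python
from collections import deque

class DelayedObservations:
    """Hand the policy observations from `delay_steps` control steps ago,
    so it learns to compensate for slow sensors before deployment."""

    def __init__(self, delay_steps):
        # Buffer holds the current observation plus `delay_steps` old ones.
        self.buffer = deque(maxlen=delay_steps + 1)

    def step(self, fresh_obs):
        self.buffer.append(fresh_obs)
        # Until the buffer fills, return the oldest observation we have;
        # afterwards this is always the obs from `delay_steps` ago.
        return self.buffer[0]

delayed = DelayedObservations(delay_steps=3)
seen = [delayed.step(t) for t in range(6)]  # policy "sees" t-3 at steady state
```

At deployment the buffer is removed: real sensor latency plays the role the buffer played in simulation, so the policy is already used to "wearing the glasses."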

4. The Results: The "Unitree G1" Test

They tested this on a real robot called the Unitree G1.

  • The Test: They made the robot walk while carrying a tray with a wine glass full of liquid, a coffee cup, and even medical tools.
  • The Chaos: They kicked the robot, pushed the tray, and made it walk fast and slow.
  • The Outcome: The robot kept the tray level. The wine didn't spill. The tools didn't fall. It worked so well that it could handle these tasks without needing to be retrained for every new object.

Why This Matters

This isn't just about robots carrying drinks. It's about making robots useful in our messy, human world.

  • Future Jobs: Imagine a robot waiter in a busy restaurant that never spills a drink, even if a customer bumps into it.
  • Hospitals: Imagine a robot carrying sterile instruments through a crowded hallway without shaking them.
  • Elder Care: Imagine a robot bringing a tray of medicine to an elderly person, navigating around furniture and people without dropping anything.

In short: The paper gives the robot a "dancer" that already knows how to walk and a "coach" that keeps the tray perfectly still, all through a smart, layered learning system that prepares for the real world's delays and bumps.