TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size

Imagine you are trying to move a giant, heavy dining table across a room. If you try to do it alone, you'll likely tip it over or drop it. If you get a group of friends to help, you need to coordinate: "I'll grab the left side, you take the right, and let's walk forward together."

Now, imagine teaching a robot to do this. The hard part isn't just teaching one robot to walk; it's teaching a team of robots to work together, no matter if there are 2 of them or 20, and no matter if the table is round, square, or rectangular.

This paper introduces TeamHOI, a new "brain" for robots that solves this exact problem. Here is how it works, explained simply:

1. The Problem: The "One-Size-Fits-None" Trap

Previous robot training methods were like teaching a choir to sing a song where the number of singers was fixed.

If you trained a robot team for 2 people, it didn't know how to act when 4 people showed up.
If you trained it for 4 people, it got confused with 8.
Also, most robots only learned from watching one person move. But moving a table with 8 people looks very different from moving it with one person. The robots didn't have enough "movies" to learn from.

2. The Solution: The "Universal Team Captain"

TeamHOI creates a single, super-smart policy (a set of rules) that works for any team size. Think of it as a universal translator for teamwork.

The Transformer Brain: Instead of a rigid brain that expects a fixed number of inputs, TeamHOI uses a Transformer (the same tech behind modern AI chatbots). Each robot turns every teammate it can see into a "teammate token," a short note about where that teammate is and which way it is facing. The brain can pay attention to however many tokens it is handed, so whether there are 2 teammates or 8, it knows how to listen and coordinate.
The "Masked" Learning Trick: Since we don't have video footage of 8 people lifting a table together, the researchers used a clever trick. They took videos of one person walking and "masked" (hid) their hands.
- The robot learns to walk like the human (keeping the motion realistic).
- But for the hands, it ignores the human video and instead learns through trial and error to grab the table correctly.
- Analogy: It's like learning to drive a car by watching a video of someone driving, but you cover their hands on the steering wheel. You learn the road rules from the video, but you figure out how to steer the wheel yourself based on the road conditions.

3. The Secret Sauce: "Formation Rewards"

Getting a group of robots to lift a table is hard because they might all crowd on one side, causing the table to flip.

The researchers invented a special "score" called a Formation Reward.
Imagine a magnet that gently pushes the robots to spread out evenly around the table's "balance points" (like the spokes of a wheel).
This reward doesn't care if the table is round or square, or if there are 3 robots or 10. It just says, "Spread out so the weight is balanced," and the robots figure out the rest.

4. The Results: From 2 to 8 (and beyond!)

The researchers tested this in a simulation with human-like robots (humanoids) carrying tables of different shapes.

The Test: They asked the robots to carry the table from point A to point B.
The Outcome:
- Old methods: If you trained them for 2 robots, they failed miserably when you added more. They would bump into each other or drop the table.
- TeamHOI: The same single brain worked for teams of 2, 4, and 8 robots, succeeding in roughly 97.5% to 99.2% of attempts. They moved in sync, like a well-rehearsed dance troupe.
- Heavy Lifting: They even tested with a table that was 5 times heavier. While other methods failed, TeamHOI's 8-robot team still lifted and carried the heavy load together in about 81% of attempts.

Why Does This Matter?

This isn't just about robots carrying tables. It's a breakthrough for:

Robotics: Imagine a warehouse where the number of robots needed changes every day based on how many packages arrive. This system lets you add or remove robots without retraining the whole system.
Video Games & Movies: Instead of animating every single character in a crowd scene manually, directors could use this to generate realistic, coordinated group movements (like a crowd running from a monster or a dance team performing) automatically.

In short: TeamHOI taught robots how to be a flexible, adaptable team that can adjust to different group sizes and table shapes, all by learning a single "universal" set of rules.

Here is a detailed technical summary of the paper "TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size."

1. Problem Statement

The paper addresses the challenge of Cooperative Human-Object Interaction (HOI) in physics-based simulation. While single-agent humanoid control has advanced significantly, extending these capabilities to multi-agent cooperation faces two primary limitations:

Scalability and Flexibility: Existing methods typically rely on fixed-size Multi-Layer Perceptron (MLP) policies or lack explicit inter-agent communication. This restricts them to specific team sizes (e.g., a policy trained for 2 agents cannot function with 4 or 8). Real-world cooperation requires agents to adapt dynamically to varying team compositions without retraining.
Data Scarcity and Motion Diversity: Physics-based HOI frameworks often use Adversarial Motion Priors (AMP) to ensure motion realism by regularizing policies against reference motion data. However, high-quality reference data for cooperative multi-human activities is scarce. Existing approaches rely on single-human demonstrations, which limits the diversity of cooperative behaviors (e.g., agents are forced to mimic a single demonstrator's motion, failing to adapt to complex group dynamics).

2. Methodology: TeamHOI Framework

The authors propose TeamHOI, a unified decentralized framework that learns a single policy capable of handling cooperative HOI tasks with any number of agents ($N$) and varying object geometries.

A. Transformer-Based Policy Network

To overcome the fixed-input limitation of MLPs, TeamHOI employs a Transformer architecture:

Teammate Tokens: Each agent observes its own state (proprioception, goal) and encodes the states of other agents as "teammate tokens." These tokens represent the position, heading, and relative angle of teammates in the observing agent's local frame.
Attention Mechanism: The policy uses self-attention to process the agent's own state and cross-attention to attend to teammate tokens. This allows the network to dynamically scale to any team size ($N$) without changing the network architecture or parameters.
Unified Training: The policy is trained in parallel environments with varying team sizes (2 to 8 agents). To ensure stable training across heterogeneous data, the authors implement team-size-specific advantage normalization for the PPO algorithm, preventing reward scale distortions between different team configurations.
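To make the two ideas above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, randomly initialized weights, and hypothetical feature sizes (`d_in`, `d_tok`, `d_model`); the function names are illustrative. It shows why one set of attention weights can serve any number of teammate tokens, and what normalizing PPO advantages separately per team size might look like.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend_to_teammates(self_feat, teammate_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: one query (the agent) over M teammate tokens.

    self_feat: (d_in,), teammate_tokens: (M, d_tok). Returns (d_model,).
    The weight shapes do not depend on M, so M can change freely.
    """
    q = self_feat @ Wq                      # (d_model,)
    k = teammate_tokens @ Wk                # (M, d_model)
    v = teammate_tokens @ Wv                # (M, d_model)
    scores = k @ q / np.sqrt(q.shape[0])    # (M,) scaled dot products
    return softmax(scores) @ v              # weighted sum of values

def normalize_advantages_per_team_size(advantages, team_sizes, eps=1e-8):
    """Standardize advantages within each team-size group (assumed scheme)."""
    advantages = np.asarray(advantages, dtype=float)
    team_sizes = np.asarray(team_sizes)
    out = np.empty_like(advantages)
    for n in np.unique(team_sizes):
        idx = team_sizes == n
        group = advantages[idx]
        out[idx] = (group - group.mean()) / (group.std() + eps)
    return out

rng = np.random.default_rng(0)
d_in, d_tok, d_model = 6, 4, 8              # hypothetical sizes
Wq = rng.normal(size=(d_in, d_model))
Wk = rng.normal(size=(d_tok, d_model))
Wv = rng.normal(size=(d_tok, d_model))
self_feat = rng.normal(size=d_in)

# The same weights handle 1 teammate (N = 2) and 7 teammates (N = 8).
out_small = attend_to_teammates(self_feat, rng.normal(size=(1, d_tok)), Wq, Wk, Wv)
out_large = attend_to_teammates(self_feat, rng.normal(size=(7, d_tok)), Wq, Wk, Wv)

# Advantages on very different scales end up comparable after grouping.
adv = normalize_advantages_per_team_size([1.0, 3.0, 100.0, 300.0], [2, 2, 8, 8])
```

Both attention outputs have shape `(d_model,)` regardless of team size, which is the property a fixed-size MLP over concatenated teammate states lacks. Grouping the normalization by team size keeps one configuration's reward scale from dominating the policy update.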

B. Masked Adversarial Motion Prior (Masked AMP)

To address the lack of multi-human reference data, the authors introduce a Masked AMP strategy:

Dual Discriminators: Two discriminators are trained:
1. Full-body Discriminator ($D_{full}$): Evaluates the entire body against single-human reference motions (e.g., walking).
2. Masked Discriminator ($D_{mask}$): Evaluates the body excluding the parts interacting with the object (e.g., hands and forearms are masked out).
Blended Reward: During training, the style reward is a weighted blend of the two discriminators based on an interaction indicator ($\alpha_t$). When an agent is not interacting with the object, $D_{full}$ supervises the whole body; when it is interacting, the reward shifts toward $D_{mask}$, which ignores the interacting limbs and leaves them to be shaped by the task rewards.
Benefit: This allows the policy to learn diverse, task-specific interactions (e.g., lifting a table sideways) using only single-human reference motions (e.g., sideways walking), as the task rewards guide the masked regions rather than the rigid reference data.
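One way to picture the blended style reward is the sketch below. It rests on stated assumptions rather than the paper's exact equations: the blend is taken to be a linear interpolation with weight $\alpha_t$, the per-discriminator reward uses the least-squares form from the original AMP formulation, and the feature layout and function names are hypothetical.

```python
import numpy as np

def amp_style_reward(d_out):
    # Least-squares AMP reward: max(0, 1 - 0.25 * (D - 1)^2).
    # Whether TeamHOI uses this exact form is an assumption.
    d_out = np.asarray(d_out, dtype=float)
    return np.maximum(0.0, 1.0 - 0.25 * (d_out - 1.0) ** 2)

def mask_features(obs, masked_idx):
    # Drop the feature columns of the interacting body parts
    # (e.g. hands and forearms) before they reach D_mask.
    keep = np.setdiff1d(np.arange(obs.shape[-1]), masked_idx)
    return obs[..., keep]

def blended_style_reward(d_full_out, d_mask_out, alpha):
    # alpha = 0: not interacting, full-body supervision only.
    # alpha = 1: interacting, masked supervision only.
    alpha = np.clip(alpha, 0.0, 1.0)
    return (1.0 - alpha) * amp_style_reward(d_full_out) \
        + alpha * amp_style_reward(d_mask_out)

# Hypothetical 6-feature observation whose last two entries are hand features.
obs = np.arange(6.0)
masked_obs = mask_features(obs, masked_idx=[4, 5])

# While carrying, D_full may score the pose poorly (arms do not match walking
# data) while D_mask still scores it well (legs and torso look natural).
r_walking = blended_style_reward(d_full_out=1.0, d_mask_out=1.0, alpha=0.0)
r_carrying = blended_style_reward(d_full_out=-1.0, d_mask_out=1.0, alpha=1.0)
```

In this sketch the carrying pose keeps a full style reward because the masked discriminator never sees the arms, so the task rewards are free to place the hands on the table.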

C. Formation and Task Rewards

For the specific task of cooperative table carrying, the authors design a formation reward that is agnostic to table shape and team size:

Angular Spread Reward ($r_{ang}$): Encourages agents to spread evenly around the object's perimeter.
Principal-Axes Coverage Reward ($r_{cov}$): Ensures agents distribute their support along the object's principal axes of rotational stability. This prevents agents from clustering on one side, which would cause the object to tip.
Task Pipeline: The task involves walking to the object, inferring contact points, lifting, transporting to a target, and putting it down.
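The two formation terms can be illustrated with a short sketch. The exact reward definitions are not given in this summary, so both functions below are hypothetical stand-ins that only capture the stated intent: `angular_spread_reward` penalizes deviation from equal angular gaps of $2\pi/N$, and `axis_coverage_reward` checks that agents are present on both sides of each principal axis. The `sharpness` parameter and the 2-D top-down simplification are assumptions.

```python
import numpy as np

def angular_spread_reward(agent_xy, center_xy, sharpness=2.0):
    # Angle of each agent around the object center, sorted counter-clockwise.
    rel = np.asarray(agent_xy, dtype=float) - np.asarray(center_xy, dtype=float)
    angles = np.sort(np.arctan2(rel[:, 1], rel[:, 0]))
    # Gaps between neighbours, including the wrap-around gap.
    gaps = np.diff(np.append(angles, angles[0] + 2.0 * np.pi))
    ideal = 2.0 * np.pi / len(angles)
    # Equals 1 when every gap is ideal, decays as gaps become uneven.
    return float(np.exp(-sharpness * np.mean(np.abs(gaps - ideal))))

def axis_coverage_reward(agent_xy, center_xy, axes, eps=1e-8):
    # Unit directions from the object center to each agent.
    rel = np.asarray(agent_xy, dtype=float) - np.asarray(center_xy, dtype=float)
    unit = rel / (np.linalg.norm(rel, axis=1, keepdims=True) + eps)
    scores = []
    for axis in np.asarray(axes, dtype=float):
        proj = unit @ axis
        pos = max(proj.max(), 0.0)      # best support on the +axis side
        neg = max((-proj).max(), 0.0)   # best support on the -axis side
        scores.append(min(pos, neg))    # both sides must be covered
    return float(np.mean(scores))

center = (0.0, 0.0)
axes = [(1.0, 0.0), (0.0, 1.0)]         # assumed principal axes of the table
even = [(1, 0), (0, 1), (-1, 0), (0, -1)]
clustered = [(1, 0), (1, 0.1), (1, -0.1), (1, 0.2)]

r_ang_even = angular_spread_reward(even, center)
r_ang_clustered = angular_spread_reward(clustered, center)
r_cov_even = axis_coverage_reward(even, center, axes)
r_cov_clustered = axis_coverage_reward(clustered, center, axes)
```

Neither function refers to the table's outline or to a fixed agent count, which mirrors the shape- and size-agnostic property described above. An evenly spaced team scores close to 1 on both terms, while a team crowded on one side scores much lower.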

3. Key Contributions

Unified Decentralized Policy: A single Transformer-based policy that handles varying team sizes (trained on 2–8 agents) and object configurations without retraining or fine-tuning.
Scalable Coordination via Teammate Tokens: The use of cross-attention allows agents to perceive and coordinate with teammates dynamically, mimicking real human social perception.
Masked AMP Strategy: A novel method to expand the diversity of cooperative behaviors by masking object-interacting body parts during motion prior supervision, enabling the emergence of complex group behaviors from single-agent data.
Shape- and Size-Agnostic Formation Reward: A reward function that guides agents to form stable lifting configurations regardless of the object's geometry (square, rectangular, round) or the number of agents.

4. Experimental Results

The framework was evaluated on a challenging cooperative table-carrying task involving square, rectangular, and round tables with varying weights.

Success Rates: TeamHOI achieved high success rates (97.5%–99.2%) across 2, 4, and 8 agents using a single policy. In contrast, baselines (adapted CooHOI*) trained on specific team sizes failed to generalize (e.g., a 2-agent policy failed with 8 agents).
Heavy-Load Performance: Under a 5× weight increase, TeamHOI maintained effective cooperation among 8 agents (81.1% success), whereas baselines failed completely.
Motion Quality: The policy demonstrated high cooperative time ratios (agents maintaining contact) and low jerk (smooth motion), indicating stable and synchronized teamwork.
Zero-Shot Generalization: The policy successfully generalized to unseen team sizes (up to 16 agents) and unseen table geometries (smaller and larger tables) without retraining.
Ablation Studies:
- Removing Masked AMP led to failed coordination during lifting due to conflicts between motion realism and task requirements.
- Removing the Principal-Axes Coverage Reward resulted in unstable formations and unnatural diagonal stepping patterns.
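For readers who want to see how the two motion-quality metrics could be computed, here is a rough sketch. The definitions are assumptions: jerk is estimated as the third finite difference of position, and the cooperative time ratio is taken to be the fraction of timesteps in which every agent is in contact with the object. The paper's exact definitions may differ.

```python
import numpy as np

def mean_jerk(positions, dt):
    # positions: (T, 3) trajectory sampled every dt seconds.
    positions = np.asarray(positions, dtype=float)
    # Third finite difference approximates the third time derivative.
    jerk = np.diff(positions, n=3, axis=0) / dt ** 3
    return float(np.mean(np.linalg.norm(jerk, axis=-1)))

def cooperative_time_ratio(contacts):
    # contacts: (T, N) booleans, True where agent n touches the object at step t.
    contacts = np.asarray(contacts, dtype=bool)
    return float(np.mean(contacts.all(axis=1)))

dt = 0.1
t = np.arange(10) * dt
# x(t) = t^3 has a constant third derivative of 6.
cubic = np.stack([t ** 3, np.zeros_like(t), np.zeros_like(t)], axis=1)
jerk_value = mean_jerk(cubic, dt)

# Two agents over four steps; both are in contact on steps 0 and 2 only.
ratio = cooperative_time_ratio([[1, 1], [1, 0], [1, 1], [0, 0]])
```

Lower jerk indicates smoother motion, and a ratio near 1 indicates that the team rarely loses its joint grip during transport.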

5. Significance

TeamHOI represents a significant step forward in embodied AI and multi-character animation.

Scalability: It addresses the "fixed-size" bottleneck in multi-agent reinforcement learning, enabling systems that can scale from small groups to larger teams without retraining.
Data Efficiency: By leveraging masked AMP, it reduces the dependency on scarce multi-human motion capture data, making it feasible to train complex cooperative behaviors using readily available single-agent datasets.
Applications: The framework has broad implications for robotics (swarm logistics, collaborative manufacturing), virtual reality (multi-player games), and film animation (generating realistic group interactions).

In summary, TeamHOI demonstrates that a single, decentralized policy can learn to coordinate complex physical interactions across variable team sizes and object configurations, bridging the gap between single-agent control and scalable multi-agent cooperation.