Dual-Agent Multiple-Model Reinforcement Learning for Event-Triggered Human-Robot Co-Adaptation in Decoupled Task Spaces

This paper proposes a Dual-Agent Multiple-Model Reinforcement Learning (DAMMRL) framework for a shared-control 6-DoF rehabilitation robot. An event-triggered strategy decouples the human and robot tasks so the two can co-adapt: the human selects a speed-accuracy trade-off, while the robot adjusts its motion step size to suppress oscillations and improve task success rates.

Yaqi Li, Zhengqi Han, Huifang Liu, Steven W. Su

Published 2026-03-09

Imagine you are trying to guide a very large, heavy robotic arm to pick up a cup of coffee. You have a button that tells the robot "Up" or "Down," but the robot has to figure out how to move its elbow, wrist, and shoulder to actually get there without shaking or overshooting.

This paper describes a new, smarter way for a human and a robot to work together on this task, specifically for helping people relearn how to move their arms after an injury.

Here is the breakdown of their invention, using some everyday analogies:

1. The Problem: The "Shaky Hand" Effect

In older robotic systems, the robot and the human were like two people trying to walk in step by staring at their watches instead of each other. The robot checked its position every 100 milliseconds, regardless of whether it had actually finished moving.

  • The Analogy: Imagine trying to park a car by checking your position every second, even if you haven't stopped moving yet. You'd likely overcorrect, jerk the wheel left, then right, then left again. In robotics, this is called "chatter" or oscillation. The robot gets nervous, shakes around the target, and never quite settles.

2. The Solution: The "Admission Sphere" (Event-Triggered Control)

Instead of checking the clock, the new system uses a "magic bubble" around the target.

  • The Analogy: Think of the target as a bullseye. The robot is only allowed to take its next step once it has actually floated inside a specific bubble (an admission sphere) around the target.
  • How it helps: The robot doesn't rush. It waits until it is physically stable and inside the bubble before asking, "Okay, where do I go next?" This stops the shaking. It's like waiting for a boat to stop rocking before you try to step off onto the dock.
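The admission-sphere test above boils down to a single distance check. Here is a minimal sketch in Python; the function name, target, and sphere radius are illustrative assumptions, not values from the paper:

```python
import numpy as np

def next_waypoint_allowed(position, target, radius):
    """Trigger the next step only once the end-effector has settled
    inside the admission sphere around the current target."""
    return np.linalg.norm(position - target) <= radius

target = np.array([0.4, 0.1, 0.3])
radius = 0.05  # assumed sphere radius in metres

print(next_waypoint_allowed(np.array([0.42, 0.1, 0.31]), target, radius))  # inside: True
print(next_waypoint_allowed(np.array([0.6, 0.1, 0.3]), target, radius))    # still travelling: False
```

The key point is what is *absent*: there is no timer. The controller simply does nothing new until this predicate becomes true, which is why the chatter disappears.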

3. The Team: The Human and the Robot as Co-Pilots

The system splits the job into two distinct roles, like a driver and a GPS navigator.

  • The Human (The Driver): You only have to make two simple choices:
    1. Direction: "Up" or "Down" (using a simple button or sensor).
    2. Tolerance: "How close do I need to be?" You can choose a Big Bubble (I want to go fast, I don't mind being a little off) or a Small Bubble (I want to be super precise, take my time).
  • The Robot (The GPS & Mechanic): The robot handles all the complicated math. It figures out how to move the elbow, wrist, and shoulder to get you there. Crucially, it adjusts its own "stride."
    • If you chose the Big Bubble (Fast mode), the robot takes long strides to get there quickly.
    • If you chose the Small Bubble (Precision mode), the robot takes tiny, careful steps to ensure you hit the mark perfectly.
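The driver/navigator split can be sketched in a few lines. The linear mapping from the human's chosen tolerance (bubble size) to the robot's stride below is an assumed rule for illustration, not the paper's learned policy:

```python
import numpy as np

def robot_step(position, direction, tolerance, gain=2.0):
    """Take one step in the commanded direction; stride scales with the
    human's chosen tolerance (big bubble -> long stride)."""
    stride = gain * tolerance
    return position + stride * direction

up = np.array([0.0, 0.0, 1.0])   # the human's simple "Up" command
pos = np.array([0.4, 0.1, 0.3])

print(robot_step(pos, up, tolerance=0.05))  # fast mode: 0.10 m stride
print(robot_step(pos, up, tolerance=0.01))  # precision mode: 0.02 m stride
```

Note how little the human has to specify: a direction and a tolerance. Everything about joint angles and step length stays on the robot's side of the interface.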

4. The Brain: "Dual-Agent Multiple-Model Learning" (DAMMRL)

This is the fancy part. The robot isn't just following a manual; it's learning how you think.

  • The Analogy: Imagine a dance partner who has practiced with 8 different versions of you.
    • Version A: You are fast but make mistakes.
    • Version B: You are slow but very accurate.
    • The robot uses Reinforcement Learning (trial and error) to figure out which "version" of you is dancing today. It then picks the perfect dance move (step size) to match your style.
  • The Training: They didn't just throw this at a real robot immediately. They trained it in a video game (MuJoCo simulation) first, then let real humans play with a virtual robot, and finally put it on the real machine. This is like a pilot training in a simulator before flying a real plane.
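The multiple-model idea can be sketched as a bank of candidate "human styles" scored against recent behaviour, with the best match driving the robot's choice of step size. All model names, parameters, and the nearest-mean selection rule here are illustrative assumptions; the paper uses reinforcement learning rather than this hand-written rule:

```python
import numpy as np

# Assumed bank of candidate human models (speed vs. accuracy styles).
human_models = {
    "fast_but_sloppy":  {"pref_tolerance": 0.05, "step_size": 0.08},
    "slow_but_precise": {"pref_tolerance": 0.01, "step_size": 0.02},
}

def select_model(observed_tolerances):
    """Pick the model whose preferred tolerance best explains the
    human's recent bubble-size choices."""
    mean_tol = np.mean(observed_tolerances)
    return min(human_models,
               key=lambda m: abs(human_models[m]["pref_tolerance"] - mean_tol))

recent = [0.04, 0.05, 0.06]            # human kept choosing big bubbles
style = select_model(recent)
print(style, human_models[style]["step_size"])  # fast_but_sloppy 0.08
```

In the paper's framing, this is the "which version of you is dancing today?" question: identify the active human model, then commit to the step size that partners it.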

5. The Result

When they tested this new system:

  • No more shaking: The "magic bubble" stopped the robot from jittering.
  • Better teamwork: The robot learned to match the human's speed. If the human wanted to rush, the robot rushed (safely). If the human wanted to be careful, the robot slowed down.
  • Success: People were able to grab objects more often and more smoothly than with traditional robots.

Summary

This paper presents a rehabilitation robot that stops "overthinking" and shaking. Instead of moving on a strict timer, it waits until it's stable. It then acts like a smart dance partner, instantly adjusting its speed and precision to match exactly how the human patient wants to move, making the recovery process smoother, safer, and more effective.