Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation

Imagine you are hosting a dinner party, and you've invited a robot friend to help you cook.

In the old days, robots were like obedient butler bots: you had to give them a strict list of instructions ("Chop the onions," "Turn on the stove"), and they would try to do exactly that. If you asked them to do something they couldn't do (like "cut the steak with a plastic spoon"), they would just keep trying until they broke something, or they would freeze. They never said, "Hey, I can't do that, but I can get you a knife."

MICoBot (Mixed-Initiative Collaborative Robot) is different. Think of MICoBot not as a butler, but as a smart kitchen partner.

The Core Idea: A Two-Way Conversation

The paper introduces a system where both you and the robot can take the lead in the conversation.

You can say, "Hey, can you grab the scissors?"
The Robot can say, "I can grab the scissors, but I can't cut the package. Can you do that part?"

This is called Mixed-Initiative. It means neither of you is stuck in a "boss vs. worker" role. You are a team negotiating who does what based on who is better at the job right now.

How MICoBot Thinks (The Three-Layer Brain)

To make this work, MICoBot uses a three-step thinking process, like a manager, a strategist, and a worker all in one:

The Manager (Meta-Planner): This part listens to your conversation. If you say, "I'm tired today," the Manager updates the plan. It writes a little piece of computer code that says, "Okay, since the human is tired, let's have the robot do more heavy lifting." It adapts the rules of the game in real-time.
The Strategist (Planner): This part looks at the to-do list. It asks two questions:
- Can the robot do this? (It checks a "skill database" built from thousands of simulations).
- Is the human willing to help? (It listens to your tone. If you sound grumpy or busy, it knows you might say "no," so it tries to do the task itself or finds a different way).
With both answers in hand, it calculates the best split: "I'll do the heavy lifting, you do the delicate cutting."
The Worker (Action Executor): This is the part that actually moves the robot's arms or speaks to you. If the plan says "Robot brings scissors," the Worker moves the robot. If the plan says "Robot asks for help," the Worker generates a polite sentence like, "Could you open this for me?"

The Real-World Test: The "Party Prep" Challenge

The researchers tested this with 18 real people and a robot arm in a fake apartment. They gave them three messy tasks:

Pouring a package: Bringing a bowl and package, cutting it open, and pouring it. (Robots are bad at cutting; humans are good).
Assembling a toy car: Bringing parts, drilling wheels, and screwing things together. (Robots are bad at fine motor skills like drilling; humans are good).
Packing a gift box: Folding boxes, wrapping ribbons, and taping bows. (Robots are bad at delicate ribbon work).

The Results:

The Old Way (LLM Baseline): A standard AI chatbot tried to be the boss. It often tried to do things it couldn't do (like cutting the package), failed, and the whole task fell apart. Success rate: 28%.
The MICoBot Way: MICoBot realized, "I can't cut this. I'll ask my human partner." It negotiated, adapted when the human was busy, and took over when the human was tired. Success rate: 78%.

Why It Matters

Think of MICoBot as the difference between a scripted video game character and a real-life teammate.

Scripted Character: "I will follow your orders until I crash."
Real Teammate: "I see you're struggling with that box. Let me hold it while you tape it. Or, if you're busy, I'll try to do it myself, but I might need a hand."

The paper argues that for robots to be truly helpful in our homes, they need to stop just "listening" and start talking back, negotiating, and understanding that humans are unpredictable, sometimes tired, and sometimes very willing to help. The authors present MICoBot as the first system to handle this dance of "who does what" through natural conversation.

1. Problem Statement

The paper addresses the challenge of long-horizon human-robot collaborative manipulation in unstructured environments (e.g., household tasks). Current systems often suffer from rigid collaboration models:

One-directional interaction: Most AI/LLM systems wait for human commands, while traditional HRI systems often assume fixed plans and full human compliance.
Lack of adaptability: Robots struggle to adapt to varying human capabilities, willingness to help, and changing contexts.
Inefficient task allocation: Systems often fail to balance the trade-off between maximizing task success and minimizing human effort, frequently assigning tasks to agents (human or robot) that are incapable of performing them.

The authors propose a shift toward Mixed-Initiative Dialog, where both the human and the robot can proactively propose, accept, or reject task steps, negotiate roles, and adapt strategies in real-time using natural language.

2. Methodology: MICoBot Framework

MICoBot (Mixed-Initiative Collaborative roBot) is a hierarchical system designed to optimize task allocation through a three-level decision-making process. It models the interaction as a Markov Decision Process (MDP) where agents perform both physical actions ($A_p$) and verbal actions ($A_v$).
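One way to picture the two action types is a small data model. This is only an illustrative sketch: the class names `Agent`, `ActionKind`, and `Action`, and the helper `split_actions`, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Agent(Enum):
    ROBOT = "R"
    HUMAN = "H"


class ActionKind(Enum):
    PHYSICAL = "physical"  # an element of A_p, e.g. fetching an object
    VERBAL = "verbal"      # an element of A_v, e.g. a request or a reply


@dataclass(frozen=True)
class Action:
    """One step taken by either agent (hypothetical representation)."""
    agent: Agent
    kind: ActionKind
    content: str


def split_actions(trace):
    """Partition a trace into its physical and verbal actions."""
    physical = [a for a in trace if a.kind is ActionKind.PHYSICAL]
    verbal = [a for a in trace if a.kind is ActionKind.VERBAL]
    return physical, verbal


trace = [
    Action(Agent.HUMAN, ActionKind.VERBAL, "Can you grab the scissors?"),
    Action(Agent.ROBOT, ActionKind.PHYSICAL, "bring scissors"),
    Action(Agent.ROBOT, ActionKind.VERBAL, "Could you cut the package?"),
]
physical, verbal = split_actions(trace)
```

Because either agent can produce either kind of action, the same trace can hold a human request followed by a robot request, which is the mixed-initiative pattern the paper targets.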

A. Three-Level Architecture

Level 1: Meta-Planner (Strategy Generation)
- Function: Uses a Large Language Model (LLM, GPT-4o) to generate adaptive planning code based on the current symbolic state, task plan, and dialog history.
- Output: It produces two code modules:
  - Task Allocation Code: Maps human dialog into constraints for the optimization problem (e.g., "Human wants to do step X").
  - Action Selection Code: Decides the high-level strategy (e.g., "Negotiate," "Propose split," "Execute").
Level 2: Iterative Planner (Optimization & Decision)
- Function: Executes the code from Level 1 to solve a constrained optimization problem.
- Objective: Find the optimal task allocation $G^*$ that maximizes task success probability while minimizing human effort.
- Optimization Function:
  $\max_{G} \sum_{t} \left( \mathbb{1}_{g_t=H} \cdot \frac{\alpha}{p_{H,t}} + \mathbb{1}_{g_t=R} \right) Q_{g_t}(s_t, a_t)$
  - $Q_{R}$ and $Q_{H}$: Agent-specific Q-functions representing the expected time-to-success (effort) and likelihood of failure.
  - $\alpha$: A factor weighting human effort higher than robot effort.
  - $p_{H,t}$: The estimated probability of the human agreeing to help, inferred from dialog sentiment.
  - Constraints: The system enforces constraints derived from human requests. If no feasible allocation exists (e.g., human asks robot to do an impossible task), the planner iteratively relaxes constraints and explains the limitation verbally.
- Q-Function Estimation:
  - Robot ($Q_R$): Trained via supervised learning on simulation data (OmniGibson) to predict timesteps for success/failure.
  - Human ($Q_H$): Estimated using an LLM to predict human execution time plus travel time, assuming perfect competence but variable willingness ($p_{H,t}$).
Level 3: Action Executor (Execution)
- Function: Executes the selected primitive action.
- Physical: Generates low-level trajectories for navigation and manipulation (using ROS move_base, Grounding DINO for object detection, and Inverse Kinematics).
- Verbal: Uses an LLM with In-Context Learning (ICL) to generate natural language utterances (requests, responses, negotiations) grounded in the task context.
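The Level 2 objective and its constraint relaxation can be illustrated with a small brute-force sketch. Several things here are assumptions rather than details from the paper: Q-values are treated as negative costs (so that maximizing the sum minimizes weighted effort), an impossible robot step is encoded as negative infinity, human requests are a dictionary from step index to agent, and the relaxation order (latest step first) is a placeholder policy. All function names and numbers are hypothetical.

```python
import math
from itertools import product

ROBOT, HUMAN = "R", "H"


def objective(alloc, q_robot, q_human, p_human, alpha):
    """Weighted sum of Q-values for one allocation (higher is better)."""
    total = 0.0
    for t, agent in enumerate(alloc):
        if agent == HUMAN:
            # Human steps are scaled by alpha / p_H: costlier when the
            # human seems less willing to help.
            total += (alpha / p_human[t]) * q_human[t]
        else:
            total += q_robot[t]
    return total


def best_allocation(q_robot, q_human, p_human, alpha, constraints):
    """Enumerate all allocations that satisfy `constraints`.

    Returns (allocation, score), or (None, -inf) if nothing is feasible.
    """
    best, best_score = None, -math.inf
    for alloc in product((ROBOT, HUMAN), repeat=len(q_robot)):
        if any(alloc[t] != agent for t, agent in constraints.items()):
            continue
        score = objective(alloc, q_robot, q_human, p_human, alpha)
        if score > best_score:
            best, best_score = alloc, score
    return best, best_score


def plan_with_relaxation(q_robot, q_human, p_human, alpha, constraints):
    """Drop constraints one at a time until a feasible plan exists."""
    active = dict(constraints)
    dropped = []  # constraints the robot would need to explain verbally
    while True:
        alloc, score = best_allocation(q_robot, q_human, p_human, alpha, active)
        if alloc is not None or not active:
            return alloc, score, dropped
        step = max(active)  # placeholder policy: relax the latest step first
        dropped.append((step, active.pop(step)))


# Steps: 0 = bring bowl, 1 = cut package, 2 = pour (made-up numbers).
q_robot = [-10.0, -math.inf, -20.0]  # the robot cannot cut
q_human = [-15.0, -5.0, -8.0]
p_human = [0.5, 0.5, 0.5]
alpha = 2.0

# The human asks the robot to do the cutting step, which is infeasible.
alloc, score, dropped = plan_with_relaxation(
    q_robot, q_human, p_human, alpha, {1: ROBOT}
)
```

In this toy case the requested constraint is dropped, the cutting step goes to the human, and the robot keeps the other two steps. The `dropped` list is what a verbal explanation ("I can't cut the package") would be generated from. Exhaustive enumeration is only practical for short plans; it is used here for clarity, not because the paper prescribes it.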

B. Key Mechanisms

Hierarchical Planning: The system groups low-level steps into high-level abstract actions to reduce dialog complexity, only descending to granular details during negotiation.
Dynamic $p_{H,t}$ Estimation: The system continuously updates the probability of human helpfulness based on sentiment analysis of the dialog history, allowing it to adapt to reluctant or overly proactive users.
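To make the effect of a changing $p_{H,t}$ concrete, here is a minimal sketch. The update rule (a running blend of the previous estimate and a sentiment score in [0, 1]) is an assumption chosen for illustration, not the paper's estimator; `update_p_human`, `human_effort_weight`, `rate`, and `floor` are hypothetical names. Only the weight $\alpha / p_{H,t}$ comes from the objective function above.

```python
def update_p_human(p_prev, sentiment, rate=0.3, floor=0.05):
    """Blend the previous estimate with the newest sentiment score.

    `sentiment` is assumed to lie in [0, 1], where 0 means the human
    sounds unwilling and 1 means the human sounds eager to help.
    The floor keeps the estimate away from zero so alpha / p stays finite.
    """
    p_new = (1 - rate) * p_prev + rate * sentiment
    return min(1.0, max(floor, p_new))


def human_effort_weight(alpha, p_human):
    """Multiplier applied to human Q-values in the objective."""
    return alpha / p_human


p = 0.5
weight_before = human_effort_weight(2.0, p)

# The human replies reluctantly (sentiment 0.0), so the estimate drops.
p = update_p_human(p, sentiment=0.0, rate=0.5)
weight_after = human_effort_weight(2.0, p)
```

As the estimate falls, the weight on human steps rises, so the planner becomes less inclined to assign work to a reluctant partner unless the robot cannot do the step itself.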

3. Key Contributions

New Problem Setting: Integrates mixed-initiative natural language dialog with physical human-robot interaction, moving beyond single-initiative (human-led or robot-led) paradigms.
Optimization Framework: A novel objective function that balances task success, human effort, and human preferences (constraints) using a unified Q-value metric.
Hierarchical Robotic System: Presented by the authors as the first system to enable seamless speech-to-speech mixed-initiative collaboration for long-horizon physical tasks, adapting to diverse human collaborators.
Simulation Environment: A collaborative simulation framework (built on MiniBehavior) featuring LLM-controlled virtual humans with parametric helpfulness and mood.

4. Experimental Results

The system was evaluated in real-world trials with 18 unique human participants on a TIAGo mobile manipulator across three household tasks (Pouring, Assembling a Toy Car, Packing a Gift Box) and in simulation.

Quantitative Results (Real-World)

Task Success Rate: MICoBot achieved 77.8% success compared to 27.8% for a pure LLM baseline (statistically significant, $p=0.007$).
Step Completion: MICoBot completed 93.8% of task steps vs. 58.2% for the baseline.
User Preference: 77.8% of participants preferred MICoBot over the LLM baseline.
User Satisfaction: MICoBot scored significantly higher on Likert scales for overall satisfaction, communicative ability, and awareness of limitations.
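As a quick sanity check on the reported percentages, the sketch below converts them back to counts. It assumes a denominator of 18 for each figure (for example, one trial per participant per method), which this summary does not state explicitly; `rate_to_count` is a hypothetical helper.

```python
def rate_to_count(percent, n):
    """Nearest whole-number count that a percentage of n corresponds to."""
    return round(percent / 100 * n)


n = 18  # assumed denominator: number of participants
micobot_successes = rate_to_count(77.8, n)
baseline_successes = rate_to_count(27.8, n)

# Recompute the percentages from the counts to confirm they round back.
micobot_rate = round(micobot_successes / n * 100, 1)
baseline_rate = round(baseline_successes / n * 100, 1)
```

Under that assumption, the success rates correspond to 14 of 18 and 5 of 18, and the 77.8% user-preference figure likewise corresponds to 14 of 18 participants.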

Qualitative & Ablation Insights

Mixed-Initiative Necessity: Ablation studies showed that restricting dialog to single-initiative modes (Robot-only or Human-only) drastically reduced success rates. The full mixed-initiative approach was critical for negotiating impossible tasks.
Adaptability: MICoBot successfully persuaded reluctant users to perform necessary steps and refused to attempt tasks it knew it couldn't do, whereas the LLM baseline often "over-promised" and failed.
Efficiency: While MICoBot engaged more human effort (40% of steps vs. 18% for baseline), it did so more effectively to ensure task completion, outperforming even "oracle" baselines that assumed perfect human cooperation.

5. Significance and Impact

Paradigm Shift: The paper demonstrates that effective human-robot teams require bidirectional agency. Robots must be able to initiate requests and negotiate, not just follow commands.
Robustness: By explicitly modeling human willingness ($p_{H,t}$) and robot affordances, the system avoids the "hallucination" of capabilities common in pure LLM agents, leading to higher reliability in physical tasks.
Scalability: The use of code-generation (Meta-Planner) allows the system to adapt its logic to new constraints and tasks without retraining the underlying policy, making it suitable for diverse, long-horizon scenarios.

In conclusion, MICoBot makes the case that mixed-initiative dialog is not just a communication tool but a critical control mechanism for optimizing task allocation, and its real-world results (77.8% task success versus 27.8% for the LLM baseline) support that claim.