Imagine you are hiring a customer service representative for a company. In the past, you only tested them with written chat. You'd ask, "Where is my package?" and see if they could find the answer. But in the real world, customers call in, they get frustrated, they might not know the technical terms, and they might even be talking to a robot that needs to "read the room."
This paper introduces a new, much tougher test for these AI agents, called MM-tau-p2. It's like upgrading from a written pop quiz to a full-blown, high-stakes role-playing simulation that includes voice, personality, and chaos.
Here is the breakdown in simple terms:
1. The Old Way vs. The New Way
- The Old Way (Text-Only): Imagine a robot that only reads your emails. It doesn't know if you are angry, confused, or an expert. It just follows a script. If you say "Boston," it assumes you mean the city, not a name.
- The New Way (MM-tau-p2): This new test forces the AI to handle voice calls (which can be noisy or misunderstood by the computer) and, crucially, to adapt to the person on the other end.
- The "Persona" Twist: The test creates different types of "customers." Some are Experts (who know exactly what they want), and some are Novices (who are confused, vague, and maybe a bit grumpy). The AI has to figure out who it's talking to and change its tone and strategy accordingly.
2. The "Dual-Control" Game
Think of this like a dance, not a monologue.
- In old tests, the AI was the dancer, and the human was just a statue giving instructions.
- In this new test, the "human" (actually a computer simulator) is a real partner. They can interrupt, change their mind, give bad information, or get frustrated. The AI has to lead the dance, follow the partner's steps, and keep the music going without tripping.
3. The 12 New "Scorecards"
The authors didn't just ask, "Did they solve the problem?" They created 12 different scorecards to judge the AI on things we actually care about in real life:
- The "Did They Listen?" Score: Did the AI get the phone number or order ID right? (One wrong digit = game over).
- The "Voice vs. Text" Score: Does the AI perform just as well when you talk to it as when you type to it? (Often, voice is harder because of background noise).
- The "Patience" Score: How many times did the human have to repeat themselves? If the AI makes you say "No, I said Boston, not Austin" five times, it gets a bad score.
- The "Safety" Score: This is huge. If the AI is about to cancel a subscription or charge a credit card, did it ask for confirmation first? If it just did it without asking, it fails immediately.
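To make the scorecard idea concrete, here is a minimal sketch of how three of these checks could be computed. The function names and metric definitions below are illustrative assumptions for this explainer, not the paper's actual formulas.

```python
# Sketch of three scorecard-style checks. The metric definitions are
# simplified assumptions, not the benchmark's exact scoring rules.

def slot_exact_match(expected: str, transcribed: str) -> bool:
    """'Did They Listen?': an order ID or phone number must match
    character-for-character -- one wrong digit fails the check."""
    return expected == transcribed

def repetition_count(user_turns: list[str]) -> int:
    """'Patience': count turns where the user had to correct or
    repeat themselves (keyword heuristic, for illustration only)."""
    markers = ("no, i said", "i already told you", "not what i")
    return sum(any(m in turn.lower() for m in markers) for turn in user_turns)

def safety_gate(action: str, confirmed: bool) -> bool:
    """'Safety': irreversible actions need confirmation first.
    An unconfirmed cancel or charge is an immediate failure."""
    irreversible = {"cancel_subscription", "charge_card"}
    return confirmed or action not in irreversible

# One wrong digit fails; an unconfirmed charge fails.
assert not slot_exact_match("555-0142", "555-0141")
assert not safety_gate("charge_card", confirmed=False)
```

The point of the sketch is that these scores are mechanical and unforgiving: the AI either got the digits right and asked permission, or it didn't.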
4. The Big Surprise: "Knowing the Customer" is a Double-Edged Sword
The researchers tested three ways of giving the AI information about the user:
- No Info: The AI has to guess who you are.
- Static Info: The AI is told, "This user is a confused beginner."
- Dynamic Info: The AI watches the conversation and updates its guess in real-time (e.g., "Oh, this user is getting angry, I need to be more polite").
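The difference between the static and dynamic conditions can be sketched in a few lines. The persona labels and the keyword-based update rule here are illustrative assumptions, not the paper's actual mechanism:

```python
# Sketch of static vs. dynamic persona conditioning. The labels and
# the update heuristic are illustrative assumptions.

def update_persona(persona: str, user_turn: str) -> str:
    """Dynamic conditioning: revise the persona guess every turn
    instead of trusting one fixed upfront label."""
    text = user_turn.lower()
    if any(w in text for w in ("api", "sku", "refund policy")):
        return "expert"      # technical vocabulary -> likely an expert
    if any(w in text for w in ("this is ridiculous", "so frustrating")):
        return "frustrated"  # tone shift -> change strategy
    return persona           # no new evidence: keep the current guess

# Static: the label "novice" never changes, even if the user does.
# Dynamic: the guess tracks the conversation turn by turn.
persona = "novice"
for turn in ["hi, my thing is broken",
             "look, just reset the API token for SKU 4417"]:
    persona = update_persona(persona, turn)

assert persona == "expert"  # the "novice" turned out to be tech-savvy
```

A real system would use a model rather than keywords, but the structure is the same: the static condition fixes `persona` once, while the dynamic condition re-runs the estimate after every user turn.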
The Result:
- Static Info Backfired: Telling the AI up front "This user is a beginner" actually made it worse at handling difficult users. The fixed label was like a "Fragile" sticker on a package whose contents keep changing. The AI got stuck trying to be "simple" even when the user clearly needed something else.
- Dynamic Info Won: The AI that watched the conversation and adapted on the fly (the "Dynamic" approach) did the best job. It realized, "Oh, this user isn't just a beginner; they are actually a tech-savvy person who is just having a bad day."
5. The "Judge" Problem
To grade these tests, the authors used powerful AI models (GPT-4 and GPT-5) to act as the Judge.
- The Glitch: Even the smartest AI judges got confused. Sometimes, if the AI agent had to transfer the call to a real human because the problem was too hard, the Judge would say, "Great job!" (because the agent tried hard). Other times, the same Judge would say, "Fail!" (because the agent didn't fix it alone).
- The Lesson: It's hard to automate grading for complex human interactions. The "Judge" needs very specific rules, or it will give inconsistent scores.
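One common way to get those "very specific rules" is to resolve the clear-cut outcomes deterministically and only send genuinely ambiguous cases to the LLM judge. A minimal sketch, with hypothetical rule and field names:

```python
# Sketch: hard-code the unambiguous verdicts so the LLM judge cannot
# flip-flop on them. Field and verdict names are illustrative.

def grade(outcome: dict) -> str:
    # Hard rule: a transfer to a real human is always scored the same
    # way, instead of sometimes "pass" and sometimes "fail".
    if outcome.get("transferred_to_human"):
        return "partial"  # tried, but did not resolve it alone
    if outcome.get("unconfirmed_irreversible_action"):
        return "fail"     # safety violations are non-negotiable
    if outcome.get("task_completed"):
        return "pass"
    return "needs_llm_judge"  # only fuzzy cases reach the model

assert grade({"transferred_to_human": True}) == "partial"
assert grade({"task_completed": True}) == "pass"
```

The design point is that the judge's inconsistency came from it deciding policy questions (is a transfer a success?) on the fly; pinning those down in code makes the remaining automated grading repeatable.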
6. The Final Verdict: Safety vs. Speed
The most important finding is a trade-off.
- When the AI tried to be super efficient and adapt perfectly to the user's personality, it sometimes got reckless. It would skip safety checks (like asking for confirmation before a charge) because it was so focused on "being helpful."
- The Takeaway: Being a "good" agent isn't just about solving the problem fast. It's about solving it safely. The best agents are the ones that balance being helpful with being careful.
Summary Analogy
Imagine you are training a new waiter.
- Old Test: You give them a menu and ask them to take an order. If they get the order right, they pass.
- MM-tau-p2 Test: You put them in a busy restaurant. Sometimes you play a customer who is rude, confused, and speaks with a heavy accent; other times you play a demanding food critic.
- The Goal: The waiter must figure out who you are, listen through the noise of the kitchen, not spill your drink (Safety), and get your order right without making you wait too long (Efficiency).
This paper shows that while our AI waiters are getting better at taking orders, they still struggle to "read the room" without making mistakes, and we need better ways to test them before we let them serve real customers.