Imagine you are building a robot butler designed to help people book trains, reserve restaurant tables, or buy tickets. You want this robot to be smart, polite, and efficient.
The Problem: The "Too Nice" Training Camp
So far, most researchers have trained these robot butlers in a "training camp" where the human users are incredibly polite, patient, and cooperative. It's like teaching someone to swim in a calm, heated pool with no waves. The robot learns to follow instructions perfectly because its practice partners never make mistakes, never get angry, and never ask for things the robot can't do.
But in the real world? Real people are messy. They get impatient, they type half-sentences, they ask for things that don't exist, and sometimes they just want to chat about their day instead of booking a ticket. When these sheltered robots meet actual humans, they often crash, get confused, or give up.
The Solution: The "Chaos Simulator"
This paper introduces a new tool: a Non-Collaborative User Simulator. Think of it as a "villain generator" for your robot butler. Instead of training with perfect students, the simulator creates four specific types of "difficult" users to test the robot's limits (a rough code sketch of the idea follows this list):
- The "Impossible Requester" (Unavailable Services): This user asks for things the robot simply cannot do.
- Analogy: Imagine asking a robot butler to "Book me a table at a restaurant that doesn't exist yet" or "Get me a window seat on a train that only has open benches." The robot has to learn how to say "No" gracefully without breaking down.
- The "Chatterbox" (Tangential): This user keeps changing the subject.
- Analogy: The user asks the robot to book a train, then midway through blurts out, "By the way, do you think aliens exist?" or "What's the best pizza in town?" If the robot ignores these asides, the user gets annoyed. The robot must handle the side conversation without forgetting the main task.
- The "Impatient Screamer" (Impatience): This user gets angry when things take too long or fail.
- Analogy: The robot is thinking, and the user starts yelling, "Hurry up! This is taking forever! I'm going to cancel my subscription!" The robot has to learn not to panic or apologize endlessly (which wastes time) but to stay focused on solving the problem.
- The "Typo-Prone" (Incomplete Utterances): This user sends broken, half-finished messages.
- Analogy: Instead of saying "Book a train for two people," the user types "Book train 2..." and hits send. The robot has to be a mind-reader to figure out what was meant without getting confused.
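To make the idea concrete, here is a minimal sketch of how such a persona-driven simulator might be wired up, assuming an LLM role-plays the user from a persona prompt. The persona names, prompt wording, and `simulated_user_prompt` helper are illustrative assumptions, not the paper's actual code.

```python
import random

# Hypothetical sketch of a persona-driven user simulator. The persona
# names and prompt wording are illustrative assumptions.
PERSONAS = {
    "unavailable_services": (
        "Ask for options the system cannot offer, such as a seat class "
        "or a restaurant that does not exist in the database."
    ),
    "tangential": (
        "Drift off-topic with small talk mid-task, then expect the "
        "assistant to return to the original booking."
    ),
    "impatience": (
        "Complain about delays and threaten to quit if the assistant "
        "stalls or over-apologizes."
    ),
    "incomplete_utterances": (
        "Send short, fragmentary messages with details missing, "
        "e.g. 'book train 2...' instead of a full request."
    ),
}

def simulated_user_prompt(task: str, persona: str) -> str:
    """Build the system prompt for an LLM that role-plays the difficult user."""
    return (
        f"You are a user trying to: {task}\n"
        f"Behave according to this persona: {PERSONAS[persona]}"
    )

# Stress-test one dialogue with a randomly drawn persona.
persona = random.choice(list(PERSONAS))
print(simulated_user_prompt("book a train for two people", persona))
```

In a real test harness, the generated prompt would drive one side of the conversation while the robot butler under evaluation drives the other.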
What Happened When They Tested It?
The researchers took the smartest robot butlers available (the latest AI models) and put them through this "Chaos Simulator."
- The Result: The robots struggled. Their performance dropped significantly relative to their scores with polite, cooperative users.
- The Specific Failures:
- When asked for impossible things, they kept hunting for an answer that didn't exist, like a dog chasing its tail, instead of simply admitting the service was unavailable.
- When users got angry, the robots apologized too much, which slowed them down and made the angry users angrier.
- When users sent broken messages, the robots started "hallucinating" (making up fake details) just to fill in the blanks, leading to errors.
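That last failure points at an obvious countermeasure: when a message is incomplete, ask rather than guess. Here is a minimal, hypothetical sketch of such a guard; the slot names and the parsed-message format are assumptions for illustration, not the paper's design.

```python
# Hypothetical guard against "filling in the blanks": if required slots
# are missing from a parsed fragment, ask a clarifying question instead
# of inventing values. The slot names and parse format are assumptions.
REQUIRED_SLOTS = ["destination", "date", "passengers"]

def next_action(parsed: dict) -> str:
    missing = [slot for slot in REQUIRED_SLOTS if parsed.get(slot) is None]
    if missing:
        # Don't guess: hallucinated defaults are what caused the errors above.
        return f"Could you confirm the {missing[0]}? I want to get this right."
    return "proceed_with_booking"

# "Book train 2..." parses to a passenger count only; everything else is unknown.
print(next_action({"destination": None, "date": None, "passengers": 2}))
```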
The Big Lesson
The paper concludes that we can't just train robots on polite, perfect data. If we want them to work in the real world, we need to stress-test them with these difficult scenarios.
They also found that if you train a small, cheap robot only on "nice" data, it fails miserably when it meets a real, grumpy human. However, if you train it on a mix of "nice" and "difficult" data, it becomes much more robust.
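As a rough illustration of that finding, here is a hypothetical sketch of the mixing recipe. The function name, the 30% ratio, and the toy dialogue names are made up for the example; the paper's actual proportions may differ.

```python
import random

# Hypothetical sketch of the mixing recipe: blend cooperative dialogues
# with simulator-generated difficult ones before fine-tuning a small model.
# The 30% default and the toy data below are illustrative.
def build_training_mix(nice, difficult, difficult_frac=0.3):
    """Return a shuffled list in which roughly difficult_frac is hard data."""
    k = min(len(difficult), round(len(nice) * difficult_frac / (1 - difficult_frac)))
    mix = list(nice) + random.sample(list(difficult), k)
    random.shuffle(mix)
    return mix

training_data = build_training_mix(
    nice=["polite_dialogue_1", "polite_dialogue_2", "polite_dialogue_3"],
    difficult=["impatient_dialogue_1", "tangential_dialogue_1"],
)
print(training_data)
```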
In a Nutshell
This paper is like a driving school that finally stops teaching students only on empty, sunny roads. They are now adding potholes, angry pedestrians, and foggy weather to the training course. The goal is to make sure that when the robot butler finally goes to work, it doesn't crash when a real human says, "I'm in a hurry, and I want a unicorn for my birthday."