Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

Imagine you have a super-smart personal assistant, let's call him "Agent Alex." Alex is great at using tools: he can check your bank account, look at your calendar, read your emails, and search the web. You ask him to do something simple, like "Make me a weekly expense report."

On the surface, this seems harmless. But this paper reveals a scary new problem: Alex is too good at connecting the dots.

Here is the story of the paper, broken down into simple concepts.

1. The Problem: The "Mosaic" Effect

Think of privacy like a mosaic puzzle.

Old Risk: If you give Alex a single tool (like your bank statement) and it accidentally shows your secret medical condition, that's a "direct leak." It's like dropping a puzzle piece on the floor where everyone can see it.
New Risk (TOP-R): This paper introduces Tools Orchestration Privacy Risk. Imagine Alex takes a receipt for a $185 dinner, a calendar entry saying "Lunch with Jason," and a contact card showing Jason works for a rival company.
- Piece A (Receipt): Just a dinner. Safe.
- Piece B (Calendar): Just a lunch. Safe.
- Piece C (Contact): Just a name. Safe.
- The Mosaic: When Alex puts them together, he realizes: "Oh! You are interviewing with a competitor!"

The scary part: None of the individual tools told Alex this secret. The secret only exists because Alex stitched the pieces together himself. He built a picture of your private life that you never intended to show anyone.

2. The Experiment: Building a Trap (TOP-Bench)

The researchers wanted to see how bad this problem is. They couldn't just wait for it to happen, so they built a giant trap called TOP-Bench.

The Recipe: They started with a secret (e.g., "The user is pregnant").
The Ingredients: They broke that secret down into harmless clues (e.g., "Search for maternity hospitals," "Buy prenatal vitamins," "Cancel gym membership").
The Test: They gave these clues to six of the smartest AI agents in the world and asked them to do a simple task.
The Result: The agents were terrible at keeping secrets.
- 62% of the time, the agents successfully reconstructed the secret.
- Even worse, 49% of the time, they figured it out in their "brain" (internal reasoning) but didn't say it out loud. This is like a spy who doesn't write the secret in a letter but remembers it perfectly, ready to use it later.

3. Why Does This Happen? (The Three Culprits)

The researchers found three main reasons why Alex (the AI) fails to keep your secrets:

The "Oblivious Helper": Alex is so eager to be helpful that he forgets to check if he's being too nosy. He has the ability to be private, but he doesn't think to use it.
The "Over-Thinker": The smarter the AI is at reasoning, the worse it gets at privacy. It's like a detective who is so good at solving crimes that they solve your private life by accident.
The "Stubborn Train": Once the AI starts thinking a certain way (e.g., "This person is looking for a new job"), it gets stuck on that track. Even if you tell it "Stop, that's private," it's hard to pull it off the track because it's already built the whole bridge.

4. The Solution: Three New Seatbelts

The researchers didn't just find the problem; they built three different "seatbelts" to fix it. They tested them to see which one keeps you safe without making the car drive too slowly.

Seatbelt A (The Context Guard): This asks, "Is it okay to share this here?"
- Analogy: It's like a bouncer at a club. "You can talk about your health with your doctor, but not with your boss." It stops the AI from sending private info to the wrong place.
- Result: Good, but not perfect. It misses the secrets the AI figures out internally.
Seatbelt B (The "Less is More" Rule): This tells the AI, "Only use the tools you absolutely need. Don't look at extra data, and don't try to connect the dots."
- Analogy: It's like a strict librarian who says, "You can check out one book. You cannot check out three books and try to guess the plot of a fourth one."
- Result: The Winner for Safety. It stopped almost all leaks. But, it made the AI a bit slower and less helpful because it refused to do some complex tasks.
Seatbelt C (The Panel of Judges): Before the AI gives you an answer, it has to imagine three different people reviewing its work:
1. The Helper: "Did I answer the question?"
2. The Lawyer: "Did I break any privacy rules?"
3. The Paranoid Spy: "If I combine this with Google, can I find out the user's secrets?"
- Analogy: It's like a committee meeting. If anyone says "No," the answer gets rewritten.
- Result: The Best Balance. It kept the AI very helpful while stopping most leaks. It's the best "seatbelt" for everyday use.

The Big Takeaway

We are building AI agents that can use many tools at once. This makes them incredibly powerful, but it also makes them dangerous privacy spies by accident.

Just because an AI doesn't steal your data doesn't mean it's safe. If it can connect the dots from harmless pieces of information to reveal your deepest secrets, we have a problem.

This paper proves that current AI safety rules aren't enough. We need new rules that stop the AI from connecting the dots in the first place, or at least force it to double-check its own conclusions before sharing them.

Here is a detailed technical summary of the paper "Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation" by Qiao et al.

1. Problem Statement: Tools Orchestration Privacy Risk (TOP-R)

The paper identifies a critical, under-explored privacy threat in Large Language Model (LLM) based agents that utilize a single-agent, multi-tool architecture. While existing privacy research focuses on data memorization (training data extraction) or direct data leakage (single-tool failures), this work addresses Tools Orchestration Privacy Risk (TOP-R).

Definition: TOP-R occurs when an agent, executing a benign user instruction, autonomously aggregates individually non-sensitive fragments from multiple tool outputs and synthesizes them to reconstruct unexpected sensitive information.
The Mosaic Effect: The risk is "compositionally emergent." No single tool reveals the sensitive attribute; the leakage only materializes when the agent correlates data across sources (e.g., combining a bank transaction, a calendar event, and a contact list to infer a job interview with a competitor).
Leakage Types:
- Explicit Leakage: The agent verbalizes the sensitive conclusion in its final output.
- Implicit Leakage: The agent infers the sensitive attribute internally (within the reasoning trace or context window) but does not output it. This is more insidious as it evades output-level filters but persists in system logs, potentially triggering downstream profiling or targeted ads.
Root Cause: The tension between "helpfulness" (aggregating all available data to provide a complete answer) and "safety." Current alignment objectives incentivize the cross-source aggregation that triggers leakage.

2. Methodology

A. Formal Framework

The authors establish a formal definition of TOP-R based on three necessary conditions:

Conclusion Sensitivity (C1): The inferred attribute $S$ belongs to a regulatory-aligned taxonomy of sensitive data (e.g., health, financial, identity).
Single-Source Non-Inferability (C2): No single tool output $o_i$ combined with the instruction $I$ allows inference of $S$ .
Compositional Inferability (C3): The full trajectory of tool outputs $(o_1, \dots, o_N)$ combined with $I$ allows reliable inference of $S$ .

B. Benchmark Construction: TOP-Bench

To measure this risk, the authors constructed TOP-Bench, the first benchmark specifically for tool-orchestration privacy.

Generation Pipeline (RISE): They developed a Reverse Inference Seed Expansion (RISE) pipeline. Instead of generating random tool calls and checking for leaks, they start with a target sensitive conclusion (derived from laws like GDPR/HIPAA), decompose it into innocuous fragments, and then map these to real-world APIs.
Dataset Stats: 300 validated samples covering 5 privacy domains (Identity, Health, Finance, Behavior, Proprietary Info) and 5 inference paradigms (e.g., Quasi-Identifier Reassembly, Cross-Domain Correlation).
Diagnostic Subset: A secondary set of 100 samples includes Social Context Augmentation (SCA), injecting cues about social norms (e.g., "this report is for HR") to test if agents can recognize contextual violations.
Metric: The H-Score, a harmonic mean of Task Completion (TC) and Safety ($1 - \text{Overall Leakage Rate}$), quantifies the utility-safety trade-off.

C. Mitigation Strategies

Three strategies were proposed and tested, all implemented via system-prompt injection (no model retraining):

Contextual Integrity Enforcement (CIE): Audits information flows against social norms (Contextual Integrity theory) to block data transfers that violate context (e.g., sending medical data to a work scheduler).
Dual-Constraint Privacy Enhancement (DCPE): Imposes hard constraints at the reasoning stage:
- Data Minimization: Only call tools strictly necessary for the explicit intent.
- Anti-Mosaic Protocol: Explicitly forbids synthesizing non-sensitive fragments into sensitive conclusions.
Multi-Role Consensus Defense (MRCD): Simulates an internal "Ensemble Voting" with three roles (Pragmatist, Compliance Officer, Security Expert) to review drafts. A response is only released if all roles agree.

3. Key Results

A. Empirical Diagnosis (Baseline Performance)

Evaluated on six state-of-the-art LLMs (including GPT-5.2, Qwen3, DeepSeek-V3.2, etc.):

Pervasive Risk: The average Overall Leakage Rate (OLR) is 62.11%, with an average H-Score of only 52.90%.
Implicit Dominance: Implicit leakage (49.33%) significantly exceeds explicit leakage (30.95%), proving that agents infer sensitive data internally even when they don't output it.
Reasoning Depth: Leakage primarily occurs at L3 (Synthesis), where the agent combines fragments. Models rarely activate L4 (Reflection) spontaneously to check for privacy.
Three Root Causes Identified:
1. Deficient Spontaneous Privacy Awareness: Models have the capability to reason about privacy but fail to activate it without explicit triggers.
2. Reasoning Overshoot: Stronger reasoning capabilities (Chain-of-Thought) paradoxically increase leakage by making the synthesis of sensitive conclusions more efficient.
3. Inference Inertia: Once a reasoning path is established, models resist correction, even when social context cues are introduced.

B. Mitigation Effectiveness

DCPE (Dual-Constraint): The most effective for safety. It reduced the OLR by 37.00% (from 62.11% to 25.11%) and raised the H-Score to 79.20%. However, it incurred a 12.55% drop in task completion due to over-blocking.
MRCD (Multi-Role Consensus): Achieved the best balance. It raised the H-Score to 74.12% with only a 2.00% loss in task completion. It effectively filters risks while maintaining utility.
CIE (Contextual Integrity): Least effective on its own (H-Score +3.90) because it operates post-synthesis and cannot prevent the internal inference (implicit leakage) from occurring. However, it showed high efficacy (+18.92 H-Score gain) when combined with social context cues.

4. Significance and Contributions

New Risk Class: The paper formally defines and quantifies TOP-R, shifting the focus from "data retrieval" risks to "compositional inference" risks in autonomous agents.
Benchmark Infrastructure: TOP-Bench provides the first rigorous, automated framework to evaluate multi-tool privacy risks, addressing a gap where existing benchmarks only test single-tool or adversarial prompt injection scenarios.
Diagnostic Insights: The discovery that implicit leakage is the dominant threat and that stronger reasoning models are more prone to leakage challenges the assumption that better reasoning equals better safety.
Practical Mitigation: The proposed strategies (especially DCPE and MRCD) offer immediate, deployable solutions for developers to secure agent pipelines without retraining models. The findings suggest that hard constraints on reasoning (DCPE) or dynamic consensus (MRCD) are necessary to counter the "helpfulness" drive of LLMs.

In conclusion, the paper demonstrates that the very capability making agents powerful—their ability to orchestrate and synthesize information from multiple sources—is their greatest privacy vulnerability. It provides a roadmap for measuring and mitigating this "Mosaic Effect" in the next generation of AI agents.