Imagine you hire a super-smart, hyper-efficient personal assistant named "Agent." You tell this Agent, "Please help me organize my schedule and email my boss about my sick day."
In the past, we only worried about the final letter the Agent wrote. We asked: Did the final email accidentally reveal my bank password or my medical diagnosis? If the final letter looked clean, we thought, "Great, privacy is safe!"
This paper argues that we are looking at the wrong thing.
The authors say that privacy isn't just about the final letter; it's about the entire journey the information takes. They make this case with a new evaluation framework called AgentSCOPE.
Here is the breakdown using simple analogies:
1. The Problem: The "Hidden Middle"
Think of your Agent as a courier service.
- The Old Way: We only checked the package when it arrived at the recipient's house. If the package was sealed and looked normal, we assumed everything was fine.
- The New Reality: The Agent doesn't just write a letter. It goes into your digital house, opens your calendar, reads your emails, queries your contacts app for phone numbers, and only then writes the letter.
- The Danger: Even if the final letter is perfect, the Agent might have:
- Read your diary while looking for your calendar (Over-reading).
- Asked your calendar app for every appointment, including your sensitive fertility treatment, just to find one meeting (Over-asking).
- Let the calendar app dump a whole bunch of private data into its brain before it even started writing (Over-receiving).
The paper says: Just because the final result is clean doesn't mean the Agent didn't snoop around your house in the meantime.
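To pin that taxonomy down, here is a minimal sketch of the three failure modes as a Python enum. The names and comments are my paraphrase of the analogies above, not the paper's own code:

```python
from enum import Enum, auto


class Violation(Enum):
    """Three ways private data can leak mid-task, even when the
    final output looks clean (names paraphrase the analogies above)."""
    OVER_READING = auto()    # the agent reads sources the task never required
    OVER_ASKING = auto()     # the agent's request to a tool is broader than needed
    OVER_RECEIVING = auto()  # a tool returns more data than the request called for


# Example: the calendar returning every appointment for a one-meeting
# lookup would be tagged as Violation.OVER_RECEIVING.
```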
2. The Solution: The "Privacy Flow Graph" (The Detective's Map)
To fix this, the authors created a tool called the Privacy Flow Graph.
Imagine a detective's corkboard with red string connecting different events.
- The Nodes: The User (You), the Agent, the Tools (Calendar, Email), and the Recipient.
- The Strings: Every time a piece of information moves from one node to another, that flow gets a string, tagged with what moved and why.
- The Rule: They use an established privacy concept called "Contextual Integrity": information should only flow in ways that match the norms of the context it was shared in. Think of it as a bouncer at a club.
- Scenario: You tell the Agent, "Tell my boss I'm sick."
- The Bouncer's Job: When the Agent asks the Calendar, "What meetings do I have?", the Bouncer checks: "Is it okay for the Calendar to tell the Agent about your fertility consultation?"
- The Answer: No! That data is irrelevant to the task. Even if the Agent doesn't put it in the final email, the fact that the Calendar gave it to the Agent is a privacy violation.
The Privacy Flow Graph traces every single step to see if the "Bouncer" let the wrong data through, even if that data never made it to the final output.
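Here is one way the detective's-map idea might look in code: a graph that records every information hop and audits all of them against a task-specific allow-list. The allow-list is a crude stand-in for a real contextual-integrity judgment, and every name here is an illustrative assumption rather than the authors' implementation:

```python
from dataclasses import dataclass, field


@dataclass
class FlowEvent:
    sender: str      # node the data left, e.g. "calendar_tool"
    receiver: str    # node the data reached, e.g. "agent"
    data_item: str   # the piece of information that moved


@dataclass
class PrivacyFlowGraph:
    """Record every information hop in a task, then audit all of them.

    `allowed` is a toy stand-in for the contextual-integrity "bouncer":
    the set of data items appropriate to share for this particular task.
    """
    task: str
    allowed: set[str]
    events: list[FlowEvent] = field(default_factory=list)

    def record(self, sender: str, receiver: str, data_item: str) -> None:
        self.events.append(FlowEvent(sender, receiver, data_item))

    def audit(self) -> list[FlowEvent]:
        """Return every hop the bouncer should have blocked,
        including hops that never reached the final output."""
        return [e for e in self.events if e.data_item not in self.allowed]


# The sick-day scenario: only the sick-day message should flow.
graph = PrivacyFlowGraph(
    task="Tell my boss I'm sick",
    allowed={"sick today, out of office"},
)
graph.record("user", "agent", "sick today, out of office")        # fine
graph.record("calendar_tool", "agent", "fertility consultation")  # violation
graph.record("agent", "boss", "sick today, out of office")        # final email is clean

for bad in graph.audit():
    print(f"VIOLATION: {bad.sender} -> {bad.receiver} carried '{bad.data_item}'")
# An output-only check would pass this run; the flow audit does not.
```

A real contextual-integrity check would reason about norms, roles, and the sender/receiver pair rather than a literal allow-list, but the shape is the same: the audit covers every edge of the graph, not just the last one.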
3. The Experiment: AgentSCOPE
The authors built a test called AgentSCOPE.
- They created a fictional character named Emma.
- They gave her Agent access to her email, calendar, and files, filling them with a mix of boring stuff (meeting times) and sensitive stuff (medical records, legal issues).
- They asked the Agent to do 62 different tasks (like "Email my manager that I'm sick" or "Find my flight details").
- They tested 7 of the smartest AI models available (like GPT-4o and Claude).
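To make the setup concrete, a single benchmark task could be sketched like this. The schema and field names are invented for illustration; the paper's actual task format is not specified here:

```python
from dataclasses import dataclass


@dataclass
class BenchmarkTask:
    """One test case: an instruction plus the data planted in the
    persona's accounts, split by whether it may legitimately flow."""
    instruction: str
    relevant_data: list[str]   # the Agent genuinely needs these
    sensitive_data: list[str]  # planted secrets; any hop involving them is a violation


task = BenchmarkTask(
    instruction="Email my manager that I'm taking a sick day",
    relevant_data=["manager's email address", "today's date"],
    sensitive_data=["medical records", "pending legal dispute"],
)
```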
4. The Shocking Results
The results were a wake-up call:
- The "Clean" Illusion: When they only looked at the final email, the AI models seemed pretty good. About 76% to 80% of the time, the final email didn't leak secrets.
- The "Messy" Reality: When they looked at the whole journey (using their Privacy Flow Graph), they found that 80% to 94% of the tasks involved privacy violations somewhere along the way.
- Where did the leaks happen?
- The Tools: Often, the tools (like the Calendar app) were too helpful. They handed the Agent far more information than it needed, sensitive details included (the Over-receiving problem from earlier).
- The Agent (and the User): Sometimes the Agent asked for too much data (Over-asking), and sometimes the user volunteered too much in the very first prompt.
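It is worth seeing why both headline numbers can be true at once. The two views count differently: output-only checking scores each task by its final answer, while the flow-level view flags a task if any hop along the way violated. A toy calculation with made-up per-task flags:

```python
# Per-task audit flags (invented numbers, NOT the paper's data).
# output_leaked: did the final answer expose a secret?
# flow_violation: did ANY intermediate hop expose one?
results = [
    {"output_leaked": False, "flow_violation": True},   # clean letter, messy journey
    {"output_leaked": False, "flow_violation": True},
    {"output_leaked": True,  "flow_violation": True},
    {"output_leaked": False, "flow_violation": False},
]

n = len(results)
output_clean_rate = sum(not r["output_leaked"] for r in results) / n
flow_violation_rate = sum(r["flow_violation"] for r in results) / n

print(f"Output-only view:   {output_clean_rate:.0%} of tasks look safe")
print(f"Whole-journey view: {flow_violation_rate:.0%} of tasks had a violation")
```

Because a single bad hop anywhere in the pipeline flags the whole task, the violation rate can sit far above the output-only leak rate, which is exactly the 76-80% vs. 80-94% gap the authors report.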
The Big Takeaway
"Output-only" evaluation is like checking a car only for scratches on the bumper. You might miss the fact that the engine is smoking, the brakes are failing, or the driver is speeding.
The paper concludes that we cannot just trust the final answer. We need to monitor the entire pipeline. If an AI system is going to handle our personal lives, we need to ensure that:
- It doesn't ask for data it doesn't need.
- The tools it uses don't dump private data into its lap.
- It doesn't hold onto sensitive info just because it can.
In short: Just because the Agent delivered the package safely doesn't mean it didn't snoop through your mail while sorting it. We need to watch the whole process, not just the end result.