Imagine you are trying to return a defective toaster to a store. You know you have the right to do it, but the store has hidden the return policy in a tiny font, made you walk through three different departments to find the form, and then asked you to fill out a 50-page questionnaire just to prove you own the toaster.
This is what Dark Patterns are: manipulative website designs that trick, confuse, or bully you into doing something you didn't want to do, or stop you from doing something you have a legal right to do.
For a long time, finding these tricks has been like looking for a needle in a haystack. Researchers had to manually visit thousands of websites, click through every button, and take notes. It was slow, expensive, and hard to repeat.
The Big Question:
Can we build a "digital robot" (an AI agent) to do this detective work for us? Can an AI navigate these tricky websites, spot the tricks, and write a report, just like a human would?
This paper says: Yes, but with some important caveats.
Here is the breakdown of their experiment, explained with some everyday analogies.
1. The Test Drive: The "Toaster Return" Simulation
The researchers chose a specific, high-stakes scenario to test their robot: Data Broker Websites.
- The Context: Under California law (CCPA), you have the right to ask these companies to delete your personal data.
- The Problem: These companies often make it incredibly hard to find the "Delete My Data" button.
- The Mission: They built an AI agent (a robot browser) and sent it to 456 different data broker websites. The robot's job was to act like a human: find the "Right to Access" page, click through the forms, and identify whether the site used "Dark Patterns" to make the process painful. (A rough sketch of what such an agent loop might look like follows below.)
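To make the "robot browser" idea concrete, here is a minimal sketch of an audit loop of this kind. It assumes Playwright for browser automation; `ask_llm()` is a hypothetical stand-in for whatever language-model backend the agent uses, and this is not the paper's actual code.

```python
# Minimal sketch of a browser-agent audit loop (not the paper's actual code).
# Assumes Playwright for browser automation; ask_llm() is a hypothetical
# helper standing in for whatever LLM backend the agent uses.
from playwright.sync_api import sync_playwright


def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's API."""
    raise NotImplementedError


def audit_site(url: str, max_steps: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for _ in range(max_steps):
            # Show the agent the current page and ask for its next action.
            decision = ask_llm(
                "You are auditing a data broker site for dark patterns.\n"
                "Goal: find the 'Right to Access' / data-deletion flow.\n"
                f"Current page text:\n{page.inner_text('body')[:4000]}\n"
                "Reply with either CLICK:<link text> to navigate, "
                "or REPORT:<your findings> when done."
            )
            if decision.startswith("REPORT:"):
                browser.close()
                return decision[len("REPORT:"):]
            if decision.startswith("CLICK:"):
                # Follow the link the model asked for, if it exists.
                link_text = decision[len("CLICK:"):].strip()
                page.get_by_text(link_text).first.click()
                page.wait_for_load_state()
        browser.close()
        return "No report produced within the step budget."
```

The key design point is the loop: the agent repeatedly looks at the current page, decides on one action, and only stops when it can write a report (or runs out of steps).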
2. Teaching the Robot: The "Training Manual" Analogy
They didn't just tell the robot, "Go find bad designs." They tried different ways to teach it, like training a new employee:
- Level 1 (The Blank Slate): They just gave the robot the instructions. Result: It got confused and made mistakes.
- Level 2 (The Role Play): They told the robot, "You are a strict privacy auditor." Result: It became too sensitive, flagging normal things as bad (like crying wolf).
- Level 3 (The Example Book): They gave the robot a book of real examples showing exactly what a "bad design" looks like. Result: Much better! It learned to distinguish between a real trick and a normal button.
- Level 4 (The Step-by-Step): They added a rule: "Before you decide it's a trick, write down your reasoning step by step." Result: the best performance. Combining the example book with this think-aloud rule made the robot the most accurate (see the prompt sketch after this list).
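Here is a minimal sketch of what the "example book + think step-by-step" prompt style (Level 3 + Level 4) can look like in practice. The example texts below are invented for illustration; the paper's real few-shot examples and exact wording will differ.

```python
# Sketch of a few-shot + step-by-step prompt builder (illustrative only).
# The labeled examples are made up; they are not taken from the paper.
FEW_SHOT_EXAMPLES = [
    {
        "observation": "The 'Delete my data' link is light grey, tiny, and buried in the footer.",
        "label": "Dark pattern (the option is deliberately hard to see).",
    },
    {
        "observation": "The site asks for an email address to send a confirmation link.",
        "label": "Not a dark pattern (a single verification step is a reasonable safeguard).",
    },
]


def build_prompt(page_observation: str) -> str:
    lines = [
        "You are auditing a website for dark patterns in its privacy-rights flow.",
        "Here are labeled examples of what counts and what does not:",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"- Observation: {ex['observation']}")
        lines.append(f"  Judgment: {ex['label']}")
    lines.append(f"Now consider this page: {page_observation}")
    lines.append(
        "Before you decide, write out your reasoning step by step, "
        "then give a final judgment: 'dark pattern' or 'not a dark pattern'."
    )
    return "\n".join(lines)
```

The examples teach the model where the line between "trick" and "normal button" sits, and the step-by-step rule forces it to justify a verdict before giving one, which is what reduced the false alarms from Level 2.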
3. The Results: What Did the Robot Find?
When they let the best version of the robot loose on the remaining 356 websites, here is what happened:
- Success Rate: The robot successfully completed the "mission" on about 80% of the websites. It could navigate the maze, find the forms, and spot the tricks.
- The Most Common Trick: The most frequent dark pattern found was "Creating Barriers."
- Analogy: Imagine trying to return a toaster, and the store suddenly says, "Oh, you can't return it unless you also buy a warranty and fill out a survey about your favorite color."
- The robot found that about half of the websites forced users to do unnecessary, annoying things just to exercise their rights.
- The "Hidden" Tricks: The robot was great at spotting obvious tricks (like a button that is too small to click). But it struggled with "Privacy Mazes."
- Analogy: If a store hides the return policy in three different rooms and you have to remember what you saw in Room 1 to understand Room 3, the robot sometimes forgot the details. It got lost in the long, winding path.
4. Where the Robot Stumbles (The Limitations)
The robot isn't perfect yet. It failed in three main ways:
- The "Security Guard" Problem: Many websites have CAPTCHAs (those "click all the traffic lights" puzzles) or bot-detection systems. The robot, being a bot, got stopped at the door. It couldn't trick the security guard, so it couldn't get inside to audit the store.
- The "Memory" Problem: If a website makes you click through 10 different pages to find the answer, the robot sometimes forgot what it saw on Page 1 by the time it got to Page 10. It's like trying to solve a puzzle while someone keeps erasing the pieces you've already placed.
- The "Judgment" Problem: Sometimes, a website asks for your ID. Is that a security measure (good) or a trick to stop you (bad)? The robot struggled to tell the difference. It needs a human to help decide if a rule is "reasonable" or "too harsh."
The Bottom Line
Can AI audit dark patterns?
Yes. It is a powerful tool that can scan hundreds of websites in the time it takes a human to scan one. It is excellent at spotting obvious tricks and gathering evidence.
Should we trust it 100%?
No. It still gets stuck on security walls, forgets long stories, and sometimes can't tell the difference between a security guard and a bully.
The Future:
Think of this AI not as a replacement for human auditors, but as a super-efficient intern. It can do the boring, repetitive work of visiting thousands of sites and flagging the obvious problems. Then, a human expert can step in to review the tricky cases, make the final judgment calls, and ensure justice is served.
This paper proves that while we aren't quite ready to let the robots run the show alone, they are ready to help us clean up the internet, one tricky website at a time.