AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

Imagine you are a chef about to launch a new, spicy dish at your restaurant. Before you serve it to hundreds of real customers, you want to know: Will they like it? Will they order it? Or will they send it back?

In the digital world, this is called A/B Testing. Companies like Amazon, Netflix, and Microsoft constantly test two versions of a website (Version A vs. Version B) to see which one works better.

But here's the problem: Real A/B testing is slow, expensive, and risky.

Slow: You have to wait for real people to visit your site. If your site is new or niche, you might wait weeks for enough data.
Expensive: You need thousands of real humans to click around, which costs money and engineering time.
Risky: If the new design is terrible, you might annoy real customers before you even realize the mistake.

Enter "Agent A/B": The Digital Twin Restaurant

This paper introduces a new system called Agent A/B. Think of it as hiring a thousand digital twins (AI robots) to act as your customers before you open the doors to real people.

Here is how it works, using simple analogies:

1. The "Method Acting" Robots

The system doesn't just use random bots. It creates 1,000 unique AI agents, each with a specific "persona."

One agent is Marcus, a 35-year-old freelance graphic designer who loves tech gadgets and is budget-conscious.
Another is Sarah, a 60-year-old retiree who wants to buy a simple, easy-to-use blender.
Another is Leo, a 20-year-old student looking for the cheapest sneakers.

These agents aren't just clicking randomly. They have memories, goals, and personalities. They are "method actors" playing the role of real shoppers.

2. The "Parallel Universe" Simulation

The researchers set up two identical "parallel universes" of a website (specifically Amazon.com).

Universe A (Control): The website looks exactly as it does today, with a long, overwhelming list of filter options on the side.
Universe B (Treatment): The website has a new design where the filter list is shorter and smarter, showing only the most relevant options.

They then release their 1,000 AI agents into these universes. 500 agents go to Universe A, and 500 go to Universe B. The agents go about their "shopping day," searching for items, clicking filters, and trying to buy things.

3. The "Speed Run" vs. The "Slow Cook"

In a traditional test, you might wait three months to get enough real human data to make a decision.
With Agent A/B, you can run this entire experiment in hours. The AI agents work through thousands of shopping trips in parallel instead of waiting for visitors to trickle in.

The Result?
The researchers found that the agents in the "Short Filter" universe (Universe B) actually bought more items than those in the "Long Filter" universe.

The Magic: When they compared the AI results to a real experiment they ran with 2 million actual humans on Amazon, the AI predictions were directionally correct. The AI got the "vibe" right: the shorter list was better.

Why This is a Game Changer

Think of Agent A/B as a flight simulator for website designers.

Before: A pilot (designer) had to fly a real plane (launch a website) to see if the new engine (feature) worked. If it failed, the plane might crash, and passengers (customers) would be unhappy.
Now: The pilot can fly the plane in a simulator with 1,000 virtual passengers. If the engine fails in the sim, they fix it instantly. No real passengers are ever put at risk.

The Bottom Line

This system isn't trying to replace real humans. Real humans are still the ultimate judges. Instead, Agent A/B is a safety net and a fast-forward button.

It allows companies to:

Test ideas early without waiting for real traffic.
Save money by catching bad designs before they go live.
Check for fairness by seeing how different "personas" (like older adults or tech novices) react to a design, ensuring the new feature works for everyone, not just the average user.

In short, Agent A/B lets you crash your website in a virtual world so you never have to crash it in the real one.

Here is a detailed technical summary of the paper "AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents."

1. Problem Statement

Traditional A/B testing (online controlled experimentation) is the industry standard for validating UI/UX designs but suffers from significant bottlenecks:

Traffic Scarcity & Competition: Securing sufficient user traffic for statistically significant results is difficult, and experiments often compete for limited user segments, forcing serialization.
High Costs & Long Cycles: The process involves substantial engineering overhead, long development times, and slow feedback loops (often spanning months), delaying iteration.
Risk in Early Stages: Promising design ideas often go untested until late in the cycle because early-stage prototyping lacks rigorous, behaviorally grounded evaluation.
Limitations of Prior Simulation: Existing user behavior simulations (e.g., GOMS, ACT-R, or single-session LLM agents) are often labor-intensive, require domain expertise, or are confined to static/simulated environments rather than live, dynamic websites.

Goal: To develop a system that enables scalable, automated, and low-risk A/B testing on live websites using Large Language Model (LLM) agents before allocating real user traffic.

2. Methodology: Agent A/B System

The authors propose Agent A/B, an end-to-end system that deploys persona-driven LLM agents to interact with live webpages. The system is designed to be interoperable with existing agent stacks (e.g., ReAct, Claude Computer Use).

System Architecture & Pipeline

The system operates through four coordinated modules:

LLM Agent Generation:
- Generates a diverse population of agents based on user-specified demographic distributions (age, gender, income) and behavioral tendencies.
- Uses a "persona pool" approach where an LLM iteratively samples existing personas and demographic attributes to create new, stylistically consistent but diverse personas.
Testing Preparation:
- Assigns agents to Control (existing design) and Treatment (new design) groups.
- Ensures statistical balance across key persona attributes in both groups to minimize distributional skew.
Autonomous A/B Simulation:
- Interaction Loop: Agents interact with live web variants in isolated browser sessions using a Perceive–Decide–Act loop.
- Environment Parsing: A lightweight module extracts structured JSON representations of the webpage (e.g., product details, filters, prices) and defines the action space, filtering out visual noise and irrelevant HTML.
- LLM Agent: The agent reasons based on its persona, current task intention, and the structured observation to select the next action (e.g., Search, Click_Filter, Purchase, Stop).
- Action Execution: Translates LLM decisions into browser-level commands (via Selenium/ChromeDriver) with fault handling (retries, re-parsing) for dynamic page changes.
Post-Testing Analysis:
- Aggregates fine-grained interaction traces (actions, timestamps, page states).
- Computes metrics: completion rates, purchase rates, session duration, and spending.
- Supports stratified analysis by persona attributes to identify subgroup differences.
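
The four modules above can be sketched end to end in a few dozen lines. This is a minimal illustration under loud assumptions, not the authors' implementation: the `Agent` fields, the `budget` trait, the rule-based `decide` function (standing in for the LLM call), and the in-memory `page` dictionary (standing in for a live browser session) are all hypothetical. It only shows how sampling personas, balancing the two arms, running a Perceive–Decide–Act loop, and aggregating traces fit together.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: int
    age_group: str  # persona attribute used for balancing
    budget: float   # hypothetical behavioral trait, not from the paper


def generate_agents(n, age_weights, seed=0):
    """Agent generation: sample persona attributes from a given distribution."""
    rng = random.Random(seed)
    groups, weights = zip(*age_weights.items())
    return [
        Agent(i, rng.choices(groups, weights=weights)[0], rng.uniform(20, 100))
        for i in range(n)
    ]


def balanced_assign(agents, seed=0):
    """Testing preparation: split every age stratum evenly across both arms."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for agent in agents:
        strata[agent.age_group].append(agent)
    control, treatment = [], []
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        control.extend(members[:half])
        treatment.extend(members[half:])
    return control, treatment


def perceive(page):
    """Environment parsing: expose only the structured fields the agent needs."""
    return {"products": page["products"], "searched": page["searched"]}


def decide(agent, obs):
    """Stand-in for the LLM: search first, then buy the cheapest affordable item."""
    if not obs["searched"]:
        return {"type": "Search"}
    affordable = [p for p in obs["products"] if p["price"] <= agent.budget]
    if affordable:
        best = min(affordable, key=lambda p: p["price"])
        return {"type": "Purchase", "product": best["name"]}
    return {"type": "Stop"}


def act(page, action):
    """Action execution: a real system would issue browser commands here."""
    if action["type"] == "Search":
        page["products"] = list(page["catalog"])
        page["searched"] = True


def run_session(agent, catalog, max_steps=10):
    """One agent session as a Perceive-Decide-Act loop; returns the action trace."""
    page = {"catalog": catalog, "products": [], "searched": False}
    trace = []
    for _ in range(max_steps):
        action = decide(agent, perceive(page))
        trace.append(action["type"])
        if action["type"] in ("Purchase", "Stop"):
            break
        act(page, action)
    return trace


def purchase_rate(agents, catalog):
    """Post-testing analysis: fraction of sessions that end in a purchase."""
    traces = [run_session(agent, catalog) for agent in agents]
    return sum(trace[-1] == "Purchase" for trace in traces) / len(traces)


agents = generate_agents(1000, {"18-34": 0.4, "35-54": 0.4, "55+": 0.2})
control, treatment = balanced_assign(agents)
variant_a = [{"name": "blender-basic", "price": 35.0}]            # made-up control page
variant_b = variant_a + [{"name": "blender-mini", "price": 25.0}]  # made-up treatment page
print(len(control), len(treatment))
print(purchase_rate(control, variant_a), purchase_rate(treatment, variant_b))
```

Splitting within each stratum, rather than shuffling the whole population once, is one simple way to keep a persona attribute balanced across arms; the summary does not say which balancing procedure the system actually uses.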

Case Study Implementation

Platform: Amazon.com.
Scenario: A/B testing a redesigned left-side filter panel.
- Control: Full filter list.
- Treatment: Reduced list using a similarity-based ranking algorithm (hiding options with <80% similarity to the query).
Scale: 1,000 LLM agents (500 per condition) running in parallel, compared against a parallel human A/B test with 2 million users.
Infrastructure: Distributed cluster of 16 nodes using headless Chrome instances; LLM backend: Claude 3.5 Sonnet.
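
The treatment condition's reduced filter list can be pictured as a threshold over similarity scores. The sketch below assumes each filter option already carries a similarity score to the query in the range 0 to 1; how that score is computed is not described in this summary, so the scores, option names, and the `reduce_filters` helper are hypothetical.

```python
def reduce_filters(scored_options, threshold=0.8):
    """Keep options whose similarity to the query is at least `threshold`,
    ordered from most to least similar."""
    kept = [(name, score) for name, score in scored_options if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


# Made-up filter options with made-up similarity scores.
options = [
    ("Color: Red", 0.42),
    ("Capacity: 1.5 L", 0.85),
    ("Brand: Acme", 0.91),
    ("Seller: Misc", 0.79),
]
print(reduce_filters(options))  # ['Brand: Acme', 'Capacity: 1.5 L']
```

Options scoring below 0.8 are hidden, which matches the "<80% similarity" rule described for the treatment design.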

3. Key Contributions

Agent A/B System: A novel, end-to-end framework for scalable, persona-driven A/B testing on live websites using interactive LLM agents.
Empirical Validation: Evidence from an Amazon case study demonstrating that LLM agent simulations align directionally with large-scale human A/B test results.
Design Implications: A framework for using agent-based simulation to support early prototyping, pre-deployment validation, and hypothesis-driven UX evaluation, particularly for underrepresented user groups.

4. Key Results

Alignment with Human Behavior:
- While LLM agents exhibited more goal-directed behavior (fewer exploratory actions) than humans, they achieved comparable purchase rates and similar filter usage patterns.
- The system successfully detected the same directional outcome as the human experiment: the reduced filter list led to more purchases.
Statistical Significance:
- In the agent simulation, the treatment group (reduced filters) resulted in significantly more purchases than the control group (414 vs. 403; $\chi^2(1) = 5.51, p < 0.05$).
- Average spending showed an upward trend in the treatment group ($60.99 vs. $55.14), though not statistically significant in the agent sample.
Subgroup Detection:
- The system successfully identified heterogeneous responses across personas. For instance, agents representing older and male customers showed larger spending increases under the simplified design, while younger agents showed decreased spending. These patterns mirrored trends found in the human data.
Cost Efficiency:
- Simulating 1,000 agents cost approximately $2,925 (token costs) and generated 35 kg of CO2.
- Recruiting 1,000 human participants for a similar study would cost approximately $100,000.
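
The cost comparison above is easier to read per participant. The short calculation below uses only the two totals quoted in this summary; the variable names are illustrative, and the ratio compares token costs with recruitment costs, so it is a rough guide rather than a full cost model.

```python
AGENT_TOTAL_USD = 2_925    # token cost for 1,000 simulated agents (from the summary)
HUMAN_TOTAL_USD = 100_000  # estimated cost of recruiting 1,000 humans (from the summary)
PARTICIPANTS = 1_000

per_agent = AGENT_TOTAL_USD / PARTICIPANTS
per_human = HUMAN_TOTAL_USD / PARTICIPANTS
cost_ratio = HUMAN_TOTAL_USD / AGENT_TOTAL_USD

print(f"per agent: ${per_agent:.3f}")
print(f"per human: ${per_human:.2f}")
print(f"humans cost about {cost_ratio:.1f}x as much")
```

On these figures, a simulated session costs a little under $3 against roughly $100 per recruited participant.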

5. Significance and Future Outlook

Complementary Tool: Agent A/B is positioned not as a replacement for human testing, but as a complementary layer in the design lifecycle. It allows for "risk-free" piloting, enabling designers to iterate rapidly and filter out poor designs before consuming real user traffic.
Inclusive Piloting: The system allows for the simulation of specific, hard-to-recruit demographics (e.g., older adults, low digital literacy users) to ensure designs do not disproportionately harm these groups.
Scalability: It enables the evaluation of thousands of design variants and user segments simultaneously, addressing the "traffic scarcity" bottleneck of traditional A/B testing.
Future Work: The authors envision enhancing agent fidelity, broadening domain coverage beyond e-commerce, and integrating these simulations into intelligent design optimization workflows.

In summary, Agent A/B bridges the gap between theoretical user simulation and practical, large-scale web experimentation, offering a faster, cheaper, and safer method to validate UI/UX decisions prior to live deployment.