Designing Service Systems from Textual Evidence

This paper introduces PP-LUCB, a cost-efficient algorithm that combines biased LLM-generated proxy scores with selective human audits. It identifies the best service-system configuration with statistically valid confidence guarantees while significantly reducing audit costs.

Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi

Published Thu, 12 Ma

Imagine you are the manager of a massive, busy call center. You have five different ways to handle customer complaints (different scripts, different routing rules, different AI tools). Your goal is to pick the one best method that makes customers happiest.

In the old days, you would just look at a simple number: "How many calls were resolved?" But in the modern world, the proof of quality isn't a number; it's text. It's the chat logs, the angry emails, the detailed complaint stories. Reading thousands of these stories to find the best method is impossible for humans—it would take forever and cost a fortune.

Enter AI (Large Language Models). You can ask an AI to read these stories and give each method a score. It's fast and cheap! But here's the catch: The AI is biased. It might love long, wordy answers even if they are wrong, or it might hate short, direct answers even if they are perfect. If you just trust the AI, you might pick the worst method because the AI "liked" it.

So, you need Human Experts to double-check. But humans are slow and expensive. You can't ask them to read every single story.

This paper solves a puzzle: How do you find the best method using mostly cheap, biased AI scores, but only paying for expensive human checks when absolutely necessary?

Here is the solution, explained through a simple analogy.

The Analogy: The "Smart Scout" and the "Strict Judge"

Imagine you are a general trying to find the best route for your army.

  • The AI (The Scout): Runs ahead and looks at the terrain. It's super fast and cheap. But the Scout is a bit crazy; sometimes it thinks a muddy swamp is a highway because it likes the color blue.
  • The Human (The Judge): Is slow, expensive, and always tells the truth.
  • The Problem: If you only listen to the Scout, you might march your army into a swamp. If you ask the Judge to check every single path, you run out of money before you find the best route.

The Paper's Solution (PP-LUCB):
The authors created a smart system that acts like a Super-Strategist. Here is how it works:

  1. The Scout Runs First: The system asks the AI (Scout) to evaluate every single path. It gets a score for everything.
  2. The "Bias Detector": The system knows the Scout is crazy. It doesn't just trust the score. It starts asking the Judge (Human) to check a few paths to see how crazy the Scout is.
    • Example: The system notices, "Hey, every time the Scout sees a 'blue' path, it gives it a 10/10, but the Judge says it's a 2/10."
  3. The "Smart Audit" (The Magic Trick): This is the most important part. The system does not ask the Judge to check random paths.
    • If the Scout says a path is "perfectly clear" (and the system is confident the Scout is right), it skips the human check.
    • If the Scout is confused or if the path looks "weird" (where the Scout's bias might be strongest), the system immediately sends the Judge to check it.
    • Think of it like a security guard: You don't stop every single person walking into a building. You only stop the people who look suspicious or are acting weird. The "suspicious" ones are the ones where the AI's judgment is unreliable.
  4. The "Correction": Once the Judge checks a few "suspicious" paths, the system uses math to fix all the other scores. It says, "Okay, the Scout is usually 2 points too high on blue paths, so let's subtract 2 from all the blue paths."
  5. The Winner: The system keeps doing this—checking the AI, asking humans only when confused, and correcting the scores—until it is 99% sure which route is the best.
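The five steps above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the paper's actual PP-LUCB: the arm setup, the 3-sigma confidence radius, and the audit rule (always audit the current leader and its closest rival) are simplifications chosen to make the loop readable.

```python
import random

def find_best_arm(true_means, bias, n_items=400, noise=0.4, seed=1):
    """Toy Scout/Judge loop: pick the best arm from cheap biased proxy
    scores plus a small number of corrective human audits."""
    rng = random.Random(seed)
    k = len(true_means)
    # Ground-truth quality of each transcript, and the Scout's biased view of it.
    truth = [[true_means[i] + rng.gauss(0, noise) for _ in range(n_items)]
             for i in range(k)]
    proxy = [[t + bias[i] + rng.gauss(0, noise) for t in truth[i]]
             for i in range(k)]
    audited = [[] for _ in range(k)]    # (human_label, proxy_score) pairs
    audits_used = 0

    def audit(i):
        nonlocal audits_used
        j = rng.randrange(n_items)
        audited[i].append((truth[i][j], proxy[i][j]))
        audits_used += 1

    def estimate(i):
        # Step 4, the "Correction": take the cheap proxy average, then shift
        # it by the average human-minus-proxy gap seen on the audited subset.
        proxy_avg = sum(proxy[i]) / n_items
        gap = sum(h - p for h, p in audited[i]) / len(audited[i])
        radius = 3.0 * noise / len(audited[i]) ** 0.5   # crude confidence radius
        return proxy_avg + gap, radius

    for i in range(k):                  # one audit per arm so every radius is finite
        audit(i)

    for _ in range(10 * n_items):
        ests = [estimate(i) for i in range(k)]
        leader = max(range(k), key=lambda i: ests[i][0])
        rival = max((i for i in range(k) if i != leader),
                    key=lambda i: ests[i][0] + ests[i][1])
        if ests[leader][0] - ests[leader][1] > ests[rival][0] + ests[rival][1]:
            break                       # leader's lower bound beats every rival
        audit(leader)                   # Step 3, the "Smart Audit": spend human
        audit(rival)                    # checks only where they can change the answer
    return leader, audits_used
```

With a bias that makes the Scout love the wrong path (arm 0 looks like a 7 but is really a 5, while arm 1 looks like a 5 but is really a 6), the loop recovers the true winner, arm 1, after auditing only a handful of transcripts rather than all of them:

```python
best, cost = find_best_arm(true_means=[5.0, 6.0, 4.0], bias=[2.0, -1.0, 0.0])
# best == 1, and cost is far below the 1,200 audits a full review would take
```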

Why is this a Big Deal?

  • It Saves Money: In their tests, this method found the best solution while cutting human review costs by 90%. They did almost all the work with the cheap AI and only used humans for the critical "tie-breakers."
  • It's Safe: Even though the AI is biased, the math guarantees that the final decision is correct at whatever confidence level you asked for (say, 99%). It's like having a safety net that catches you if the AI falls off a cliff.
  • It Handles Delays: Sometimes, the human Judge takes a day to reply. The system is smart enough to keep working and making decisions even while waiting for the Judge's answer, without getting confused.

The Bottom Line

This paper teaches us how to use AI as the first draft and humans as the final editor, but with a twist: we only hire the editor when the AI is clearly struggling.

Instead of paying a human to read 1,000 stories to find the best one, this method pays a human to read maybe 100 stories, uses math to fix the AI's mistakes on the other 900, and still finds the winner with a rigorous statistical guarantee. It's the ultimate "work smarter, not harder" strategy for the age of AI.
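That 100-out-of-1,000 correction is simple arithmetic, and a toy version fits in a few lines of Python. The function name and numbers here are illustrative; the paper's estimator is the statistically rigorous version of this idea.

```python
def corrected_score(proxy_scores, audited_pairs):
    """Debiased average: trust the cheap proxy mean over all stories, then
    shift it by the average human-minus-proxy gap on the audited subset."""
    proxy_avg = sum(proxy_scores) / len(proxy_scores)
    gap = sum(human - proxy for human, proxy in audited_pairs) / len(audited_pairs)
    return proxy_avg + gap

# The AI scores 1,000 stories at 7.0 on average; on the 100 stories a human
# also read, the AI ran 2 points high, so the corrected estimate is 5.0.
print(corrected_score([7.0] * 1000, [(5.0, 7.0)] * 100))   # → 5.0
```

The human never reads the other 900 stories; their 100 reads are only used to measure how far off the AI tends to be, and that measured gap is applied everywhere.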