SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

The paper proposes SAGE, a user simulation framework for evaluating multi-turn agents. By integrating top-down business logic with bottom-up infrastructure knowledge, SAGE generates realistic, diverse interactions that surface significantly more agent errors than existing methods.

Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu

Published Thu, 12 Ma

Imagine you are building a new, high-tech robot barista. Before you open your shop to the real world, you need to test it. You want to make sure it can handle rude customers, complex orders, and weird questions without crashing or spilling coffee everywhere.

In the past, developers had two main ways to test their agents (AI chatbots):

  1. Hire real humans: Expensive, slow, and hard to scale.
  2. Use a generic computer program: Fast, but these programs read like robots pretending to be people, and failing. They ask boring, repetitive questions and don't behave like real users with real needs.

Enter SAGE.

The paper introduces SAGE (a Top-Down Bottom-Up Knowledge-Grounded User Simulator). Think of SAGE not as a generic robot, but as a Master Method Actor for testing AI. It doesn't just "act" like a human; it acts like a specific human with a specific job, a specific budget, and a specific reason for being there.

Here is how SAGE works, broken down into simple metaphors:

1. The "Top-Down" Approach: The Casting Director

Imagine you are casting a play. You don't just say, "We need a customer." You say, "We need a 35-year-old project manager named Sarah, who is stressed about her budget, loves efficiency, and is looking for a cleaning robot for her office."

  • The Metaphor: This is the Top-Down part. SAGE uses "Ideal Customer Profiles" (ICPs). It builds a detailed character sheet for the simulator based on real business logic.
  • Why it matters: A generic simulator might ask, "How much does this cost?" A SAGE simulator, playing "Sarah the Project Manager," asks, "Does this robot fit within my $10,000 Q3 budget, and can it handle the high-traffic hallway in my office?" This forces the AI to answer specific, realistic questions.
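To make the top-down idea concrete, here is a minimal sketch of how an "Ideal Customer Profile" might be represented and rendered into a persona instruction for the simulated user. All names and fields here (`CustomerProfile`, `to_system_prompt`, and the Sarah example) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CustomerProfile:
    """Hypothetical 'Ideal Customer Profile' (ICP) for seeding the simulator."""
    name: str
    role: str
    budget_usd: int
    goals: list
    constraints: list = field(default_factory=list)

    def to_system_prompt(self) -> str:
        # Render the profile as a persona instruction for the simulated user.
        lines = [
            f"You are {self.name}, a {self.role}.",
            f"Your budget is ${self.budget_usd:,}.",
            "Your goals: " + "; ".join(self.goals) + ".",
        ]
        if self.constraints:
            lines.append("Constraints: " + "; ".join(self.constraints) + ".")
        return " ".join(lines)

sarah = CustomerProfile(
    name="Sarah",
    role="project manager shopping for an office cleaning robot",
    budget_usd=10_000,
    goals=["stay within the Q3 budget", "handle a high-traffic hallway"],
)
print(sarah.to_system_prompt())
```

Seeding the simulator with a structured character sheet like this is what pushes it toward specific questions ("Does this fit my $10,000 Q3 budget?") rather than generic ones.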

2. The "Bottom-Up" Approach: The Scriptwriter

Now, imagine Sarah the Project Manager walks into the store. She has already read the company's website, looked at the product manual, and knows about the "50% off" sale mentioned in the FAQ.

  • The Metaphor: This is the Bottom-Up part. SAGE feeds the simulator the actual "script" of the business: the product catalogs, the FAQs, the technical specs, and the knowledge base.
  • Why it matters: If the AI says, "Our robot can fly," but the product manual says "It stays on the ground," a generic simulator might just nod along. SAGE, having read the manual, will immediately say, "Wait, your website says it's ground-only. Why are you telling me it flies?" This catches hallucinations (confidently stated falsehoods) and errors.
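The bottom-up grounding can be sketched as a check of agent claims against the business's own documents. This is a deliberately naive keyword version, just to show the shape of the idea; the knowledge base contents and the `find_contradiction` helper are assumptions for illustration, not the paper's actual retrieval or verification pipeline.

```python
# Hypothetical knowledge base the simulator has "read" before the chat.
knowledge_base = {
    "product_manual": "The robot is ground-only; it cannot fly.",
    "faq": "All cleaning robots are 50% off this quarter.",
}

def find_contradiction(agent_claim, docs):
    """Naive keyword check: return the id of a doc that conflicts
    with the agent's claim, or None if no conflict is detected."""
    claim = agent_claim.lower()
    for doc_id, text in docs.items():
        # A real system would use retrieval + an LLM judge here;
        # this hard-coded rule only covers the flying example.
        if "fly" in claim and "cannot fly" in text.lower():
            return doc_id
    return None

source = find_contradiction("Our robot can fly over obstacles.", knowledge_base)
print(source)  # the claim conflicts with the product manual
```

A grounded simulator that can point at the conflicting document is exactly what lets it push back with "your website says it's ground-only."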

3. The "Method Acting" Session

When SAGE talks to the AI agent, it's not a robot talking to a robot. It's a fully realized character interacting with the AI.

  • It might be grumpy because it's late.
  • It might be confused because it read two different webpages.
  • It might ask a follow-up question based on a specific detail it "remembered" from the product catalog.
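Putting the two halves together, each simulator turn can be thought of as composing the persona (top-down), the documents (bottom-up), and the dialogue history into one prompt for a language model. The function below is a sketch under that assumption; the prompt layout is invented, and the LLM is stubbed with a lambda so the example runs without an API.

```python
def next_user_turn(persona_prompt, docs, history, llm):
    """Compose the simulated user's next message: persona (top-down)
    + known documents (bottom-up) + conversation so far.
    `llm` is any text-in, text-out callable."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs.items())
    prompt = (
        persona_prompt
        + "\n\nWhat you already know:\n" + context
        + "\n\nConversation so far:\n" + "\n".join(history)
        + "\n\nReply in character as the customer."
    )
    return llm(prompt)

# Stubbed model so the sketch is self-contained.
reply = next_user_turn(
    persona_prompt="You are Sarah, a budget-conscious project manager.",
    docs={"faq": "All cleaning robots are 50% off this quarter."},
    history=["Agent: How can I help you today?"],
    llm=lambda p: "Does the 50% sale from your FAQ apply to the Model X?",
)
print(reply)
```

Because the persona and the documents travel with every turn, the character can stay grumpy, stay on budget, and "remember" details from the catalog across the whole conversation.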

The Result: Finding the Bugs

The paper tested SAGE against other simulators and found that:

  • It's more realistic: The conversations read like exchanges with real humans, not robots.
  • It's a better bug hunter: SAGE found 33% more errors than other methods.

Why? Because SAGE is the only one asking the hard, specific questions that real customers ask.

  • Generic Simulator: "Is this good?" -> AI: "Yes." -> Pass.
  • SAGE Simulator: "I need this for a farm in a dusty environment. Your manual says it's for indoor cafes. Does it have a dust seal?" -> AI: "Uh, yes, it's great for farms." -> FAIL. (The AI is lying; SAGE caught it).

The Bottom Line

SAGE is like a stress-test simulator for AI. Instead of just checking if the AI can say "Hello," it puts the AI in a complex, realistic scenario where it has to juggle a specific customer's personality, a specific budget, and a specific set of facts.

By combining who the customer is (Top-Down) with what the customer knows (Bottom-Up), SAGE creates a "digital twin" of a real customer. This helps companies find and fix their AI's mistakes before they ever annoy a real human, saving time, money, and reputation.