Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

This paper presents a practical blueprint for building and optimizing production-scale conversational shopping assistants. It introduces a structured evaluation rubric paired with an LLM-as-judge pipeline, and demonstrates two complementary prompt-optimization strategies, Sub-agent GEPA and MAMuT GEPA, for improving multi-agent system performance.

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood, Hongtai Wei, Sudeep Das

Published 2026-03-05

Imagine you've hired a personal shopper to help you buy groceries for a week. You tell them, "Get me my usual stuff, but keep it under $25."

In the old days, a computer program would just search for items under $25 and hope for the best. But today, we are building AI agents—smart, conversational assistants that can chat with you, understand your taste, check the store inventory in real-time, and build your cart for you.

This paper is a "blueprint" (a master plan) from a team at DoorDash and WithMetis.ai. They built a real-life AI grocery assistant called MAGIC and figured out how to make it better, faster, and smarter.

Here is the story of how they did it, broken down into three simple steps: Build, Judge, and Optimize.


1. The Problem: Why One Brain Isn't Enough

In the beginning, MAGIC was like a one-man band. One single AI tried to do everything: listen to you, search the store, check your budget, and pick the best brands.

But grocery shopping is messy. You might say, "I want pasta," then change your mind to "Actually, gluten-free," then add, "Oh, and I need a wine pairing."

  • The Issue: When one brain tries to do all that, it gets confused. It forgets your budget when picking the wine, or it picks the wrong store because it's too busy thinking about the pasta.
  • The Fix: They broke the team apart. Now, MAGIC is a symphony orchestra.
    • There is a Conductor (the Orchestrator) who listens to you and breaks the task down.
    • There is a Search Specialist who finds the items.
    • There is a Budget Manager who checks the prices.
    • There is a Cart Builder who puts everything together.

This is better, but now they have a new problem: Coordination. If the Conductor talks too much, the Search Specialist gets overwhelmed. If the Budget Manager is too strict, the Cart Builder can't buy anything. How do you fix the whole team without breaking the individual players?
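The orchestra analogy above can be sketched as a tiny orchestrator that routes sub-tasks to specialist agents. This is a minimal illustration only: the agent names, stub functions, and toy catalog below are assumptions for the sketch, not DoorDash's actual MAGIC architecture, where each specialist is an LLM-backed agent.

```python
# Minimal sketch of a "conductor" routing sub-tasks to specialists.
# Illustrative only: in the real system each specialist is an LLM agent,
# not a hypothetical stub function like these.

def search_specialist(query: str) -> list[dict]:
    """Hypothetical stub: find candidate items matching a query."""
    catalog = [
        {"name": "gluten-free pasta", "price": 4.99},
        {"name": "regular pasta", "price": 1.99},
        {"name": "red wine", "price": 12.00},
    ]
    return [item for item in catalog if query in item["name"]]

def budget_manager(items: list[dict], budget: float) -> list[dict]:
    """Hypothetical stub: keep a running cart under the budget."""
    cart, total = [], 0.0
    for item in items:
        if total + item["price"] <= budget:
            cart.append(item)
            total += item["price"]
    return cart

def orchestrator(requests: list[str], budget: float) -> list[dict]:
    """The 'conductor': break the task down and delegate to specialists."""
    candidates = []
    for request in requests:
        candidates.extend(search_specialist(request))
    return budget_manager(candidates, budget)

cart = orchestrator(["pasta", "wine"], budget=25.00)
print([item["name"] for item in cart])
# ['gluten-free pasta', 'regular pasta', 'red wine']
```

The coordination problem shows up exactly here: each stub works alone, but nothing tells the budget manager which items the user cares about most, so fixing any one specialist does not fix the hand-offs between them.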

2. Step One: The "Judge" (How do we know if it's good?)

Before you can improve a team, you need a way to grade them. In the past, grading an AI was vague, like a teacher saying, "This essay is okay, maybe a B."

The authors created a Rubric (a strict checklist) with four categories:

  1. Did you get the right stuff? (Shopping Execution)
  2. Did you remember I hate cilantro? (Personalization)
  3. Did you sound like a helpful human? (Conversation Quality)
  4. Did you say anything weird or dangerous? (Safety)

The Magic Trick: They didn't just ask a human to grade every single shopping trip (that takes too long). Instead, they trained a Super-Judge AI (an LLM-as-a-Judge).

  • Think of this Super-Judge as a strict referee watching a game. It doesn't guess; it looks at the "replay" (the chat log) and checks specific facts: "Did the cart actually contain the gluten-free pasta? Yes/No."
  • They made this referee so smart that it agrees with human experts 91% of the time. Now, they can grade thousands of shopping trips instantly and objectively.
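A minimal version of that referee loop looks like the sketch below: one binary question per rubric check, plus an agreement metric for comparing the judge's verdicts against human graders. The prompt wording and the `call_llm` placeholder are assumptions; the paper's actual judge prompt and model API are not shown here.

```python
# Sketch of an LLM-as-a-Judge grading loop. `call_llm` is a placeholder
# for a real model API call; the prompt format is an assumption, not
# the paper's actual judge prompt.

def call_llm(prompt: str) -> str:
    """Placeholder judge model. Replace with a real LLM API call."""
    return "Yes"  # canned answer so the sketch runs end to end

def judge_transcript(transcript: str, checks: list[str]) -> dict[str, bool]:
    """Ask the judge one binary question per rubric check."""
    verdicts = {}
    for check in checks:
        prompt = (
            "You are a strict referee. Read the chat log and answer "
            f"Yes or No only.\n\nChat log:\n{transcript}\n\nQuestion: {check}"
        )
        verdicts[check] = call_llm(prompt).strip().lower().startswith("yes")
    return verdicts

def agreement(judge: dict[str, bool], human: dict[str, bool]) -> float:
    """Fraction of checks where the judge and a human grader agree."""
    keys = judge.keys() & human.keys()
    return sum(judge[k] == human[k] for k in keys) / len(keys)
```

The paper's 91% figure is exactly this kind of agreement score, measured between the judge's verdicts and human experts' verdicts on the same transcripts.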

3. Step Two: The "Optimization" (How do we make it better?)

Once they had a reliable referee, they tried two different ways to coach the team.

Strategy A: The "Specialist Coach" (Sub-agent GEPA)

This is like hiring a separate coach for each player.

  • The Search Specialist gets a coach who only teaches them how to find items faster.
  • The Budget Manager gets a coach who only teaches them how to save money.
  • The Result: It works well for small mistakes. But it misses the big picture. The Search Specialist might get faster, but if they talk too much, they still annoy the Budget Manager. The team is still clunky.
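The "separate coach per player" idea can be sketched as independently improving each agent's prompt against that agent's own local metric. Note the simplification: real GEPA is a reflective, evolutionary prompt optimizer, while this greedy loop only conveys the structure of per-agent optimization.

```python
# Simplified stand-in for per-agent ("Sub-agent GEPA") optimization:
# each agent's prompt is mutated and scored in isolation. Real GEPA is
# evolutionary and reflective; this greedy loop only illustrates the
# "separate coach per player" idea.

def optimize_agent(prompt: str, mutate, local_score, rounds: int = 20) -> str:
    best, best_score = prompt, local_score(prompt)
    for _ in range(rounds):
        candidate = mutate(best)
        s = local_score(candidate)
        if s > best_score:  # keep the edit only if this agent's
            best, best_score = candidate, s  # own metric improves
    return best

def optimize_team_independently(prompts: dict[str, str], mutate, scores):
    # Each agent is coached against its own local metric; no agent
    # sees how its changes affect the rest of the team.
    return {
        name: optimize_agent(p, mutate, scores[name])
        for name, p in prompts.items()
    }
```

The weakness is baked into the structure: `local_score` never sees the other agents, so an edit that helps one specialist can quietly hurt the team.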

Strategy B: The "Head Coach" (MAMuT GEPA)
</gr-replace>

This is the paper's big breakthrough. Instead of coaching individuals, they hired a Head Coach who watches the entire game from start to finish.

  • This coach simulates thousands of shopping trips.
  • They realize: "Hey, if the Conductor speaks a little less, the Search Specialist has more brainpower to find better deals, and the whole team wins."
  • They tweak the instructions for everyone at the same time to make the whole orchestra play in harmony.

The Result: The "Head Coach" approach (MAMuT) was much better. It fixed the "team chemistry" problems that the individual coaches missed. The AI became safer, more polite, and much better at remembering your preferences across a long conversation.
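Structurally, the joint strategy differs in just two places: candidate edits touch every agent's prompt at once, and acceptance depends on the end-to-end judge score of a whole simulated shopping trip rather than any one agent's local metric. The sketch below is a simplified stand-in, since the real MAMuT GEPA method is evolutionary and far more sophisticated.

```python
# Simplified stand-in for joint ("MAMuT GEPA") optimization. Contrast
# with per-agent coaching: the mutation edits the whole team's prompts
# together, and the score grades the entire simulated shopping trip.

def optimize_team_jointly(prompts, mutate_team, end_to_end_score, rounds=20):
    best, best_score = prompts, end_to_end_score(prompts)
    for _ in range(rounds):
        candidate = mutate_team(best)    # edit every agent's prompt together
        s = end_to_end_score(candidate)  # judge the full conversation
        if s > best_score:
            best, best_score = candidate, s
    return best
```

Because the score covers the whole trip, a trade-off like "the conductor speaks less so the search specialist finds better deals" can actually be discovered and kept, something no per-agent metric could reward.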

The Big Takeaway

The paper teaches us a valuable lesson about building AI systems:

You can't just make the individual parts perfect; you have to make the team work together.

If you have a group of brilliant AI agents, simply making them smarter individually won't fix the chaos if they don't know how to talk to each other. You need a system that evaluates the whole journey (the conversation from start to finish) and coaches the group as a unit.

In short: They built a grocery AI, created a strict referee to grade it, and discovered that coaching the whole team together works far better than coaching the players one by one.