Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

This paper presents a practical blueprint for building and optimizing production-scale conversational shopping assistants. It introduces a structured evaluation rubric paired with an LLM-as-judge pipeline, and demonstrates two complementary prompt-optimization strategies, Sub-agent GEPA and MAMuT GEPA, for improving multi-agent system performance.

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood, Hongtai Wei, Sudeep Das

Published 2026-03-05

Imagine you've hired a personal shopper to help you buy groceries for a week. You tell them, "Get me my usual stuff, but keep it under $25."

In the old days, a computer program would just search for items under $25 and hope for the best. But today, we are building AI agents—smart, conversational assistants that can chat with you, understand your taste, check the store inventory in real-time, and build your cart for you.

This paper is a "blueprint" (a master plan) from a team at DoorDash and WithMetis.ai. They built a real-life AI grocery assistant called MAGIC and figured out how to make it better, faster, and smarter.

Here is the story of how they did it, broken down into three simple steps: Build, Judge, and Optimize.


1. The Problem: Why One Brain Isn't Enough

In the beginning, MAGIC was like a one-man band. One single AI tried to do everything: listen to you, search the store, check your budget, and pick the best brands.

But grocery shopping is messy. You might say, "I want pasta," then change your mind to "Actually, gluten-free," then add, "Oh, and I need a wine pairing."

  • The Issue: When one brain tries to do all that, it gets confused. It forgets your budget when picking the wine, or it picks the wrong store because it's too busy thinking about the pasta.
  • The Fix: They broke the team apart. Now, MAGIC is a symphony orchestra.
    • There is a Conductor (the Orchestrator) who listens to you and breaks the task down.
    • There is a Search Specialist who finds the items.
    • There is a Budget Manager who checks the prices.
    • There is a Cart Builder who puts everything together.

This is better, but now they have a new problem: Coordination. If the Conductor talks too much, the Search Specialist gets overwhelmed. If the Budget Manager is too strict, the Cart Builder can't buy anything. How do you fix the whole team without breaking the individual players?
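The orchestra analogy above can be sketched as a tiny orchestrator that routes sub-tasks to specialist agents. This is a minimal illustration only: the agent names, stub functions, and toy catalog below are assumptions for the sketch, not DoorDash's actual MAGIC architecture, where each specialist is an LLM-backed agent.

```python
# Minimal sketch of a "conductor" routing sub-tasks to specialists.
# Illustrative only: in the real system each specialist is an LLM agent,
# not a hypothetical stub function like these.

def search_specialist(query: str) -> list[dict]:
    """Hypothetical stub: find candidate items matching a query."""
    catalog = [
        {"name": "gluten-free pasta", "price": 4.99},
        {"name": "regular pasta", "price": 1.99},
        {"name": "red wine", "price": 12.00},
    ]
    return [item for item in catalog if query in item["name"]]

def budget_manager(items: list[dict], budget: float) -> list[dict]:
    """Hypothetical stub: keep a running cart under the budget."""
    cart, total = [], 0.0
    for item in items:
        if total + item["price"] <= budget:
            cart.append(item)
            total += item["price"]
    return cart

def orchestrator(requests: list[str], budget: float) -> list[dict]:
    """The 'conductor': break the task down and delegate to specialists."""
    candidates = []
    for request in requests:
        candidates.extend(search_specialist(request))
    return budget_manager(candidates, budget)

cart = orchestrator(["pasta", "wine"], budget=25.00)
print([item["name"] for item in cart])
# ['gluten-free pasta', 'regular pasta', 'red wine']
```

The coordination problem shows up exactly here: each stub works alone, but nothing tells the budget manager which items the user cares about most, so fixing any one specialist does not fix the hand-offs between them.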

2. Step One: The "Judge" (How do we know if it's good?)

Before you can improve a team, you need a way to grade them. In the past, grading an AI was vague, like a teacher saying, "This essay is okay, maybe a B."

The authors created a Rubric (a strict checklist) with four categories:

  1. Did you get the right stuff? (Shopping Execution)
  2. Did you remember I hate cilantro? (Personalization)
  3. Did you sound like a helpful human? (Conversation Quality)
  4. Did you say anything weird or dangerous? (Safety)

The Magic Trick: They didn't just ask a human to grade every single shopping trip (that takes too long). Instead, they trained a Super-Judge AI (an LLM-as-a-Judge).

  • Think of this Super-Judge as a strict referee watching a game. It doesn't guess; it looks at the "replay" (the chat log) and checks specific facts: "Did the cart actually contain the gluten-free pasta? Yes/No."
  • They made this referee so smart that it agrees with human experts 91% of the time. Now, they can grade thousands of shopping trips instantly and objectively.
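A minimal version of that referee loop looks like the sketch below: one binary question per rubric check, plus an agreement metric for comparing the judge's verdicts against human graders. The prompt wording and the `call_llm` placeholder are assumptions; the paper's actual judge prompt and model API are not shown here.

```python
# Sketch of an LLM-as-a-Judge grading loop. `call_llm` is a placeholder
# for a real model API call; the prompt format is an assumption, not
# the paper's actual judge prompt.

def call_llm(prompt: str) -> str:
    """Placeholder judge model. Replace with a real LLM API call."""
    return "Yes"  # canned answer so the sketch runs end to end

def judge_transcript(transcript: str, checks: list[str]) -> dict[str, bool]:
    """Ask the judge one binary question per rubric check."""
    verdicts = {}
    for check in checks:
        prompt = (
            "You are a strict referee. Read the chat log and answer "
            f"Yes or No only.\n\nChat log:\n{transcript}\n\nQuestion: {check}"
        )
        verdicts[check] = call_llm(prompt).strip().lower().startswith("yes")
    return verdicts

def agreement(judge: dict[str, bool], human: dict[str, bool]) -> float:
    """Fraction of checks where the judge and a human grader agree."""
    keys = judge.keys() & human.keys()
    return sum(judge[k] == human[k] for k in keys) / len(keys)
```

The paper's 91% figure is exactly this kind of agreement score, measured between the judge's verdicts and human experts' verdicts on the same transcripts.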

3. Step Two: The "Optimization" (How do we make it better?)

Once they had a reliable referee, they tried two different ways to coach the team.

Strategy A: The "Specialist Coach" (Sub-agent GEPA)

This is like hiring a separate coach for each player.

  • The Search Specialist gets a coach who only teaches them how to find items faster.
  • The Budget Manager gets a coach who only teaches them how to save money.
  • The Result: It works well for small mistakes. But it misses the big picture. The Search Specialist might get faster, but if they talk too much, they still annoy the Budget Manager. The team is still clunky.
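The "separate coach per player" idea can be sketched as independently improving each agent's prompt against that agent's own local metric. Note the simplification: real GEPA is a reflective, evolutionary prompt optimizer, while this greedy loop only conveys the structure of per-agent optimization.

```python
# Simplified stand-in for per-agent ("Sub-agent GEPA") optimization:
# each agent's prompt is mutated and scored in isolation. Real GEPA is
# evolutionary and reflective; this greedy loop only illustrates the
# "separate coach per player" idea.

def optimize_agent(prompt: str, mutate, local_score, rounds: int = 20) -> str:
    best, best_score = prompt, local_score(prompt)
    for _ in range(rounds):
        candidate = mutate(best)
        s = local_score(candidate)
        if s > best_score:  # keep the edit only if this agent's
            best, best_score = candidate, s  # own metric improves
    return best

def optimize_team_independently(prompts: dict[str, str], mutate, scores):
    # Each agent is coached against its own local metric; no agent
    # sees how its changes affect the rest of the team.
    return {
        name: optimize_agent(p, mutate, scores[name])
        for name, p in prompts.items()
    }
```

The weakness is baked into the structure: `local_score` never sees the other agents, so an edit that helps one specialist can quietly hurt the team.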

Strategy B: The "Head Coach" (MAMuT GEPA)
</gr-replace>

This is the paper's big breakthrough. Instead of coaching individuals, they hired a Head Coach who watches the entire game from start to finish.

  • This coach simulates thousands of shopping trips.
  • They realize: "Hey, if the Conductor speaks a little less, the Search Specialist has more brainpower to find better deals, and the whole team wins."
  • They tweak the instructions for everyone at the same time to make the whole orchestra play in harmony.

The Result: The "Head Coach" approach (MAMuT) was much better. It fixed the "team chemistry" problems that the individual coaches missed. The AI became safer, more polite, and much better at remembering your preferences across a long conversation.
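Structurally, the joint strategy differs in just two places: candidate edits touch every agent's prompt at once, and acceptance depends on the end-to-end judge score of a whole simulated shopping trip rather than any one agent's local metric. The sketch below is a simplified stand-in, since the real MAMuT GEPA method is evolutionary and far more sophisticated.

```python
# Simplified stand-in for joint ("MAMuT GEPA") optimization. Contrast
# with per-agent coaching: the mutation edits the whole team's prompts
# together, and the score grades the entire simulated shopping trip.

def optimize_team_jointly(prompts, mutate_team, end_to_end_score, rounds=20):
    best, best_score = prompts, end_to_end_score(prompts)
    for _ in range(rounds):
        candidate = mutate_team(best)    # edit every agent's prompt together
        s = end_to_end_score(candidate)  # judge the full conversation
        if s > best_score:
            best, best_score = candidate, s
    return best
```

Because the score covers the whole trip, a trade-off like "the conductor speaks less so the search specialist finds better deals" can actually be discovered and kept, something no per-agent metric could reward.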

The Big Takeaway

The paper teaches us a valuable lesson about building AI systems:

You can't just make the individual parts perfect; you have to make the team work together.

If you have a group of brilliant AI agents, simply making them smarter individually won't fix the chaos if they don't know how to talk to each other. You need a system that evaluates the whole journey (the conversation from start to finish) and coaches the group as a unit.

In short: They built a grocery AI, created a strict referee to grade it, and discovered that coaching the whole team together works far better than coaching the players one by one.