Runtime Burden Allocation for Structured LLM Routing in Agentic Expert Systems: A Full-Factorial Cross-Backend Methodology

This paper reframes structured LLM routing as a systems-level burden-allocation problem and demonstrates through a comprehensive cross-backend benchmark that optimal routing strategies are highly dependent on specific model interactions rather than universal rules, offering a practical framework for balancing correctness, latency, and cost in agentic AI systems.

Zhou Hanlin, Chan Huah Yong

Published 2026-04-03

The Big Idea: It's Not Just About the "Brain," It's About the "Delivery"

Imagine you are running a busy hospital emergency room. You have a highly intelligent Doctor (the AI Model) who can diagnose any problem. But before the Doctor can treat a patient, a Triage Nurse (the Router) must decide which department the patient goes to: Surgery, Pediatrics, Cardiology, or General Medicine.

For years, researchers have been obsessed with making the Doctor smarter. They ask, "Which Doctor is the best?"

This paper argues that we are asking the wrong question. The real problem isn't just who the Doctor is; it's how the Triage Nurse writes the referral note.

The authors call this "Runtime Burden Allocation." In plain English: Who does the heavy lifting of formatting the instructions?

  1. Option A: The Doctor writes the full, perfect, machine-readable referral note themselves.
  2. Option B: The Doctor writes a quick, messy shorthand note, and a Computer Script (local software) cleans it up and turns it into a perfect referral note.

The paper asks: Does it matter which option we choose?
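The two options can be sketched in code. This is a minimal illustration of the idea, not the paper's implementation; the function names, route labels, and the shorthand format are assumptions invented for this example.

```python
import json

ROUTES = {"surgery", "pediatrics", "cardiology", "general"}

def parse_option_a(model_output: str) -> str:
    """Option A: the model emits strict, machine-readable JSON; we parse it directly."""
    payload = json.loads(model_output)          # fails loudly if the note is malformed
    route = payload["route"]
    if route not in ROUTES:
        raise ValueError(f"unknown route: {route}")
    return route

def parse_option_b(model_output: str) -> str:
    """Option B: the model emits messy shorthand; a local script repairs it."""
    # Hypothetical shorthand, e.g. "-> cardio (chest pain)"
    token = model_output.strip().lstrip("->").strip().split()[0].lower()
    aliases = {"cardio": "cardiology", "peds": "pediatrics",
               "surg": "surgery", "gen": "general"}
    route = aliases.get(token, token)
    if route not in ROUTES:
        raise ValueError(f"could not repair shorthand: {model_output!r}")
    return route
```

Option B shifts the formatting burden from the model to the local script: the model's output is cheaper to produce, but the script must guess what the shorthand meant, and a bad guess becomes a bad referral.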

The Experiment: A Massive "What-If" Game

To find the answer, the researchers set up a giant experiment. They didn't just test one thing; they tested every possible combination of:

  • 3 Different Doctors: OpenAI (GPT), Google (Gemini), and Meta's Llama (an open-weight model).
  • 4 Different Ways of Writing Notes: From "Perfect JSON" (Option A) to "Shorthand + Cleanup" (Option B).
  • Different Constraints: Like giving the Doctor more time or less time.

They ran over 15,500 test cases. It was like running a full-scale simulation of a hospital for a month to see what happens when you change the paperwork rules.
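A full-factorial design like this is just the Cartesian product of the factors. The factor values below are placeholders standing in for the paper's exact conditions, not a reproduction of them.

```python
from itertools import product

# Hypothetical factor labels; the paper's actual condition names differ.
backends = ["openai", "gemini", "llama"]
formats = ["strict_json", "loose_json", "shorthand_repair", "shorthand_plain"]
constraints = ["tight_budget", "relaxed_budget"]

grid = list(product(backends, formats, constraints))
print(len(grid))  # 3 x 4 x 2 = 24 conditions; each condition is then run
                  # over many routing cases to reach 15,500+ total calls
```

The point of the full grid is that no cell is skipped, so interaction effects (a format that helps one backend but breaks another) cannot hide.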

The Shocking Discovery: There is No "One Size Fits All"

If you asked a typical tech expert, "Which way of writing notes is best?" they would say, "The one that is fastest and cheapest!"

The paper says: Nope.

The best method depends entirely on which Doctor you are using.

  • For the "Smart & Stable" Doctors (OpenAI & Google):

    • If you let them write the messy shorthand and clean it up later (Option B), they get faster and cheaper, but they start making terrible mistakes in their diagnosis. They get confused by the shorthand.
    • Analogy: It's like asking a brilliant surgeon to write a prescription in crayon. They might be faster, but the pharmacist (the computer script) might misread it, and the patient gets the wrong medicine.
    • Verdict: Let these Doctors write the full, perfect note themselves. It's safer.
  • For the "Fast but Fragile" Doctor (Llama):

    • If you ask this Doctor to write the full perfect note, they are actually quite good at it.
    • But if you try to use the "Shorthand + Cleanup" method, the system completely collapses. The Doctor gets so confused by the shorthand that the cleanup script can't fix it. The referral note becomes garbage.
    • Verdict: This Doctor is actually the fastest at writing the full note, but the "cleanup" method breaks their brain.

The "Streaming" Myth

The paper also tested Streaming. This is when the Doctor starts talking to you before they finish their whole thought (like seeing the first few words of a text message before the rest arrives).

  • The Finding: For a Triage Nurse, streaming is useless.
  • Analogy: Imagine the Triage Nurse starts handing you a piece of paper that says "Send to..." but stops there. You can't send the patient anywhere until the whole sentence is written. Seeing the first few words doesn't help you move the patient faster. You have to wait for the full sentence anyway.
  • Lesson: Don't waste energy trying to make the AI "stream" its answers for control tasks. Just wait for the full answer.
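The argument above can be made concrete: a routing decision is only actionable once the complete structured answer has arrived, so time-to-first-token buys nothing. This is a hypothetical sketch of consuming a streamed JSON response, not any particular SDK's API.

```python
import json

def route_from_stream(chunks):
    """Consume a streamed response chunk by chunk. The routing decision
    only becomes available once the whole JSON object has arrived, so
    streaming does not speed up a control task like routing."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        try:
            return json.loads(buffer)["route"]  # only succeeds on the final chunk
        except json.JSONDecodeError:
            continue                            # partial JSON: keep waiting
    raise ValueError("stream ended without a complete decision")

# Early chunks are useless: '{"rou' tells you nothing actionable.
chunks = ['{"rou', 'te": "card', 'iology"}']
```

Streaming helps a human reader, who can start reading a half-finished sentence; a router cannot act on half a decision.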

The Three Golden Rules for Developers

Based on this study, here is how you should build your AI system:

  1. Don't assume the "Fastest" method is the best.
    Just because a method saves money or time doesn't mean it works. If you switch to a "shorthand" method, you might save 50% on costs but lose 50% of your accuracy.
  2. Match the Method to the Model.
    You cannot use the same "paperwork rules" for every AI. What works for Google's AI might break OpenAI's AI or Llama. You have to test them separately.
  3. Protect the "Specialist" Routes.
    The study found that while the AI might be good at sending patients to "General Medicine," it might fail miserably at sending them to "Neurosurgery" if you use the wrong paperwork method. If a mistake is expensive (like sending a patient to the wrong specialist), stick to the safe, full-note method.
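Rule 3 can be expressed as a per-route policy table: high-stakes routes always use the safe full-note method, and only cheap, forgiving routes are allowed the shorthand path. The route names and method labels below are illustrative, not from the paper.

```python
# Hypothetical policy table: the cost of a wrong referral decides the method.
ROUTE_POLICY = {
    "general_medicine": "shorthand_repair",  # mistakes are cheap to correct
    "pediatrics":       "strict_json",
    "cardiology":       "strict_json",
    "neurosurgery":     "strict_json",       # a wrong referral is expensive
}

def pick_method(route: str) -> str:
    # Default to the safe full-note method for any route not
    # explicitly cleared for the shorthand path.
    return ROUTE_POLICY.get(route, "strict_json")
```

Defaulting unknown routes to the safe method means a new specialist route added tomorrow starts out protected rather than exposed.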

The Bottom Line

This paper is a wake-up call for AI engineers. We have been treating AI routing like a simple chatbot conversation. But in real-world systems (like expert systems), how you package the AI's answer is just as important as the answer itself.

You can't just pick the "best" AI and the "best" speed setting and hope for the best. You have to find the perfect match between your specific AI and the way you ask it to write its instructions. It's not about finding the fastest car; it's about knowing which car handles best on your specific road.
