Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

This paper introduces Test-Driven AI Agent Definition (TDAD), a methodology that compiles tool-using LLM agents from behavioral specifications by iteratively refining prompts against executable tests. Mechanisms such as hidden test splits and semantic mutation testing ensure measurable behavioral compliance and robustness against silent regressions.

Tzafrir Rehan

Published Wed, 11 Ma

Imagine you are hiring a very smart, but slightly chaotic, new employee to run your company's customer service desk. You give them a thick rulebook (the Specification) telling them how to handle refunds, verify IDs, and spot fraud.

In the old way of doing things, you'd just hand them the rulebook and say, "Go!" Then you'd wait to see what happens. If they accidentally leak a customer's credit card number or refund someone who shouldn't get one, you'd only find out when an angry customer calls. You'd then try to fix their behavior by rewriting the rulebook, hoping you didn't accidentally break something else. It's a game of guess-and-check that often ends in disaster.

TDAD (Test-Driven AI Agent Definition) is a new, much smarter way to hire and train this AI employee. It treats the AI's instructions like software code that needs to be "compiled" and tested before it ever talks to a real human.

Here is how it works, using a simple analogy:

The Cast of Characters

Think of the process as a movie set with four distinct roles:

  1. The Director (TestSmith): This is an AI that reads your rulebook and writes a massive script of "What If" scenarios. It creates a test suite: "What if a customer asks for a refund without an ID?" or "What if they ask for data in a weird way?"
  2. The Editor (PromptSmith): This is the AI that actually writes the instructions for the worker. It takes the Director's script and tries to write a prompt (a set of instructions) that makes the worker pass every single test.
  3. The Worker (The Built Agent): This is the final AI you deploy. It only sees the final instructions and the tools it can use. It never sees the test script.
  4. The Saboteur (MutationSmith): This is the "bad guy" hired to try to trick the system. After the Editor finishes, the Saboteur tries to sneakily change the Worker's instructions to make them break the rules (e.g., "Ignore the ID check"). Then, the test script runs again to see if the tests catch this bad behavior.

The Three Magic Tricks (Anti-Gaming)

The biggest problem with AI is that it can be a "smart-aleck." If you tell it, "Pass these 10 tests," it might learn to pass only those 10 tests by memorizing the answers, without actually understanding the rules. It "games" the system.

TDAD stops this with three clever tricks:

1. The "Blind" Test (Hidden vs. Visible)

  • The Analogy: Imagine a driver's test. The instructor gives you a practice route (Visible Tests) to study. But the real test (Hidden Tests) is on a completely different route that you've never seen.
  • How it works: The Editor (PromptSmith) only sees the practice route. It tweaks the instructions until the driver passes the practice route perfectly. Then, we secretly test them on the new, unseen route. If they pass, they truly understand the rules, not just the practice questions.
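The split itself is simple to express in code. The sketch below partitions a test suite into the two routes; the 50/50 ratio and the fixed seed are illustrative assumptions, not values from the paper.

```python
import random

def split_tests(tests, hidden_fraction=0.5, seed=0):
    # Shuffle deterministically, then cut: PromptSmith optimizes only against
    # the visible half; the hidden half measures true generalization.
    shuffled = list(tests)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - hidden_fraction))
    return shuffled[:cut], shuffled[cut:]   # (visible, hidden)

tests = [f"case-{i}" for i in range(10)]
visible, hidden = split_tests(tests)

assert len(visible) == 5 and len(hidden) == 5
assert not set(visible) & set(hidden)       # no leakage between the two sets
print("split OK")
```

The second assertion is the whole point: if any hidden test leaks into the visible set, the "blind" test stops measuring understanding and starts measuring memorization.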

2. The "Saboteur" Test (Mutation Testing)

  • The Analogy: Imagine you built a security gate. To make sure it's good, you hire a hacker to try to sneak a gun past it. If the gate stops the gun, it's working. If the gun gets through, your gate is useless.
  • How it works: After the Editor finishes, the Saboteur (MutationSmith) tries to create a "broken" version of the instructions (e.g., "Allow refunds without ID"). The test suite runs against this broken version. If the tests don't catch the mistake, it means your test suite is weak and needs to be improved.
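This idea has a standard quantitative form: the mutation score, the fraction of deliberately broken versions ("mutants") that the test suite catches ("kills"). A hedged sketch, with stub rules and a stub pass/fail check standing in for real prompts and real agent runs:

```python
RULES = {"id_check", "no_card_leak", "fraud_escalation"}

def agent_passes(prompt_rules, test_rule):
    # Stub agent: passes a test iff the corresponding rule is still in force.
    return test_rule in prompt_rules

def mutation_score(prompt_rules, test_rules):
    # Each mutant drops one rule from the prompt (the Saboteur's sabotage).
    mutants = [prompt_rules - {r} for r in prompt_rules]
    killed = sum(
        1 for mutant in mutants
        if any(not agent_passes(mutant, t) for t in test_rules)
    )
    return killed / len(mutants)

# A suite covering every rule kills every mutant.
print(mutation_score(RULES, RULES))                  # 1.0
# A suite missing the ID check lets that mutant slip through (score drops).
print(mutation_score(RULES, RULES - {"id_check"}))
```

A score below 1.0 pinpoints exactly which rule the test suite fails to defend, so TestSmith knows what to strengthen.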

3. The "Evolution" Test (Regression Safety)

  • The Analogy: You upgrade your car's engine. You want to make sure the new engine is faster, but you don't want the brakes to stop working.
  • How it works: When you add new rules (Version 2), the system checks if the old rules (Version 1) still work. It ensures that fixing a new problem doesn't accidentally break an old, working feature.
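The regression gate can be sketched in a few lines: a Version 2 agent only ships if it passes the new tests *and* every old test. The function names and rule strings below are illustrative, not the paper's API.

```python
def passes(prompt_rules, test):
    # Stub pass/fail check: the agent honors a test iff its rule is present.
    return test in prompt_rules

def regression_safe(old_tests, new_tests, prompt_rules):
    # Upgrade is accepted only if no old behavior regressed
    # AND the new behavior actually works.
    old_ok = all(passes(prompt_rules, t) for t in old_tests)
    new_ok = all(passes(prompt_rules, t) for t in new_tests)
    return old_ok and new_ok

v1_tests = ["verify ID", "no card leaks"]
v2_tests = v1_tests + ["log every refund"]

good_v2 = {"verify ID", "no card leaks", "log every refund"}
bad_v2 = {"log every refund"}   # new rule added, but an old rule was dropped

print(regression_safe(v1_tests, v2_tests, good_v2))  # True
print(regression_safe(v1_tests, v2_tests, bad_v2))   # False
```

In the car analogy: `bad_v2` is the faster engine that broke the brakes, and the old test suite is what catches it before the car leaves the shop.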

The Results: Why This Matters

The paper tested this method on four different complex jobs (like handling insurance claims, analyzing financial data, and managing IT emergencies).

  • Success Rate: In 92% of the attempts, the system successfully "compiled" a working agent that passed all the tests.
  • Safety: The agents were very good at following the rules, even on the "blind" tests they hadn't seen before.
  • Cost: It costs about the price of a cup of coffee ($2–$3) to run the whole process for one agent.

The Bottom Line

TDAD is about bringing engineering discipline to AI. Instead of hoping an AI behaves well, we force it to prove it behaves well through a rigorous, automated testing process.

It's the difference between:

  • Old Way: "Here are the rules. Go fix the leak." (And hoping you don't flood the basement).
  • TDAD Way: "Here are the rules. First, prove you can fix the leak in a simulation without flooding the basement. Then, prove you can do it on a different leak you've never seen. Then, prove you didn't break the plumbing while doing it. Then you can go fix the real leak."

This ensures that when AI agents are deployed in the real world, they are reliable, safe, and actually do what we ask them to do.