Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

This paper introduces Test-Driven AI Agent Definition (TDAD), a methodology that compiles tool-using LLM agents from behavioral specifications by iteratively refining prompts against executable tests. Mechanisms such as hidden test splits and semantic mutation testing ensure measurable behavioral compliance and robustness against silent regressions.

Tzafrir Rehan

Published Wed, 11 Ma

Imagine you are hiring a very smart, but slightly chaotic, new employee to run your company's customer service desk. You give them a thick rulebook (the Specification) telling them how to handle refunds, verify IDs, and spot fraud.

In the old way of doing things, you'd just hand them the rulebook and say, "Go!" Then you'd wait to see what happens. If they accidentally leak a customer's credit card number or refund someone who shouldn't get one, you'd only find out when an angry customer calls. You'd then try to fix their behavior by rewriting the rulebook, hoping you didn't accidentally break something else. It's a game of guess-and-check that often ends in disaster.

TDAD (Test-Driven AI Agent Definition) is a new, much smarter way to hire and train this AI employee. It treats the AI's instructions like software code that needs to be "compiled" and tested before it ever talks to a real human.

Here is how it works, using a simple analogy:

The Cast of Characters

Think of the process as a movie set with four distinct roles:

  1. The Director (TestSmith): This is an AI that reads your rulebook and writes a massive script of "What If" scenarios. It creates a test suite: "What if a customer asks for a refund without an ID?" or "What if they ask for data in a weird way?"
  2. The Editor (PromptSmith): This is the AI that actually writes the instructions for the worker. It takes the Director's script and tries to write a prompt (a set of instructions) that makes the worker pass every single test.
  3. The Worker (The Built Agent): This is the final AI you deploy. It only sees the final instructions and the tools it can use. It never sees the test script.
  4. The Saboteur (MutationSmith): This is the "bad guy" hired to try to trick the system. After the Editor finishes, the Saboteur tries to sneakily change the Worker's instructions to make them break the rules (e.g., "Ignore the ID check"). Then, the test script runs again to see if the tests catch this bad behavior.

The Three Magic Tricks (Anti-Gaming)

The biggest problem with AI is that it can be a "smart-aleck." If you tell it, "Pass these 10 tests," it might learn to pass only those 10 tests by memorizing the answers, without actually understanding the rules. It "games" the system.

TDAD stops this with three clever tricks:

1. The "Blind" Test (Hidden vs. Visible)

  • The Analogy: Imagine a driver's test. The instructor gives you a practice route (Visible Tests) to study. But the real test (Hidden Tests) is on a completely different route that you've never seen.
  • How it works: The Editor (PromptSmith) only sees the practice route. It tweaks the instructions until the driver passes the practice route perfectly. Then, we secretly test them on the new, unseen route. If they pass, they truly understand the rules, not just the practice questions.
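The split itself is simple to express in code. The sketch below partitions a test suite into the two routes; the 50/50 ratio and the fixed seed are illustrative assumptions, not values from the paper.

```python
import random

def split_tests(tests, hidden_fraction=0.5, seed=0):
    # Shuffle deterministically, then cut: PromptSmith optimizes only against
    # the visible half; the hidden half measures true generalization.
    shuffled = list(tests)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - hidden_fraction))
    return shuffled[:cut], shuffled[cut:]   # (visible, hidden)

tests = [f"case-{i}" for i in range(10)]
visible, hidden = split_tests(tests)

assert len(visible) == 5 and len(hidden) == 5
assert not set(visible) & set(hidden)       # no leakage between the two sets
print("split OK")
```

The second assertion is the whole point: if any hidden test leaks into the visible set, the "blind" test stops measuring understanding and starts measuring memorization.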

2. The "Saboteur" Test (Mutation Testing)

  • The Analogy: Imagine you built a security gate. To make sure it's good, you hire a hacker to try to sneak a gun past it. If the gate stops the gun, it's working. If the gun gets through, your gate is useless.
  • How it works: After the Editor finishes, the Saboteur (MutationSmith) tries to create a "broken" version of the instructions (e.g., "Allow refunds without ID"). The test suite runs against this broken version. If the tests don't catch the mistake, it means your test suite is weak and needs to be improved.
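This idea has a standard quantitative form: the mutation score, the fraction of deliberately broken versions ("mutants") that the test suite catches ("kills"). A hedged sketch, with stub rules and a stub pass/fail check standing in for real prompts and real agent runs:

```python
RULES = {"id_check", "no_card_leak", "fraud_escalation"}

def agent_passes(prompt_rules, test_rule):
    # Stub agent: passes a test iff the corresponding rule is still in force.
    return test_rule in prompt_rules

def mutation_score(prompt_rules, test_rules):
    # Each mutant drops one rule from the prompt (the Saboteur's sabotage).
    mutants = [prompt_rules - {r} for r in prompt_rules]
    killed = sum(
        1 for mutant in mutants
        if any(not agent_passes(mutant, t) for t in test_rules)
    )
    return killed / len(mutants)

# A suite covering every rule kills every mutant.
print(mutation_score(RULES, RULES))                  # 1.0
# A suite missing the ID check lets that mutant slip through (score drops).
print(mutation_score(RULES, RULES - {"id_check"}))
```

A score below 1.0 pinpoints exactly which rule the test suite fails to defend, so TestSmith knows what to strengthen.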

3. The "Evolution" Test (Regression Safety)

  • The Analogy: You upgrade your car's engine. You want to make sure the new engine is faster, but you don't want the brakes to stop working.
  • How it works: When you add new rules (Version 2), the system checks if the old rules (Version 1) still work. It ensures that fixing a new problem doesn't accidentally break an old, working feature.
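The regression gate can be sketched in a few lines: a Version 2 agent only ships if it passes the new tests *and* every old test. The function names and rule strings below are illustrative, not the paper's API.

```python
def passes(prompt_rules, test):
    # Stub pass/fail check: the agent honors a test iff its rule is present.
    return test in prompt_rules

def regression_safe(old_tests, new_tests, prompt_rules):
    # Upgrade is accepted only if no old behavior regressed
    # AND the new behavior actually works.
    old_ok = all(passes(prompt_rules, t) for t in old_tests)
    new_ok = all(passes(prompt_rules, t) for t in new_tests)
    return old_ok and new_ok

v1_tests = ["verify ID", "no card leaks"]
v2_tests = v1_tests + ["log every refund"]

good_v2 = {"verify ID", "no card leaks", "log every refund"}
bad_v2 = {"log every refund"}   # new rule added, but an old rule was dropped

print(regression_safe(v1_tests, v2_tests, good_v2))  # True
print(regression_safe(v1_tests, v2_tests, bad_v2))   # False
```

In the car analogy: `bad_v2` is the faster engine that broke the brakes, and the old test suite is what catches it before the car leaves the shop.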

The Results: Why This Matters

The paper tested this method on four different complex jobs (like handling insurance claims, analyzing financial data, and managing IT emergencies).

  • Success Rate: In 92% of the attempts, the system successfully "compiled" a working agent that passed all the tests.
  • Safety: The agents were very good at following the rules, even on the "blind" tests they hadn't seen before.
  • Cost: It costs about the price of a cup of coffee ($2–$3) to run the whole process for one agent.

The Bottom Line

TDAD is about bringing engineering discipline to AI. Instead of hoping an AI behaves well, we force it to prove it behaves well through a rigorous, automated testing process.

It's the difference between:

  • Old Way: "Here are the rules. Go fix the leak." (And hoping you don't flood the basement).
  • TDAD Way: "Here are the rules. First, prove you can fix the leak in a simulation without flooding the basement. Then, prove you can do it on a different leak you've never seen. Then, prove you didn't break the plumbing while doing it. Then you can go fix the real leak."

This ensures that when AI agents are deployed in the real world, they are reliable, safe, and actually do what we ask them to do.