Software Delegation Contracts: Measuring Reviewability… — Plain-Language Explanation

Imagine you are a manager who hires a very smart, but sometimes chatty, intern to fix a leaky faucet in your house.

In the past, you might just say, "Hey, fix that leak." The intern goes off, does the work, and comes back with a wrench and a dry floor. You check the floor, it's dry, and you say, "Great job!"

But what if the intern is an AI? And what if, instead of just fixing the leak, the intern is also allowed to rearrange your furniture, repaint the walls, or even move the sink, as long as they tell you they did it?

This paper asks a simple question: If you give the intern a strict, written "contract" telling them exactly what they are allowed to do and exactly what proof they must bring back, does that make it easier for you to check their work?

The author, Vincent Schmalbach, ran a small experiment to find out. Here is what happened, explained simply.

The Setup: The "Toy" House

The researcher built a tiny, fake software project (a small web service) with a few intentional "bugs" (leaks). He created 10 different tasks, like "fix this failing test" or "update the documentation."

He then sent these tasks to two different AI "interns" (one very smart, one faster but slightly less smart) under three different rules:

The Casual Ask: Just a normal message saying, "Fix this bug." (Like telling a friend to "fix the sink.")
The Contract: A formal document saying, "You can only touch these two files. You cannot touch the database. When you are done, you must list every file you changed and explain why."
The Contract + Evidence Checklist: The same contract, but with a mandatory checklist the AI had to fill out, including a section on "What could still go wrong" and a "Reviewer Checklist."

The Results: The "Dry Floor" vs. The "Report"

The study measured two things: Did the work actually get done? and Was it easy to review?

1. The Work Was Already Perfect (The "Dry Floor")
Surprisingly, it didn't matter which rule the AI followed. Whether it got a casual request or a strict contract, every single run fixed the bug correctly.

The "leak" was fixed.
The AI didn't break anything else.
The AI didn't touch files it wasn't supposed to.

Why? Because the tasks were small and the AI was smart enough to figure them out anyway. The "contract" didn't make the AI better at fixing the code. The work was already 100% correct.

2. The Review Became Much Easier (The "Report")
This is where the magic happened. Even though the code was perfect in both cases, the Contract made the AI's report much easier for a human to read and trust.

Without the contract: The AI would fix the bug and say, "Done." It rarely explained which files it changed or why. It was like the intern fixing the sink but leaving the tools scattered everywhere without a note.
With the contract: The AI provided a neat package. It listed every file changed, explained the reasoning, listed the tests it ran, and even admitted, "Here is a small risk that might still exist."

The Analogy:
Imagine the AI is a chef.

No Contract: The chef brings you a perfect steak. You eat it, and it's delicious. But you have no idea if they used fresh ingredients or if they washed their hands. You have to guess.
With Contract: The chef brings you the same perfect steak, but also brings a receipt showing the ingredients, a photo of the kitchen, and a note saying, "I cooked this for 4 minutes, but if you like it rare, you might want to cook it longer."
The Result: The steak tasted the same, but the second version was much easier to trust and approve.

The Cost: It Takes a Little Longer

The only downside was speed and "cost."

The AI used about 13% more "brain power" (tokens) to write the report.
It took about 38% longer to finish the task.

Think of it like paying a little extra for shipping with detailed tracking. The package arrives somewhat later, but you know exactly where it is and what's inside.

The Big Takeaway

The paper concludes that for small, clear tasks, you don't need a contract to get the work done, but you DO need a contract to get work that is easy to review.

Correctness: The AI is already good enough to fix small bugs on its own.
Reviewability: The AI is not good at explaining itself unless you explicitly ask it to.

The "contract" acts like a translator. It doesn't make the AI smarter; it just forces the AI to speak in a language that humans (or other AIs) can easily understand and verify.

A Note on the "Weaker" Intern

The study found that the "weaker" AI (the faster, cheaper one) benefited the most from the contract, roughly twice as much as the smarter one. The smarter AI tended to write decent reports on its own, while the weaker one wrote them mostly when the contract required it. This suggests that contracts matter most when you are using cheaper AI tools and still want work you can easily check.

Summary

Did the contract make the code better? No. The code was already perfect.
Did the contract make the report better? Yes, huge difference.
Did it cost extra? Yes, a little time and money.
Verdict: If you want to trust AI work, don't just ask for the fix. Give the AI a contract that forces it to show its work.

Technical Summary: Software Delegation Contracts in AI Coding-Agent Work

Problem Statement
As AI coding agents transition from interactive completion to delegated execution (accepting tasks, operating within bounded authority, and returning work packages for human review), the critical question for human supervisors shifts from "Does the patch work?" to "Can I review this?" Previous conceptual work proposed the software delegation contract (defined as the tuple of Task, Authority, Work Package, and Acceptance Context) as the unit of analysis for this relationship. However, this framework lacked empirical validation. This paper addresses the gap by investigating whether making the delegation contract explicit improves the reviewability of returned work packages and by quantifying the associated costs.
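To make the four-part tuple concrete, here is a minimal sketch of how a delegation contract might be represented and how the Authority part could drive an automated scope check. The class names, field names, file paths, and the example task are all illustrative assumptions and are not taken from the paper's harness.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List, Tuple


@dataclass(frozen=True)
class Authority:
    """What the agent may touch (hypothetical field names)."""
    allowed_paths: Tuple[str, ...]          # glob patterns the agent may modify
    forbidden_paths: Tuple[str, ...] = ()   # glob patterns that are always off limits


@dataclass(frozen=True)
class DelegationContract:
    """Mirrors the tuple: Task, Authority, Work Package, Acceptance Context."""
    task: str                                    # the objective to accomplish
    authority: Authority                         # bounded permissions
    work_package_sections: Tuple[str, ...]       # evidence the agent must return
    acceptance_context: str                      # how the result will be judged


def scope_violations(contract: DelegationContract, changed_files: List[str]) -> List[str]:
    """Return the changed files that fall outside the granted authority."""
    auth = contract.authority
    violations = []
    for path in changed_files:
        allowed = any(fnmatch(path, pat) for pat in auth.allowed_paths)
        forbidden = any(fnmatch(path, pat) for pat in auth.forbidden_paths)
        if forbidden or not allowed:
            violations.append(path)
    return violations


# Hypothetical contract for an invented validation-bug task.
contract = DelegationContract(
    task="Reject negative quantities in the request validator",
    authority=Authority(
        allowed_paths=("src/validation.ts", "test/*.test.ts"),
        forbidden_paths=("src/db/*",),
    ),
    work_package_sections=("changed files with reasons", "tests run", "known limitations"),
    acceptance_context="Hidden acceptance tests plus review of the work package",
)

print(scope_violations(contract, ["src/validation.ts", "src/db/schema.ts"]))
# prints ['src/db/schema.ts']
```

Keeping authority as data rather than prose is one possible design choice: it lets the same contract be rendered into a prompt for the agent and checked mechanically against the returned patch.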

Methodology
The author conducted a controlled pilot study using a purpose-built experimental harness:

Environment: A dependency-free TypeScript HTTP API (~600 lines) with 29 tests. Ten tasks were seeded across five families (failing-test fixes, validation bugs, missing tests, scoped refactors, and documentation updates), each with specific ground-truth defects and latent issues.
Agents: Runs were executed using two model tiers (Claude Sonnet 4.6 and Haiku 4.5) within a bounded sandbox (no network, no external dependencies, no commits).
Conditions: 64 total runs were distributed across three conditions:
1. Baseline (A): A realistic, issue-style prompt with natural scope hints.
2. Explicit Contract (B): A structured document defining objectives, non-goals, authority boundaries, and required evidence.
3. Contract + Evidence Bundle (C): Condition B plus a mandatory template for the work package (including residual risks and reviewer checklists).
Evaluation:
- Mechanical Scoring: Automated checks for hidden acceptance tests, mutation testing (injecting defects to verify new tests), and scope violations.
- Blinded Review: Three independent, model-based reviewers (Sonnet 4.6) scored every work package on a fixed rubric (192 total reviews) covering evidence sufficiency, ambiguity, scope compliance, and confidence. Reviewers were blinded to the condition labels.
- Analysis: Paired statistical analysis (Wilcoxon signed-rank, Cliff's delta) compared Baseline vs. Contract conditions.
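As a rough illustration of the analysis step, the sketch below computes Cliff's delta and counts how many paired comparisons improved, worsened, or tied. The scores are invented for the example and are not the study's data; a Wilcoxon signed-rank p-value could be obtained separately, for instance with `scipy.stats.wilcoxon`, which is omitted here to keep the example dependency-free.

```python
def cliffs_delta(treatment, baseline):
    """Cliff's delta: P(treatment > baseline) - P(treatment < baseline),
    estimated over every cross pair of observations."""
    greater = sum(1 for t in treatment for b in baseline if t > b)
    less = sum(1 for t in treatment for b in baseline if t < b)
    return (greater - less) / (len(treatment) * len(baseline))


def paired_direction_counts(baseline, treatment):
    """Count pairs where the treatment score is higher, lower, or equal."""
    improved = sum(1 for b, t in zip(baseline, treatment) if t > b)
    worsened = sum(1 for b, t in zip(baseline, treatment) if t < b)
    tied = len(baseline) - improved - worsened
    return improved, worsened, tied


# Hypothetical 5-point rubric scores for five task pairs (not real data).
baseline_scores = [3, 3, 4, 3, 4]
contract_scores = [4, 4, 4, 4, 5]

print(cliffs_delta(contract_scores, baseline_scores))             # prints 0.68
print(paired_direction_counts(baseline_scores, contract_scores))  # prints (4, 0, 1)
```

A delta near +1 means contract scores almost always exceed baseline scores, a delta near 0 means the two distributions overlap heavily, and a negative delta would mean the baseline scored higher.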

Key Results

Objective Outcomes Saturated: All 64 runs passed hidden acceptance checks and mutation tests, and no scope violations occurred in any condition. On small, well-specified tasks with capable models, explicit contracts did not improve the correctness of the code: the baseline already achieved full objective success, leaving no room for improvement.
Reviewability Improved Significantly:
- Evidence Sufficiency: Improved by +0.83 points on a 5-point scale. Scores rose in 22 of 30 paired comparisons and fell in none ($p < 0.0001$, Cliff's $\delta = 0.66$).
- Ambiguity: Decreased significantly ($p = 0.035$).
- Structured Elements: Under the contract conditions, structured elements appeared almost exclusively when demanded. For example, "Changed files listed with reasons" rose from 7% (Baseline) to 93% (Contract), and "Known limitations" rose from 0% to 80%. Crucially, "Residual risks" and "Reviewer checklists" appeared in 0% of Baseline runs but 100% of runs where the evidence bundle explicitly demanded them.
Cost Overhead: The explicit contract incurred measurable costs:
- Agent tokens increased by 13%.
- Wall-clock time increased by 38%.
- Tool invocations increased by 23%.
- Patch size increased by 44.7% (driven primarily by additional tests, not production code churn).
Model Tier Effects: The benefit of contracts was roughly twice as large for the weaker model tier (Haiku), suggesting contracts partially substitute for the reporting discipline that stronger models exhibit unprompted.
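The cost overhead figures above are relative increases over the baseline condition. A minimal sketch of that arithmetic follows, using invented token counts chosen only to reproduce a 13% increase; the summary does not say whether the paper averages per-task ratios or compares overall means, so the single-pair form here is an assumption.

```python
def relative_overhead(baseline: float, contract: float) -> float:
    """Percentage increase of the contract condition over the baseline."""
    return (contract - baseline) / baseline * 100.0


# Hypothetical mean token counts per run (not taken from the paper).
baseline_tokens = 20_000
contract_tokens = 22_600

print(f"{relative_overhead(baseline_tokens, contract_tokens):.1f}%")  # prints 13.0%
```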

Key Contributions

Experimental Harness: A reusable framework for delegation-contract studies, featuring a seeded task repository, paired prompts, hidden acceptance/mutation checks, and a blinded review pipeline.
Empirical Comparison: The first paired comparison of prompt-only versus contract-based delegation, yielding 64 runs and 192 blinded reviews.
Mechanism Validation: Evidence that delegation contracts function primarily as a reviewability mechanism rather than a correctness mechanism for small tasks.
Design Defaults: Concrete recommendations for agent harness builders, such as mandating "changed-files-with-reasons" lists and "known-limitations" sections, while acknowledging the measured overhead of roughly 13% in tokens and 38% in latency.
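One way a harness might enforce such defaults is to reject a work package that lacks the mandated sections. The sketch below assumes the agent returns a report with markdown-style headings; the section names, the report format, and the sample report are assumptions for illustration and not the paper's template.

```python
import re

# Hypothetical required sections, loosely following the recommended defaults.
REQUIRED_SECTIONS = ("Changed files", "Tests run", "Known limitations")


def missing_sections(work_package: str, required=REQUIRED_SECTIONS):
    """Return the required section names that have no matching heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+\s*(.+?)\s*$", work_package, flags=re.MULTILINE)
    }
    return [name for name in required if name.lower() not in headings]


# Invented example report that omits the limitations section.
report = """\
## Changed files
- src/validation.ts: reject negative quantities
## Tests run
- unit tests: all passing
"""

print(missing_sections(report))  # prints ['Known limitations']
```

A check like this only confirms that a section exists, not that its content is truthful or sufficient, so it complements reviewer judgment rather than replacing it.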

Significance and Claims
The paper concludes that for small, well-specified tasks handled by capable models, the marginal return of an explicit contract on outcome quality is zero because outcomes already saturate. However, the return on evidence is large and uniform. The author asserts that the delegation contract is a control layer that shapes the work package (the artifact presented for review) more than the work (the code itself).

The study posits that "evidence is demand-elastic": agents do not spontaneously provide the full context required for review (e.g., residual risks, checklists) unless explicitly contracted to do so. Because a software delegate is useful only if its work can be reviewed, the delegation contract is a practical instrument for securing that reviewability. The author notes that future iterations must test harder tasks, where baseline runs might fail or drift, to measure the contract's impact on authority control and error reduction.

Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work