Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

This controlled pilot study demonstrates that while explicit software delegation contracts do not improve the objective correctness of AI coding agents' outputs, they significantly enhance the reviewability of their work by increasing evidence sufficiency and reducing ambiguity, albeit at the cost of increased token usage and execution time.

Original authors: Vincent Schmalbach

Published 2026-06-17
📖 5 min read🧠 Deep dive

Original authors: Vincent Schmalbach

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper. Read full disclaimer

Imagine you are a manager who hires a very smart, but sometimes chatty, intern to fix a leaky faucet in your house.

In the past, you might just say, "Hey, fix that leak." The intern goes off, does the work, and comes back with a wrench and a dry floor. You check the floor, it's dry, and you say, "Great job!"

But what if the intern is an AI? And what if, instead of just fixing the leak, the intern is also allowed to rearrange your furniture, repaint the walls, or even move the sink, as long as they tell you they did it?

This paper asks a simple question: If you give the intern a strict, written "contract" telling them exactly what they are allowed to do and exactly what proof they must bring back, does that make it easier for you to check their work?

The author, Vincent Schmalbach, ran a small experiment to find out. Here is what happened, explained simply.

The Setup: The "Toy" House

The researcher built a tiny, fake software project (a small website) with a few intentional "bugs" (leaks). He created 10 different tasks, like "fix the login button" or "update the instructions."

He then sent these tasks to two different AI "interns" (one very smart, one faster but slightly less smart) under three different rules:

  1. The Casual Ask: Just a normal message saying, "Fix this bug." (Like telling a friend to "fix the sink.")
  2. The Contract: A formal document saying, "You can only touch these two files. You cannot touch the database. When you are done, you must list every file you changed and explain why."
  3. The Contract + Evidence Checklist: The same contract, but with a mandatory checklist the AI had to fill out, including a section on "What could still go wrong" and a "Reviewer Checklist."

The Results: The "Dry Floor" vs. The "Report"

The study measured two things: Did the work actually get done? and Was it easy to review?

1. The Work Was Already Perfect (The "Dry Floor")
Surprisingly, it didn't matter which rule the AI followed. Whether they got a casual request or a strict contract, every single AI fixed the bugs perfectly.

  • The "leak" was fixed.
  • The AI didn't break anything else.
  • The AI didn't touch files they weren't supposed to.

Why? Because the tasks were small and the AI was smart enough to figure them out anyway. The "contract" didn't make the AI better at fixing the code. The work was already 100% correct.

2. The Review Became Much Easier (The "Report")
This is where the magic happened. Even though the code was perfect in both cases, the Contract made the AI's report much easier for a human to read and trust.

  • Without the contract: The AI would fix the bug and say, "Done." It rarely explained which files it changed or why. It was like the intern fixing the sink but leaving the tools scattered everywhere without a note.
  • With the contract: The AI provided a neat package. It listed every file changed, explained the reasoning, listed the tests it ran, and even admitted, "Here is a small risk that might still exist."

The Analogy:
Imagine the AI is a chef.

  • No Contract: The chef brings you a perfect steak. You eat it, and it's delicious. But you have no idea if they used fresh ingredients or if they washed their hands. You have to guess.
  • With Contract: The chef brings you the same perfect steak, but also brings a receipt showing the ingredients, a photo of the kitchen, and a note saying, "I cooked this for 4 minutes, but if you like it rare, you might want to cook it longer."
  • The Result: The steak tasted the same, but the second version was much easier to trust and approve.

The Cost: It Takes a Little Longer

The only downside was speed and "cost."

  • The AI used about 13% more "brain power" (tokens) to write the report.
  • It took about 38% longer to finish the task.

Think of it like paying for express shipping with a detailed tracking number. The package arrives at the same time (or slightly later), but you know exactly where it is and what's inside.

The Big Takeaway

The paper concludes that for small, clear tasks, you don't need a contract to get the work done, but you DO need a contract to get a good review.

  • Correctness: The AI is already good enough to fix small bugs on its own.
  • Reviewability: The AI is not good at explaining itself unless you explicitly ask it to.

The "contract" acts like a translator. It doesn't make the AI smarter; it just forces the AI to speak in a language that humans (or other AIs) can easily understand and verify.

A Note on the "Weaker" Intern

The study found that the "weaker" AI (the faster, cheaper one) benefited the most from the contract. The smarter AI naturally wrote good reports on its own, but the weaker one needed the contract to be forced to write them. This suggests that if you are using cheaper AI tools, you must use strict contracts to get reliable results.

Summary

  • Did the contract make the code better? No. The code was already perfect.
  • Did the contract make the report better? Yes, huge difference.
  • Did it cost extra? Yes, a little time and money.
  • Verdict: If you want to trust AI work, don't just ask for the fix. Ask for the contract that forces the AI to show its homework.

Drowning in papers in your field?

Get daily digests of the most novel papers matching your research keywords — with technical summaries, in your language.

Try Digest →