Imagine you are the manager of a massive, bustling restaurant kitchen. This kitchen is built on a microservices architecture. Instead of one giant chef trying to cook the entire meal, you have dozens of specialized stations: one for chopping vegetables, one for grilling steaks, one for making sauces, and one for plating. Each station is independent, but they all need to talk to each other perfectly to get the meal to the customer.
Now, imagine you hire a team of AI Chefs (AI Agents) to help you. You want to know: Can these AI chefs actually build a new station from scratch, or add a new station to an existing kitchen without ruining the whole meal?
This paper is a "taste test" to see how good these AI Chefs are at the job. Here is the breakdown in simple terms:
1. The Two Ways to Ask for Help
The researchers tested the AI in two different scenarios, like giving it two different kinds of instructions:
Scenario A: "The Renovation" (Incremental Generation)
- The Setup: The kitchen already exists. The AI is told, "Go to the 'Sauce Station,' delete the current chef, and hire a new one. But keep the rest of the kitchen exactly as it is."
- The Challenge: The new AI chef must fit perfectly into the existing workflow. If they use a different type of knife or a different recipe format, the whole kitchen breaks.
- The Result: Surprisingly, the AI did better when given minimal instructions (just "Make a sauce station") than when given a giant, detailed manual. When the researchers gave it too much text to read, the AI got confused and started ignoring the existing kitchen rules. When left to "explore" the kitchen on its own, it figured out the rhythm better.
- Success Rate: About 50% to 76% of the time, the new station worked perfectly with the old one.
Scenario B: "The Greenfield Project" (Clean State Generation)
- The Setup: You are building a brand new restaurant in an empty field. You give the AI a list of requirements ("We need a sauce station that handles 500 orders an hour") but no existing kitchen to look at.
- The Challenge: The AI has to invent the whole structure from scratch.
- The Result: This went much better. The AI created stations that worked 81% to 98% of the time. Why? Because there was no "old kitchen" to clash with. The AI could build whatever it wanted, as long as it met the listed requirements.
- The Catch: The code was sometimes "weird" (different file names or structures), but since it was a new building, it didn't matter as long as the doors opened and the lights turned on.
2. The Three AI Chefs (The Agents)
The researchers tested three different AI "chefs":
- Codex (The Veteran): Very smart, but sometimes takes a long time to think (up to 1.7 hours for one task!) and can be expensive.
- Claude Code (The Precise One): Fast, very accurate, but the most expensive per meal. It writes very concise code (short recipes).
- Code Qwen (The Budget Option): The fastest and cheapest. It's great, but sometimes it gets stuck in a loop if the instructions aren't clear.
3. The "Secret Sauce" (What They Found)
- Less is More (Sometimes): In the "Renovation" scenario, giving the AI a huge, detailed summary of the existing code actually made it worse. It was like telling a chef, "Here is a 50-page manual on how we used to do things," and the chef got so focused on the manual they forgot to look at the actual kitchen. A simple prompt ("Just build the station") let the AI explore and adapt better.
- Simpler is Better: The code the AI wrote was actually less complex than code written by humans. It wasn't "dumber," just leaner. It didn't over-complicate things.
- The "Memorization" Problem: The AI performed much better on famous, open-source projects (like a popular recipe book everyone knows) than on private, student projects. This suggests the AI might be "memorizing" answers it saw during its training rather than truly "thinking" through new problems.
- Cost vs. Speed:
  - Fastest: Code Qwen (7.6 mins) and Claude (7.8 mins).
  - Slowest: Codex (16.6 mins, sometimes taking over an hour!).
  - Cheapest: Code Qwen ($3 per service).
  - Most Expensive: Claude ($13 per service).
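To see the trade-off at a glance, the figures above can be laid side by side. This is only a rough sketch: the dictionary layout and the helper names `fastest` and `cheapest` are made up for illustration and do not come from the paper, and the Codex cost is left as `None` because this summary does not state it.

```python
# Figures as quoted in this summary. None marks a value the summary does not give.
# The structure and helper names below are illustrative assumptions.
AGENTS = {
    "Code Qwen": {"minutes": 7.6, "usd_per_service": 3.0},
    "Claude Code": {"minutes": 7.8, "usd_per_service": 13.0},
    "Codex": {"minutes": 16.6, "usd_per_service": None},
}


def fastest(agents):
    """Return the agent with the lowest reported time per service."""
    return min(agents, key=lambda name: agents[name]["minutes"])


def cheapest(agents):
    """Return the agent with the lowest reported cost, skipping unknown costs."""
    known = {
        name: stats
        for name, stats in agents.items()
        if stats["usd_per_service"] is not None
    }
    return min(known, key=lambda name: known[name]["usd_per_service"])


if __name__ == "__main__":
    # Print the agents from fastest to slowest.
    for name, stats in sorted(AGENTS.items(), key=lambda kv: kv[1]["minutes"]):
        cost = stats["usd_per_service"]
        cost_text = f"${cost:.0f}" if cost is not None else "not stated"
        print(f"{name:<12} {stats['minutes']:>5.1f} min  {cost_text}")
```

Both helpers pick Code Qwen on these numbers, which matches the "fastest and cheapest" label above. Claude is nearly as fast but costs roughly four times as much per service.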
4. The Verdict: Are We There Yet?
Can AI agents generate microservices? Yes.
Can they do it fully on their own without a human watching? Not yet.
Think of the AI as a brilliant but inexperienced apprentice.
- They can cook a great dish if you give them a clean kitchen and a simple list of ingredients.
- They can also cook a decent dish in an existing kitchen if you let them look around first.
- However, they sometimes miss the tiny details (like a specific type of connector or a hidden rule) that a human chef would catch. If you let them run the whole kitchen alone, they might accidentally delete the database or break the connection between the grill and the oven.
The Big Takeaway
AI is ready to be a super-powered assistant for software architects. It can do the heavy lifting, write the code, and even make it simpler than humans do. But we still need a Human Architect in the loop to double-check the work, ensure the "stations" talk to each other correctly, and make sure the AI didn't get stuck in a loop or memorize the wrong recipe.
We are not at the point of "Press Button, Get Perfect Software," but we are definitely at the point of "Press Button, Get 80% of the work done, then fix the last 20%."