Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

This paper evaluates the capability of the AI-powered IDE Cursor to generate large-scale software projects using a Feature-Driven Human-In-The-Loop framework, finding that while the tool achieves high functional correctness, the resulting systems frequently exhibit significant design issues—such as code duplication and complexity violations—that threaten long-term maintainability and require experienced developer review.

Original authors: Syed Mohammad Kashif, Ruiyin Li, Peng Liang, Amjed Tahir, Qiong Feng, Zengyang Li, Mojtaba Shahin

Published 2026-04-09 (author reviewed)

This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you've hired a super-fast, incredibly talented robot chef named Cursor. This robot doesn't just chop a single onion; it can cook an entire banquet, from the appetizers to the dessert, all on its own.

For a long time, we thought these AI chefs could only make a single perfect cookie (a code snippet). But recently, people started asking: "Can this robot actually build a whole restaurant kitchen, complete with plumbing, electrical wiring, and a menu, without us holding its hand every second?"

This paper is the story of a team of researchers who decided to test this robot chef by asking it to build 10 different large-scale software "restaurants" (like a social media app, an online store, or a learning tool). They wanted to see two things:

  1. Did the food taste good? (Does the software actually work?)
  2. Is the kitchen built to last? (Is the design messy, or is it a solid, maintainable structure?)

Here is what they found, broken down into simple concepts.

1. The Secret Sauce: The "FD-HITL" Framework

The researchers realized that if you just yell, "Make me a restaurant!" to the robot, it might panic or build a house instead. The robot needs a Project Manager.

They created a method called FD-HITL (Feature-Driven Human-In-The-Loop). Think of this as a Master Blueprint.

  • Instead of: "Build the whole thing now."
  • They did: "First, let's plan the menu. Okay, now build the kitchen plumbing. Great, now let's test the sink. Now build the stove. Test the stove."

By breaking the massive project into tiny, testable "features" and checking the robot's work at every step, they got amazing results.
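
For readers who think in code, the build-one-piece-then-check-it loop can be sketched roughly like this. It is only an illustration, not the paper's actual tooling: `Feature`, `run_fd_hitl`, `fake_generate`, and `fake_review` are hypothetical names, and the two "fake" functions stand in for the AI IDE and the human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feature:
    """One small, testable slice of the project (hypothetical structure)."""
    name: str
    accepted: bool = False
    attempts: int = 0


def run_fd_hitl(
    features: List[Feature],
    generate: Callable[[Feature], str],
    review: Callable[[Feature, str], bool],
    max_attempts: int = 3,
) -> List[Feature]:
    """Build features one at a time; a review step gates each one."""
    for feature in features:
        while feature.attempts < max_attempts and not feature.accepted:
            feature.attempts += 1
            code = generate(feature)  # stand-in for prompting the AI IDE
            feature.accepted = review(feature, code)  # stand-in for human testing
        if not feature.accepted:
            break  # stop and rethink the plan instead of piling on more code
    return features


# Stand-ins so the sketch runs: the "AI" fails its first attempt at the stove.
def fake_generate(feature: Feature) -> str:
    return f"# code for {feature.name}, attempt {feature.attempts}"


def fake_review(feature: Feature, code: str) -> bool:
    return not (feature.name == "stove" and feature.attempts == 1)


plan = [Feature("plumbing"), Feature("sink"), Feature("stove")]
result = run_fd_hitl(plan, fake_generate, fake_review)
for f in result:
    print(f.name, f.accepted, f.attempts)
```

The point of the design is that the next feature never starts until the current one has passed review, so mistakes are caught while they are still small.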

2. The Good News: The Robot Can Cook!

When they used this "Blueprint" method, the robot (Cursor) was surprisingly successful.

  • The Scale: It built 10 massive projects, averaging about 17,000 lines of code each. That's like writing a short novel for every single app.
  • The Function: About 91% of the time, the apps actually worked! You could log in, post a photo, or buy an item, and it did what it was supposed to do.
  • The Verdict: Yes, AI IDEs can build large-scale software, but only if a human acts as the strict project manager, guiding them step-by-step.

3. The Bad News: The Kitchen is a Mess

Here is the twist. Just because the food tastes good doesn't mean the kitchen is built well. If you try to fix a leak in the sink later, you might have to tear down the whole wall because the pipes were installed haphazardly.

The researchers used two "Inspectors" (static analysis tools called CodeScene and SonarQube) to walk through the robot's kitchens. They found thousands of Design Issues.
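
To give a feel for what an "inspector" does, here is a toy version built on Python's standard `ast` module. It reads code without running it and flags functions that branch too much or run too long. This is a rough sketch under simple assumptions: the thresholds and the `inspect_functions` name are made up for illustration, and real tools such as CodeScene and SonarQube apply far richer rules than this.

```python
import ast

# Node types counted as "branches" in this toy metric (an assumption, not a standard).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def inspect_functions(source: str, max_branches: int = 3, max_lines: int = 20):
    """Return (function name, problem) pairs for functions over the thresholds."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
        length = node.end_lineno - node.lineno + 1
        if branches > max_branches:
            findings.append((node.name, f"too many branches ({branches})"))
        if length > max_lines:
            findings.append((node.name, f"too long ({length} lines)"))
    return findings


SAMPLE = '''
def tidy(x):
    return x + 1

def tangled(order):
    if order.paid:
        if order.shipped:
            for item in order.items:
                if item.fragile and item.heavy:
                    return "special"
    return "normal"
'''

findings = inspect_functions(SAMPLE)
print(findings)
```

The sample code is never executed, only parsed, which is what "static" analysis means: the inspector judges the blueprint, not the running kitchen.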

The Top 5 "Messy Kitchen" Problems:

  1. The "Copy-Paste" Disaster (Code Duplication):

    • The Metaphor: The robot wrote the same recipe for "Spaghetti" three times in three different notebooks. If you want to change the sauce, you have to edit three different places. If you miss one, the dish tastes wrong.
    • The Rule Broken: DRY (Don't Repeat Yourself).
  2. The "Swiss Army Knife" Methods (Large/Complex Methods):

    • The Metaphor: The robot created one giant function called DoEverything() that handles logging in, calculating taxes, printing receipts, and sending emails. It's a 200-line monster. It's impossible to understand, test, or fix without breaking something else.
    • The Rule Broken: SRP (Single Responsibility Principle) – One job per function.
  3. The "Labyrinth" (High Complexity):

    • The Metaphor: The code is like a maze with 100 turns. To understand why the app crashed, a human has to trace a path through 15 different "if/else" doors. It's exhausting and confusing.
    • The Rule Broken: KISS (Keep It Simple, Stupid).
  4. The "Broken Rules" (Framework Violations):

    • The Metaphor: The robot was handed a nail gun but swung it like a hammer. It followed the general idea of the technology but missed the framework's specific best practices (like how to handle errors or validate data).
  5. The "Inaccessible Door" (Accessibility Issues):

    • The Metaphor: The robot built a beautiful door, but the handle is too high for a wheelchair user to reach and impossible to grip for someone with limited hand mobility. The app works on paper, but it's unusable for many real people.
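
The "Copy-Paste" problem from the list above is the easiest one to show in code. The sketch below is a made-up example, not code from the projects in the paper: the function names and the 8% tax rate are hypothetical. It shows the same pricing rule pasted into two functions, then the DRY fix where one small helper owns the rule.

```python
# Before: the same price rule is pasted into two places (a DRY violation).
def cart_total_before(prices):
    total = 0.0
    for p in prices:
        total += p * 1.08  # tax rate copied here...
    return round(total, 2)


def invoice_total_before(prices):
    total = 0.0
    for p in prices:
        total += p * 1.08  # ...and pasted again here
    return round(total, 2)


# After: one helper owns the rule, so a tax change is a one-line edit.
TAX_RATE = 0.08  # hypothetical rate, for illustration only


def with_tax(price: float) -> float:
    """Single job: apply the tax rule to one price."""
    return price * (1 + TAX_RATE)


def total_with_tax(prices) -> float:
    """Single job: add up taxed prices for any caller (cart, invoice, ...)."""
    return round(sum(with_tax(p) for p in prices), 2)


prices = [10.0, 20.0]
print(cart_total_before(prices), invoice_total_before(prices), total_with_tax(prices))
```

In the "before" version, changing the tax rate means finding every pasted copy. In the "after" version, both the cart and the invoice call the same helper, which also keeps each function down to one job.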

4. The Overlap: The "Critical" Mess

Interestingly, when both inspectors (CodeScene and SonarQube) looked at the same code, they agreed on only about 133 specific issues. But guess what? These were the most dangerous ones: "Critical Severity" problems, mostly related to how complex and messy the code was.
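
One simple way to picture "agreement" between two inspectors is as the overlap of two sets of findings. The sketch below uses invented findings, not data from the paper, and it assumes two reports count as the same issue when the file, function, and issue type all match. How the researchers actually matched issues is not described in this summary.

```python
# Hypothetical findings: (file, function, issue type). Not data from the paper.
codescene = {
    ("cart.py", "checkout", "complex method"),
    ("cart.py", "apply_coupon", "code duplication"),
    ("user.py", "register", "large method"),
}
sonarqube = {
    ("cart.py", "checkout", "complex method"),
    ("user.py", "register", "large method"),
    ("views.py", "render_form", "accessibility"),
}

agreed = codescene & sonarqube          # flagged by both inspectors
only_codescene = codescene - sonarqube  # flagged by CodeScene alone
only_sonarqube = sonarqube - codescene  # flagged by SonarQube alone

print(len(agreed), len(only_codescene), len(only_sonarqube))
```

A small overlap like this does not mean either tool is wrong. Each one checks for different things, so the issues both of them flag are the ones worth looking at first.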

5. The Big Takeaway: The Robot is a Junior Intern, Not a Senior Architect

The paper concludes that AI IDEs are powerful, but they are not ready to replace senior engineers.

  • What they are good at: Writing the bricks, laying the mortar, and following instructions to build the walls fast.
  • What they are bad at: Understanding the big picture, ensuring the building won't collapse in 5 years, and following the subtle rules of architecture.

The Human Role:
Humans need to stop being "coders" and start being Architects and Managers.

  • Don't just say "Build it."
  • Say: "Here is the plan. Build this one small room. Check it. Now build the next room."
  • You must review the work constantly. If you let the robot run wild (a style called "Vibe Coding"), it will build a house that looks great from the outside but falls apart if you try to add a window later.

Summary

AI can now write massive amounts of code, but it tends to write messy, repetitive, and hard-to-maintain code. To use it effectively, you need a human in the loop to act as a strict project manager, breaking big tasks into small pieces and constantly checking the "blueprints." The robot is a fantastic worker, but it still needs a human boss to ensure the final product is a skyscraper, not a house of cards.
