Beyond Functional Correctness: Design Issues in AI… — Plain-Language Explanation

This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you've hired a super-fast, incredibly talented robot chef named Cursor. This robot doesn't just chop a single onion; it can cook an entire banquet, from the appetizers to the dessert, all on its own.

For a long time, we thought these AI chefs could only make a single perfect cookie (a code snippet). But recently, people started asking: "Can this robot actually build a whole restaurant kitchen, complete with plumbing, electrical wiring, and a menu, without us holding its hand every second?"

This paper is the story of a team of researchers who decided to test this robot chef by asking it to build 10 different large-scale software "restaurants" (like a social media app, an online store, or a learning tool). They wanted to see two things:

Did the food taste good? (Does the software actually work?)
Is the kitchen built to last? (Is the design messy, or is it a solid, maintainable structure?)

Here is what they found, broken down into simple concepts.

1. The Secret Sauce: The "FD-HITL" Framework

The researchers realized that if you just yell, "Make me a restaurant!" to the robot, it might panic or build a house instead. The robot needs a Project Manager.

They created a method called FD-HITL (Feature-Driven Human-In-The-Loop). Think of this as a Master Blueprint.

Instead of: "Build the whole thing now."
They did: "First, let's plan the menu. Okay, now build the kitchen plumbing. Great, now let's test the sink. Now build the stove. Test the stove."

By breaking the massive project into tiny, testable "features" and checking the robot's work at every step, they got amazing results.

2. The Good News: The Robot Can Cook!

When they used this "Blueprint" method, the robot (Cursor) was surprisingly successful.

The Scale: It built 10 massive projects, averaging about 17,000 lines of code each. That's like writing a short novel for every single app.
The Function: On average, about 91% of what each app was supposed to do actually worked! You could log in, post a photo, or buy an item, and it did what it was supposed to do.
The Verdict: Yes, AI IDEs can build large-scale software, but only if a human acts as the strict project manager, guiding them step-by-step.

3. The Bad News: The Kitchen is a Mess

Here is the twist. Just because the food tastes good doesn't mean the kitchen is built well. If you try to fix a leak in the sink later, you might have to tear down the whole wall because the pipes were installed haphazardly.

The researchers used two "Inspectors" (static analysis tools called CodeScene and SonarQube) to walk through the robot's kitchens. They found thousands of Design Issues.

The Top 5 "Messy Kitchen" Problems:

The "Copy-Paste" Disaster (Code Duplication):
- The Metaphor: The robot wrote the same recipe for "Spaghetti" three times in three different notebooks. If you want to change the sauce, you have to edit three different places. If you miss one, the dish tastes wrong.
- The Rule Broken: DRY (Don't Repeat Yourself).
The "Swiss Army Knife" Methods (Large/Complex Methods):
- The Metaphor: The robot created one giant function called DoEverything() that handles logging in, calculating taxes, printing receipts, and sending emails. It's a 200-line monster. It's impossible to understand, test, or fix without breaking something else.
- The Rule Broken: SRP (Single Responsibility Principle) – One job per function.
The "Labyrinth" (High Complexity):
- The Metaphor: The code is like a maze with 100 turns. To understand why the app crashed, a human has to trace a path through 15 different "if/else" doors. It's exhausting and confusing.
- The Rule Broken: KISS (Keep It Simple, Stupid).
The "Broken Rules" (Framework Violations):
- The Metaphor: The robot was handed a nail gun but swung it like a hammer. The nails went in, but not the way the tool was designed to be used. It followed the general idea of the technology but missed the specific best practices (like how to handle errors or validate data).
The "Inaccessible Door" (Accessibility Issues):
- The Metaphor: The robot built a beautiful door, but it's too high for a wheelchair user to reach, and there's no handle for someone with no fingers. The app works for the robot, but it's unusable for many real people.
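To make the first two problems concrete, here is a small Python sketch. It is not taken from the paper or its dataset: every function name and the tax rate are made up for illustration. It shows one rule copy-pasted into two functions, then the same behavior with the rule moved into a single shared helper.

```python
# Hypothetical example, not from the paper's dataset.

TAX_RATE = 0.08  # assumed value, for illustration only


# Before: the same tax rule is copy-pasted into two functions.
def cart_total_duplicated(prices):
    subtotal = sum(prices)
    return round(subtotal + subtotal * 0.08, 2)


def invoice_total_duplicated(prices, shipping):
    subtotal = sum(prices)
    return round(subtotal + subtotal * 0.08 + shipping, 2)


# After: one helper owns the tax rule (DRY), and each caller does one job (SRP).
def with_tax(subtotal, rate=TAX_RATE):
    return subtotal + subtotal * rate


def cart_total(prices):
    return round(with_tax(sum(prices)), 2)


def invoice_total(prices, shipping):
    return round(with_tax(sum(prices)) + shipping, 2)
```

If the tax rule changes, the "before" version needs two edits that must stay in sync, while the "after" version needs one.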

4. The Overlap: The "Critical" Mess

Interestingly, when both inspectors (CodeScene and SonarQube) looked at the same code, they agreed on only 133 specific issues. But guess what? These were the most dangerous ones. They were the "Critical Severity" problems, mostly related to how complex and messy the code was.

5. The Big Takeaway: The Robot is a Junior Intern, Not a Senior Architect

The paper concludes that AI IDEs are powerful, but they are not ready to replace senior engineers.

What they are good at: Writing the bricks, laying the mortar, and following instructions to build the walls fast.
What they are bad at: Understanding the big picture, ensuring the building won't collapse in 5 years, and following the subtle rules of architecture.

The Human Role:
Humans need to stop being "coders" and start being Architects and Managers.

Don't just say "Build it."
Say: "Here is the plan. Build this one small room. Check it. Now build the next room."
You must review the work constantly. If you let the robot run wild (a style called "Vibe Coding"), it will build a house that looks great from the outside but falls apart if you try to add a window later.

Summary

AI can now write massive amounts of code, but it tends to write messy, repetitive, and hard-to-maintain code. To use it effectively, you need a human in the loop to act as a strict project manager, breaking big tasks into small pieces and constantly checking the "blueprints." The robot is a fantastic worker, but it still needs a human boss to ensure the final product is a skyscraper, not a house of cards.

1. Problem Statement

The integration of Large Language Models (LLMs) into software development has evolved from generating code snippets (e.g., GitHub Copilot) to AI-powered IDEs with agentic capabilities (e.g., Cursor, Claude Code) that can generate entire projects within the development context. While previous research has evaluated the functional correctness of AI-generated code snippets or small-scale projects, there is a significant gap in understanding:

Scale: Can current AI IDEs generate large-scale software systems (industrial complexity, at least 8,000 Lines of Code) rather than simple prototypes?
Design Quality: Beyond "does it run?", what is the design quality of these large-scale systems? Specifically, do they exhibit structural flaws, code smells, or violations of design principles that threaten long-term maintainability and evolvability?

Existing literature often relies on "ad hoc" prompting (e.g., "Vibe Coding"), which lacks systematic decomposition, leading to limited success in complex scenarios.

2. Methodology

The authors conducted an empirical study using Cursor (a popular AI IDE) to generate 10 large-scale projects. The study followed a rigorous, multi-phase methodology:

A. The Feature-Driven Human-In-The-Loop (FD-HITL) Framework

To address the limitations of ad hoc prompting, the authors proposed a systematic framework based on Feature-Driven Development (FDD). This framework guides the AI through four phases:

Project Initialization: Defining the business context and selecting a technology stack collaboratively.
Requirements & Design: Generating requirements.md and tasklist.md files. The AI decomposes the project into independently testable features with explicit acceptance criteria.
Implementation: A feature-driven cycle where the AI implements backend/database tasks first, followed by frontend integration. The human provides continuous feedback, debugging, and validation at each step.
System-Wide Review: Final manual testing and system enhancement to ensure all requirements are met.
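The implementation phase described above can be pictured as a loop in which a human check gates every feature. The following is a minimal sketch under stated assumptions, not the authors' tooling: `Feature`, `run_fd_hitl`, the `implement` and `human_review` callbacks, and the `max_rounds` limit are all hypothetical names introduced here. `implement` stands in for prompting the AI IDE, and `human_review` stands in for manually testing a feature against its acceptance criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feature:
    name: str
    acceptance_criteria: List[str]
    done: bool = False


def run_fd_hitl(
    features: List[Feature],
    implement: Callable[[Feature, str], None],
    human_review: Callable[[Feature], bool],
    max_rounds: int = 3,  # assumed retry limit, not from the paper
) -> List[Tuple[str, Optional[int]]]:
    """Implement features one at a time; a human check gates each one."""
    log = []
    for feature in features:
        for attempt in range(1, max_rounds + 1):
            # Backend/database work comes first, then frontend integration.
            for layer in ("backend", "frontend"):
                implement(feature, layer)
            if human_review(feature):
                feature.done = True
                log.append((feature.name, attempt))
                break
        else:
            # Still failing after max_rounds: record it for manual attention.
            log.append((feature.name, None))
    return log
```

The point of the structure is that the next feature is not started until the current one has either passed review or been explicitly recorded as unresolved.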

B. Data Collection

Dataset: 10 projects across three domains: Web (4), Mobile (2), and Utility Tools (4).
Technologies: Diverse stacks including MERN (MongoDB, Express, React, Node), Spring Boot + React Native, Django, Vue.js, and WordPress.
Inclusion Criteria: Projects must have $\ge$ 8,000 Lines of Code (LoC), $\ge$ 3 technologies, complex architecture (e.g., client-server), and external dependencies.
Scale: The average project size was 16,965 LoC across 114 files.

C. Evaluation Process

Functional Correctness: Manual human evaluation against the generated requirements.md to verify executability and feature completion.
Design Issue Detection: Two static analysis tools were employed:
- CodeScene: For high-level design issues (complexity, duplication).
- SonarQube: For low-level code smells, security, and best-practice violations.
Filtering: Manual verification to remove 1,612 false positives from SonarQube (e.g., issues contradicting specific framework documentation like WordPress naming conventions).
Analysis: Quantitative counting and qualitative thematic analysis to categorize issues and map them to design principles (SRP, DRY, KISS, SoC).
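The filtering and overlap steps can be sketched in a few lines of Python. This is only an illustration: the `Issue` record, the helper names, and in particular the choice to match findings by (file, function) are assumptions made here, since this summary does not state how the authors matched issues across the two tools.

```python
from typing import Callable, Iterable, List, NamedTuple, Set, Tuple


class Issue(NamedTuple):
    tool: str       # e.g. "CodeScene" or "SonarQube"
    file: str
    function: str
    category: str


def drop_false_positives(
    issues: Iterable[Issue], is_false_positive: Callable[[Issue], bool]
) -> List[Issue]:
    """Keep only issues that a manual check did not reject."""
    return [issue for issue in issues if not is_false_positive(issue)]


def overlapping_locations(
    a: Iterable[Issue], b: Iterable[Issue]
) -> Set[Tuple[str, str]]:
    """Return (file, function) locations flagged by both tools.

    Matching on (file, function) is an assumption for this sketch.
    """
    locations_a = {(issue.file, issue.function) for issue in a}
    locations_b = {(issue.file, issue.function) for issue in b}
    return locations_a & locations_b
```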

3. Key Contributions

FD-HITL Framework: A novel, systematic workflow for generating large-scale AI projects, demonstrating that structured human oversight is essential for scaling AI generation.
DIinAGP Dataset: A curated public dataset containing 10 project descriptions, the corresponding Cursor-generated code (approx. 170k LoC), and the identified design issues.
Empirical Evidence on Design Quality: The first large-scale study quantifying the specific design flaws in AI IDE-generated systems, moving beyond snippet-level analysis.
Practitioner Guidelines: Actionable recommendations for developers and organizations adopting AI IDEs to mitigate design risks.

4. Key Results

A. Functional Correctness (RQ1)

Success Rate: Using the FD-HITL framework, Cursor achieved an average functional correctness of 91% across the 10 projects.
Capability: Cursor successfully generated complex, multi-file systems with diverse technology stacks, proving it can handle large-scale generation when guided systematically.
Limitations: The remaining 9% consisted of missing requirements or logical errors (e.g., broken post-update functionality), highlighting that human review is still necessary.

B. Design Issues (RQ2)

Despite high functional correctness, the projects contained significant design debt:

Volume:
- CodeScene: Identified 1,305 design issues (9 categories).
- SonarQube: Identified 3,193 valid design issues (11 categories) after filtering.
- Overlap: 133 issues were detected by both tools, all rated as Critical severity.
Top Design Issues:
1. Code Duplication (28.4% of CodeScene issues): Violation of the DRY principle. Common in frontend logic (e.g., repetitive quiz generation methods).
2. Code Complexity:
  - Complex Methods (27.9%): High cyclomatic complexity (mean ~17), often violating SRP.
  - Large Methods (12.6%): Methods exceeding 100 LoC (mean 171 LoC), making testing difficult.
  - Overall Code Complexity: Files with excessive nesting and logic.
3. Framework Best-Practice Violations (35.3% of SonarQube issues): E.g., missing React PropTypes, using System.out instead of loggers, or improper dependency injection.
4. Exception Handling (10.4%): Generic exception catching or empty catch blocks, violating the Fail Fast principle.
5. Accessibility Issues (6.1%): Missing ARIA labels, non-native interactive elements, and lack of keyboard navigation support.
Design Principle Violations: The issues frequently violated Single Responsibility Principle (SRP), Separation of Concerns (SoC), Don't Repeat Yourself (DRY), and Keep It Simple, Stupid (KISS).
Technology Specifics: 59% of SonarQube issues were technology-specific. The majority (64%) were related to the JavaScript/React/Node.js ecosystem (e.g., missing prop validation, improper global variable usage).
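The exception-handling issue in the list above is easy to show in miniature. The sketch below is a Python analogue written for this explanation, not code from the studied projects (which used stacks such as JavaScript and Java); both function names are hypothetical. The first function swallows every error, and the second catches only the expected error and fails fast with context.

```python
import json


# Anti-pattern: a broad catch that hides the failure.
def load_settings_silent(text):
    try:
        return json.loads(text)
    except Exception:
        # The caller cannot tell "empty settings" from "broken settings".
        return {}


# Fail-fast alternative: catch only the expected error and re-raise with context.
def load_settings(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"settings are not valid JSON: {err.msg}") from err
```

With the silent version, a corrupted settings file looks like an empty one and the bug surfaces much later. With the fail-fast version, the error appears at the point where the bad input was read.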

5. Significance and Implications

For Practitioners

Shift in Human Effort: Developers should focus on high-level tasks (requirements engineering, architectural decomposition, feature definition) and delegate low-level implementation to AI.
Systematic Workflow: "Vibe Coding" (ad hoc prompting) is insufficient for large projects. A structured, feature-driven approach with iterative testing is required.
Quality Assurance: Functional correctness is not enough. Teams must integrate static analysis and accessibility checks into the AI generation pipeline to catch design debt early.
Backend-First Strategy: Generating and validating backend/database logic before frontend integration helps isolate logical errors.

For Researchers

Beyond Snippets: Future research must focus on end-to-end project generation and system-level design quality, not just code snippet correctness.
Tooling: There is a need for AI-assisted design review tools that can detect architectural flaws and suggest refactoring specifically for AI-generated code.
Behavioral Studies: Understanding how developers interact with AI IDEs (e.g., over-reliance, lack of code review) is crucial for improving tool design.

Conclusion

The paper concludes that while AI IDEs like Cursor can generate functional, large-scale software systems when guided by a systematic framework (FD-HITL), the resulting code often suffers from significant design flaws. These flaws pose long-term risks to maintainability and evolvability. Therefore, AI IDEs should be viewed as powerful co-pilots that accelerate development velocity but cannot yet replace the engineering judgment required for high-level design, architecture, and rigorous code review.

Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects