Class Model Generation from Requirements using Large Language Models

This paper evaluates the capability of state-of-the-art Large Language Models to automatically generate UML class diagrams from natural language requirements, demonstrating their effectiveness and reliability through a comprehensive dual-validation framework that combines LLM-as-a-Judge assessments with human expert evaluations.

Jackson Nguyen, Rui En Koe, Fanyu Wang, Chetan Arora, Alessio Ferrari

Published Wed, 11 Ma

Imagine you are the architect of a massive, futuristic city. Before you can build a single brick, you need a blueprint. In the software world, these blueprints are called UML Class Diagrams. They are complex drawings that show how different parts of a computer program fit together, talk to each other, and what rules they follow.

Traditionally, creating these blueprints is like hand-drawing a map of a galaxy: it takes a long time, requires a PhD in "space-geography" (software engineering), and a tiny mistake can throw the whole map off course.

This paper asks a simple but revolutionary question: Can a super-smart AI (a Large Language Model) look at a messy, written description of the city and instantly draw the perfect blueprint for us?

Here is the story of how they tested this, explained in everyday terms.

1. The Cast of Characters

The researchers gathered four "AI Architects" to compete:

  • GPT-5: The veteran master builder.
  • Claude Sonnet 4.0: The meticulous, detail-oriented planner.
  • Gemini 2.5: The fast-thinking creative.
  • Llama-3.1: The open-source, community-built apprentice.

They gave these AIs eight different "job descriptions" (requirements) ranging from a Recycling System to a Medical Pacemaker. The AIs had to read the text and output a blueprint.
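To make "output a blueprint" concrete: a UML class diagram can be written as plain text, for example in PlantUML notation, which is a common target format for this kind of task. Here is a tiny, invented sketch (the class names and the `to_plantuml` helper are illustrative, not the paper's actual pipeline or prompts) of what a recycling-system blueprint might look like once an AI has extracted classes and relationships from the requirements:

```python
# Toy illustration: rendering a small class model as PlantUML text.
# The classes and relations below are invented for a recycling-system example.

def to_plantuml(classes, relations):
    """Render a tiny class model as PlantUML source text."""
    lines = ["@startuml"]
    for name, attrs in classes.items():
        lines.append(f"class {name} {{")
        lines.extend(f"  {attr}" for attr in attrs)
        lines.append("}")
    # Map relationship kinds to PlantUML arrow syntax.
    arrows = {"association": "--", "aggregation": "o--", "inheritance": "<|--"}
    for src, kind, dst in relations:
        lines.append(f"{src} {arrows[kind]} {dst}")
    lines.append("@enduml")
    return "\n".join(lines)

model = to_plantuml(
    classes={
        "RecyclingMachine": ["location: str", "capacity: int"],
        "Deposit": ["weight: float", "material: str"],
    },
    relations=[("RecyclingMachine", "association", "Deposit")],
)
print(model)
```

The point is that a "drawing" here is really structured text: the AIs read the requirements and emit something like this, which a tool then renders as a diagram.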

2. The "Judge" Problem

Here's the tricky part: In the real world, there is no "Answer Key." We don't have a perfect blueprint to compare the AI's work against. So, how do you know if the AI did a good job?

The researchers invented a clever solution: The AI Judge.

They brought in two other AIs (named Grok and Mistral) to act as the "Inspectors."

  • The Setup: The four Architect AIs drew their blueprints.
  • The Inspection: The two Inspector AIs looked at the drawings side-by-side. They didn't just say "Good" or "Bad." They acted like strict art critics, grading the drawings on five specific things:
    1. Completeness: Did they forget any buildings?
    2. Correctness: Do the roads connect logically?
    3. Standards: Is the drawing using the right symbols?
    4. Clarity: Can a normal person understand it?
    5. Vocabulary: Did they use the same words as the job description?

3. The Results: Who Won?

The competition was fierce, but GPT-5 (the veteran) consistently drew the best blueprints. It was like the master architect who never missed a detail.

  • Claude came in a strong second.
  • Gemini was okay but made some weird connections.
  • Llama struggled the most, often drawing maps that didn't make sense.

The Big Surprise: The two "Inspector" AIs (Grok and Mistral) agreed with each other almost perfectly! They were like two senior inspectors walking a construction site and nodding in unison. This proved that AI can actually grade AI work reliably.
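"Agreed almost perfectly" can be quantified with a chance-corrected agreement statistic such as Cohen's kappa, a standard choice for two raters (the paper's exact statistic and numbers are not reproduced here; the ratings below are invented to show how the measurement works):

```python
# Cohen's kappa: how often two raters agree, corrected for the agreement
# you'd expect by chance. Ratings below are invented example data.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement: chance overlap given each rater's score frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

grok   = [5, 4, 5, 3, 4, 5, 2, 4]  # hypothetical scores from judge A
mistral = [5, 4, 4, 3, 4, 5, 2, 4]  # hypothetical scores from judge B
print(round(cohens_kappa(grok, mistral), 2))  # 0.82
```

A kappa near 1.0 means the two judges agree far more than chance would predict, which is the sense in which two AI inspectors "nodding in unison" becomes a measurable claim.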

4. The Human Reality Check

To be absolutely sure, the researchers brought in two real human experts (actual software architects) to grade the best AI's work. They wanted to see if the AI Inspectors were hallucinating or if they were actually seeing what humans saw.

The Verdict: The AI Inspectors and the Human Experts were on the same page.

  • When the humans said, "This blueprint is clear and accurate," the AI said, "Yes, 5 out of 5 stars."
  • When the humans said, "This part is confusing," the AI agreed.

The only time they disagreed slightly was on "subjective" things, like how "pretty" or "easy to read" a diagram was. But for the hard facts (logic, structure, rules), the AI was just as sharp as the human.

5. Why This Matters (The Analogy)

Think of this like a cooking competition.

  • Before: You had to hire a famous chef to taste every dish and tell you if it was good. It was slow and expensive.
  • Now: You have a "Super-Taster Robot" that can taste a dish, compare it to a recipe, and tell you exactly what's missing (too much salt, undercooked meat) with the same accuracy as a human chef.

The Takeaway

This paper proves that we are entering a new era where:

  1. AI can build: It can turn messy text requirements into structured software blueprints.
  2. AI can check: It can review those blueprints and tell us if they are good, without needing a human to do the heavy lifting first.

The Catch: While the AI is great at the basics, complex, weird, or highly specialized jobs (like the "Pacemaker" example) still need a human expert to give the final "thumbs up."

In short: AI is no longer just a tool that writes code; it's becoming a collaborator that can design, build, and even grade its own work, saving us hours of tedious drawing and checking.