Class Model Generation from Requirements using Large Language Models

This paper evaluates the capability of state-of-the-art Large Language Models to automatically generate UML class diagrams from natural language requirements, demonstrating their effectiveness and reliability through a comprehensive dual-validation framework that combines LLM-as-a-Judge assessments with human expert evaluations.

Jackson Nguyen, Rui En Koe, Fanyu Wang, Chetan Arora, Alessio Ferrari

Published Wed, 11 Ma

Imagine you are the architect of a massive, futuristic city. Before you can build a single brick, you need a blueprint. In the software world, these blueprints are called UML Class Diagrams. They are complex drawings that show how different parts of a computer program fit together, talk to each other, and what rules they follow.

Traditionally, creating these blueprints is like hand-drawing a map of a galaxy: it takes a long time, requires a PhD in "space-geography" (software engineering), and a tiny mistake can throw the whole map off course.

This paper asks a simple but revolutionary question: Can a super-smart AI (a Large Language Model) look at a messy, written description of the city and instantly draw the perfect blueprint for us?

Here is the story of how they tested this, explained in everyday terms.

1. The Cast of Characters

The researchers gathered four "AI Architects" to compete:

  • GPT-5: The veteran master builder.
  • Claude Sonnet 4.0: The meticulous, detail-oriented planner.
  • Gemini 2.5: The fast-thinking creative.
  • Llama-3.1: The open-source, community-built apprentice.

They gave these AIs eight different "job descriptions" (requirements) ranging from a Recycling System to a Medical Pacemaker. The AIs had to read the text and output a blueprint.
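To make "output a blueprint" concrete: a UML class diagram can be written as plain text, for example in PlantUML notation, which is a common target format for this kind of task. Here is a tiny, invented sketch (the class names and the `to_plantuml` helper are illustrative, not the paper's actual pipeline or prompts) of what a recycling-system blueprint might look like once an AI has extracted classes and relationships from the requirements:

```python
# Toy illustration: rendering a small class model as PlantUML text.
# The classes and relations below are invented for a recycling-system example.

def to_plantuml(classes, relations):
    """Render a tiny class model as PlantUML source text."""
    lines = ["@startuml"]
    for name, attrs in classes.items():
        lines.append(f"class {name} {{")
        lines.extend(f"  {attr}" for attr in attrs)
        lines.append("}")
    # Map relationship kinds to PlantUML arrow syntax.
    arrows = {"association": "--", "aggregation": "o--", "inheritance": "<|--"}
    for src, kind, dst in relations:
        lines.append(f"{src} {arrows[kind]} {dst}")
    lines.append("@enduml")
    return "\n".join(lines)

model = to_plantuml(
    classes={
        "RecyclingMachine": ["location: str", "capacity: int"],
        "Deposit": ["weight: float", "material: str"],
    },
    relations=[("RecyclingMachine", "association", "Deposit")],
)
print(model)
```

The point is that a "drawing" here is really structured text: the AIs read the requirements and emit something like this, which a tool then renders as a diagram.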

2. The "Judge" Problem

Here's the tricky part: In the real world, there is no "Answer Key." We don't have a perfect blueprint to compare the AI's work against. So, how do you know if the AI did a good job?

The researchers invented a clever solution: The AI Judge.

They brought in two other AIs (named Grok and Mistral) to act as the "Inspectors."

  • The Setup: The four Architect AIs drew their blueprints.
  • The Inspection: The two Inspector AIs looked at the drawings side-by-side. They didn't just say "Good" or "Bad." They acted like strict art critics, grading the drawings on five specific things:
    1. Completeness: Did they forget any buildings?
    2. Correctness: Do the roads connect logically?
    3. Standards: Is the drawing using the right symbols?
    4. Clarity: Can a normal person understand it?
    5. Vocabulary: Did they use the same words as the job description?

3. The Results: Who Won?

The competition was fierce, but GPT-5 (the veteran) consistently drew the best blueprints. It was like the master architect who never missed a detail.

  • Claude came in a strong second.
  • Gemini was okay but made some weird connections.
  • Llama struggled the most, often drawing maps that didn't make sense.

The Big Surprise: The two "Inspector" AIs (Grok and Mistral) agreed with each other almost perfectly! They were like two senior inspectors walking a construction site and nodding in unison. This proved that AI can actually grade AI work reliably.
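"Agreed almost perfectly" can be quantified with a chance-corrected agreement statistic such as Cohen's kappa, a standard choice for two raters (the paper's exact statistic and numbers are not reproduced here; the ratings below are invented to show how the measurement works):

```python
# Cohen's kappa: how often two raters agree, corrected for the agreement
# you'd expect by chance. Ratings below are invented example data.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement: chance overlap given each rater's score frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

grok   = [5, 4, 5, 3, 4, 5, 2, 4]  # hypothetical scores from judge A
mistral = [5, 4, 4, 3, 4, 5, 2, 4]  # hypothetical scores from judge B
print(round(cohens_kappa(grok, mistral), 2))  # 0.82
```

A kappa near 1.0 means the two judges agree far more than chance would predict, which is the sense in which two AI inspectors "nodding in unison" becomes a measurable claim.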

4. The Human Reality Check

To be absolutely sure, the researchers brought in two real human experts (actual software architects) to grade the best AI's work. They wanted to see if the AI Inspectors were hallucinating or if they were actually seeing what humans saw.

The Verdict: The AI Inspectors and the Human Experts were on the same page.

  • When the humans said, "This blueprint is clear and accurate," the AI said, "Yes, 5 out of 5 stars."
  • When the humans said, "This part is confusing," the AI agreed.

The only time they disagreed slightly was on "subjective" things, like how "pretty" or "easy to read" a diagram was. But for the hard facts (logic, structure, rules), the AI was just as sharp as the human.

5. Why This Matters (The Analogy)

Think of this like a cooking competition.

  • Before: You had to hire a famous chef to taste every dish and tell you if it was good. It was slow and expensive.
  • Now: You have a "Super-Taster Robot" that can taste a dish, compare it to a recipe, and tell you exactly what's missing (too much salt, undercooked meat) with the same accuracy as a human chef.

The Takeaway

This paper proves that we are entering a new era where:

  1. AI can build: It can turn messy text requirements into structured software blueprints.
  2. AI can check: It can review those blueprints and tell us if they are good, without needing a human to do the heavy lifting first.

The Catch: While the AI is great at the basics, complex, weird, or highly specialized jobs (like the "Pacemaker" example) still need a human expert to give the final "thumbs up."

In short: AI is no longer just a tool that writes code; it's becoming a collaborator that can design, build, and even grade its own work, saving us hours of tedious drawing and checking.