Imagine you are hiring a team of AI architects to design the blueprints for a new building. In the software world, these blueprints are called Object-Oriented Designs (OOD). They aren't just lines of code; they are the structural plans showing how different parts of a system (like "Users," "Orders," and "Payments") connect and talk to each other.
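To make the "blueprint" idea concrete, here is a minimal sketch of what an object-oriented design looks like once it becomes code. The domain and class names (a generic shop with Users, Orders, and Payments) are illustrative, not taken from the paper:

```python
# A minimal OOD sketch: three classes and the links between them.
# Names and structure are illustrative assumptions, not the paper's example.

class User:
    """A customer who can place orders."""
    def __init__(self, name: str):
        self.name = name
        self.orders: list["Order"] = []   # a User *has many* Orders


class Order:
    """An order placed by one User and settled by one Payment."""
    def __init__(self, user: User, amount: float):
        self.user = user                  # each Order belongs to one User
        self.amount = amount
        self.payment: "Payment | None" = None
        user.orders.append(self)          # register on the owning User


class Payment:
    """A payment that settles exactly one Order."""
    def __init__(self, order: Order):
        self.order = order
        order.payment = self              # link back to the Order it settles
```

The design is exactly those links: which classes exist, what each one is responsible for, and how they point at each other.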
For a long time, we've been testing these AI architects on how well they can lay bricks (write code). But nobody really checked if they could actually design a stable, logical building.
This paper, "OODEval," is like a new, rigorous architectural licensing exam designed specifically to test these AI architects on their design skills. Here is the breakdown in simple terms:
1. The Problem: We Had No Exam
Previously, if you wanted to test an AI's design skills, it was like running a cooking competition with no set menu and no judges who could actually taste the food.
- No Standard Test: There was no agreed-upon set of design problems.
- No Good Grading: Existing grading tools were like a spell-checker. They could tell if the words were spelled right (syntax), but they couldn't tell if the building made sense (semantics). An AI could write a perfect sentence saying "The kitchen is on the roof," and the spell-checker would say "Great!" even though that's a terrible design.
2. The Solution: OODEval (The New Exam)
The researchers built a brand new testing ground called OODEval.
- The Test Questions: They created 50 real-world design challenges, ranging from "Design a simple coffee shop system" (Easy) to "Design a complex banking system with thousands of connections" (Hard).
- The Human Control Group: To see how the AI compares to real people, they gathered 940 real blueprints drawn by undergraduate students, each graded by course instructors. This is the "Human Benchmark."
- The New Grader (CLUE): They invented a new grading metric called CLUE (Class Likeness Unified Evaluation).
- Analogy: Imagine a spell-checker that also understands architecture. CLUE doesn't just check if the words are right; it checks if the "Kitchen" is actually connected to the "Dining Room" and if the "Foundation" supports the "Roof." It gives a score based on how close the AI's design is to the perfect human design.
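One way to picture what a metric like CLUE has to do: extract the classes, methods, and relationships from both designs, then score how well they match. The sketch below uses a simple set-based F1 average as a stand-in; the paper's actual CLUE formula may differ, so treat this purely as an illustration of "grading meaning, not spelling":

```python
# Hypothetical sketch of a semantic design metric: score a generated design
# against a reference by matching structural elements. The real CLUE formula
# is not reproduced here; set-based F1 is an illustrative assumption.

def f1(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over matched elements."""
    if not predicted and not reference:
        return 1.0
    matched = len(predicted & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


def design_score(pred: dict, ref: dict) -> float:
    """Average F1 across classes, methods, and relationships."""
    parts = [
        f1(pred["classes"], ref["classes"]),
        f1(pred["methods"], ref["methods"]),
        f1(pred["relationships"], ref["relationships"]),
    ]
    return sum(parts) / len(parts)
```

Under a scheme like this, a design with perfect syntax but the "kitchen on the roof" (right words, wrong structure) still scores low, because its relationship set barely overlaps the reference.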
3. The Results: The AI Report Card
The researchers tested 29 different AI models (like GPT-4, Llama, and Qwen) on this exam. Here is what they found:
🏆 The Good News: They Can Write Perfect Sentences
The AIs are amazing at syntax. If you ask them to draw a blueprint, they almost always follow the rules of the drawing language perfectly. They rarely make "grammar" mistakes.
- Metaphor: They can draw a perfect circle with a ruler.
📉 The Bad News: They Don't Understand the Logic
The AIs struggle with semantics (the meaning).
- The "Method" Problem: They are good at naming rooms (Classes) but terrible at describing what happens inside them (Methods). It's like an architect who draws a perfect "Kitchen" but forgets to include a stove or a sink.
- The "Relationship" Problem: They get confused about how things connect. They might connect a "Car" to a "Pizza" instead of a "Garage."
- The Difficulty Curve: The harder the design gets, the worse the AIs do. Simple tasks? They ace them. Complex, real-world systems? They start hallucinating (making things up).
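The two semantic failure modes above can be shown side by side in code. All class names here are invented for illustration, not actual model output:

```python
# Illustrative sketch of the two semantic failure modes (invented names,
# not real model output).

# A sound design: behavior lives inside the class, and the
# relationship points at the right collaborator.
class Garage:
    def __init__(self):
        self.cars: list["Car"] = []

class Car:
    def __init__(self, garage: Garage):
        self.garage = garage          # Car is parked in a Garage (correct link)
        garage.cars.append(self)

    def start_engine(self) -> str:    # the "stove and sink": actual behavior
        return "engine running"


# The "Method" problem: the room exists, but it's empty.
class CarMissingMethods:
    def __init__(self):
        pass                          # no start_engine, no behavior at all


# The "Relationship" problem: a link to an unrelated class.
class Pizza: ...

class CarWrongLink:
    def __init__(self, pizza: Pizza):
        self.pizza = pizza            # Car connected to Pizza, not Garage
```

Syntactically, all three `Car` variants are flawless Python; only the first one is a sensible design, which is exactly the gap a syntax checker misses.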
🥊 AI vs. Humans
- Average AI vs. Average Student: The average AI is currently worse than the average college student.
- Top AI vs. Average Student: The very best AIs (like Qwen3-Coder-30B) are now performing almost as well as the average student. They are getting close to passing the exam!
- Top AI vs. Top Student: However, the best AI is still far behind the best human experts. The top students can still design things the AI simply cannot figure out yet.
4. Who Won the Exam?
- The Champion: Qwen3-Coder-30B (a local, open-source model) took the top spot. It was the most balanced and reliable.
- The Surprise: Gemma3-4B-IT (a very small model) punched way above its weight class, beating much larger, expensive models like GPT-4o-mini. This suggests you don't always need a massive supercomputer to get good design results.
- The Losers: Some older models (like Llama 2) failed miserably, scoring near zero.
5. Why Does This Matter? (The Takeaway)
- For Developers: If you want to use AI to help design software, don't just trust it blindly. It's great at the basics but needs a human to check the logic, especially for complex relationships.
- For Teachers: This is a wake-up call. Since top AIs can now design at the level of an average student, students could use them to cheat on homework. Teachers need to change how they test students (e.g., asking them to explain their design orally) rather than just grading the final blueprint.
- For AI Researchers: The next big step isn't just making the AI "smarter" generally; it's specifically teaching it how to understand relationships and complex logic, not just how to write code.
In a nutshell: AI has learned to draw the lines perfectly, but it's still learning how to think like an architect. We have a new tool (OODEval) to measure exactly how far it has to go.