Imagine you have a super-smart, incredibly well-read robot assistant (a Large Language Model or LLM) that can write computer code for you. If you ask it to write a standard recipe in English (like a Python program), it's usually fantastic because it has read millions of similar recipes before.
But what happens if you ask it to write code in a very specific, rare language used only by engineers to check safety rules (like OCL or Alloy)? This is like asking the robot to write a recipe in a dialect spoken by only 50 people in the world. It hasn't read enough examples, so it starts guessing, making mistakes, or inventing rules that don't exist.
This paper is about building a quality control factory to test exactly how good these robots are at writing code in these rare languages, and how we can help them do better.
Here is the breakdown of their findings, using some everyday analogies:
1. The Problem: The "Specialist" vs. The "Generalist"
The researchers found that these AI robots are like generalist chefs.
- Python (The Generalist): If you ask the robot to cook a standard dish (Python code), it's amazing. It knows the ingredients and the steps perfectly.
- OCL/Alloy (The Specialists): If you ask it to cook a very specific, obscure dish (constraint DSLs, i.e., domain-specific languages for expressing rules), it struggles. It might forget the ingredients (syntax errors) or serve a dish that tastes wrong (logic errors).
Why? Because the robot was trained on a massive library of books. It has read millions of Python books but only a few dozen books on OCL or Alloy. It's trying to guess the rules based on patterns it barely knows.
2. The Solution: A "Tasting Menu" Framework
The authors built a framework (a testing machine) to act as a strict food critic. This machine doesn't just taste the food; it checks two things:
- Well-Formedness: "Is the food even edible?" (Does the code follow the grammar rules? Can the computer read it?)
- Correctness: "Does it taste like what the customer ordered?" (Does the code actually solve the problem?)
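To make the two checks concrete, here is a minimal sketch of the "food critic" idea. It uses Python as a stand-in target language (the paper's actual checkers parse OCL/Alloy, whose tooling isn't reproduced here), and the `solve` function name is just an illustrative convention, not something from the paper:

```python
import ast

def is_well_formed(code: str) -> bool:
    """Well-formedness: can the language's parser read the code at all?
    (Stand-in: Python's own parser; the paper checks OCL/Alloy grammar.)"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def is_correct(code: str, test_input, expected) -> bool:
    """Correctness: does the code actually solve the problem?
    Here we run the snippet and call a hypothetical `solve` function."""
    namespace = {}
    exec(code, namespace)
    return namespace["solve"](test_input) == expected

good = "def solve(x):\n    return x * 2"
bad_syntax = "def solve(x) return x * 2"   # missing colon: unreadable

print(is_well_formed(good))        # the parser accepts it
print(is_well_formed(bad_syntax))  # grammar error: "inedible"
print(is_correct(good, 3, 6))      # and it solves the task
```

Note that the two checks are independent: code can be perfectly parseable (edible) yet compute the wrong thing (not what the customer ordered), which is why the framework needs both.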
They tested this machine on four different "chefs" (AI models): two famous ones (GPT-4o, GPT-4o-mini) and two open-source ones (DeepSeek, Llama).
3. The Key Findings (The "Tasting Notes")
A. The Chef Matters More Than the Recipe Card
You might think that if you give the robot a perfect, detailed instruction card (a Prompt), it will do a great job.
- The Finding: It doesn't matter much how you write the instruction card. If the robot doesn't know the language (like OCL), even a perfect instruction card won't save it.
- The Analogy: It's like giving a Michelin-star chef a perfect recipe for a dish they've never seen before. They might still mess it up because they don't know the local ingredients.
- The Winner: The big, expensive "chefs" (GPT-4o) did much better than the smaller, free ones. The smaller ones often couldn't even hold the whole "recipe" in their memory (context window) because the instructions were too long.
B. "Batch Cooking" vs. "One-by-One"
When you need to write 10 different safety rules, should you ask the robot to write all 10 at once, or one by one?
- The Finding: Asking for all 10 at once (Batch) is usually better.
- The Analogy: If you ask a writer to write 10 chapters of a story separately, they might forget the name of the main character in Chapter 3 when they get to Chapter 7. But if they write them all in one sitting, they remember the whole story better.
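The difference between the two strategies is just how the prompts are assembled. This sketch shows the shape of each; the prompt wording, the class model, and the rules are invented for illustration and are not the paper's actual templates:

```python
def batch_prompt(model_description: str, rules: list[str]) -> str:
    """Batch: one prompt asking for all constraints at once, so the model
    sees the whole class model and every rule in a single context."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    return (f"Given this class model:\n{model_description}\n"
            f"Write one OCL invariant for each rule:\n{numbered}")

def one_by_one_prompts(model_description: str, rules: list[str]) -> list[str]:
    """One-by-one: separate prompts; each call is a fresh conversation
    that forgets what the other calls produced."""
    return [f"Given this class model:\n{model_description}\n"
            f"Write an OCL invariant for this rule:\n{r}"
            for r in rules]

model = "class Account { balance: int; owner: Person }"
rules = ["balance must be non-negative", "owner must be an adult"]

batch = batch_prompt(model, rules)          # one call, shared context
singles = one_by_one_prompts(model, rules)  # len(rules) separate calls
```

The batch version also costs fewer API calls, since the (often long) model description is sent once instead of once per rule.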
C. The Magic of "Do-overs" and "Fix-it"
This was the most exciting part. The researchers tried two tricks:
- Multiple Attempts: Ask the robot to try writing the code 3 times and pick the best one.
- Code Repair: If the robot writes bad code, show it the error message and say, "Hey, this is broken. Fix it."
- The Finding: Both tricks worked wonders!
- Multiple Attempts: Like flipping a coin. One flip might give you tails, but across 3 flips the chance of at least one heads jumps from 50% to 87.5%.
- Code Repair: This is like a teacher correcting a student's homework. The student makes a mistake, the teacher points it out, and the student fixes it. This boosted the success rate by 10-20%.
- The Best Combo: Doing both (trying 3 times, and fixing the bad ones) gave the best results, though it cost more money and time.
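The combined strategy can be sketched as a simple loop. The `model` and `checker` interfaces below are hypothetical stand-ins (a real run would call an LLM API and an OCL/Alloy checker), and the stub "model" only succeeds when it sees error feedback, purely to demonstrate the repair path:

```python
def generate_with_retries(model, task, checker, attempts=3, repairs=1):
    """Try up to `attempts` fresh generations; after each failure, feed the
    error message back up to `repairs` times and ask the model to fix it."""
    for _ in range(attempts):
        code = model(task)
        ok, error = checker(code)
        for _ in range(repairs):
            if ok:
                break
            # Code repair: show the model its own broken output plus the error.
            code = model(f"{task}\nThis attempt failed: {error}\nFix it:\n{code}")
            ok, error = checker(code)
        if ok:
            return code
    return None  # every attempt (and repair) failed

# Stub model standing in for an LLM: only produces valid code once it
# receives error feedback, so the repair step is what rescues the run.
def flaky_model(prompt):
    return "fixed" if "failed" in prompt else "broken"

def checker(code):
    return (code == "fixed", None if code == "fixed" else "syntax error")

result = generate_with_retries(flaky_model, "write safety rule 1", checker)

# Why retries help at all: if one attempt succeeds with probability p,
# at least one of k independent attempts succeeds with 1 - (1 - p)^k.
p, k = 0.5, 3
chance = 1 - (1 - p) ** k  # 0.875
```

The `attempts` and `repairs` knobs are exactly the cost/quality trade-off the authors note: each extra attempt or repair round is another paid model call.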
4. The Big Takeaway for Humans
If you want to use AI to write code for a rare, specialized language:
- Don't waste time tweaking your instructions too much. It won't help much if the AI doesn't know the language.
- Pick the right AI. Use the big, powerful models, not the small free ones.
- Be patient. Don't just ask once. Ask it to try a few times, and if it fails, ask it to fix the error. It's like hiring a junior employee and giving them a second chance to get it right.
Summary
The paper is essentially a guidebook for using AI in the "wilderness" of specialized programming languages. It tells us that while AI is great at general tasks, it needs a little extra help (more attempts, error correction, and powerful models) to handle the tricky, rare jobs. The authors built a tool to measure this help, proving that repetition and correction are the keys to success.