Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints

This paper empirically investigates whether large language models can synthesize executable Unity game code from Goal Playable Patterns under strict structural constraints, revealing that while intermediate representations improve performance, project-level grounding and hygiene failures remain primary bottlenecks in achieving high compilation success rates.

Hugh Xuechen Liu, Kıvanç Tatar

Published 2026-03-10

Imagine you have a brilliant idea for a new video game mechanic. You can describe it perfectly in words: "The player collects glowing orbs, and when they get ten, the boss gets angry and the level changes."

Now, imagine you want a robot to take that description and instantly build a working, playable game file for you. That is the dream of Computational Game Creativity.

This paper is a report card on how well our current "super-smart robots" (called Large Language Models or LLMs) are doing at this specific task. The short answer? They are trying very hard, but they are failing to build a single working game.

Here is the breakdown of their experiment, explained with simple analogies.

1. The Goal: Turning Blueprints into Buildings

The researchers wanted to see if an AI could take a "Game Design Pattern" (a fancy blueprint for how a game mechanic works) and turn it into actual code that runs in Unity (a popular game engine).

Think of it like this:

  • The Blueprint: A human-written description of a game rule (e.g., "The player must hide from the guard").
  • The Builder: An AI (the LLM).
  • The Building: A working Unity game project.

The challenge is that Unity is a very strict architect. If you miss one tiny screw or use the wrong type of brick, the whole building collapses and won't open.

2. The Experiment: Two Ways to Build

The researchers tested two different ways to ask the AI to build these games:

  • Method A (The "Just Tell Me" Approach): They gave the AI the game description in plain English and said, "Write the code."
    • Analogy: You tell a contractor, "Build me a house with a red door," and they start hammering without looking at the blueprints or checking if they have the right nails.
  • Method B (The "Strict Blueprint" Approach): They forced the AI to first write a structured "Intermediate Representation" (IR)—a detailed, technical JSON list of every object, script, and connection needed—before writing the actual code.
    • Analogy: You force the contractor to first fill out a 50-page form listing every single brick, wire, and pipe, and only then are they allowed to start building.
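To make the "strict blueprint" idea concrete, here is a minimal sketch of what such an Intermediate Representation might look like for the orb-collecting mechanic from the introduction. All field names (`game_objects`, `scripts`, `connections`) and the validation check are invented for illustration; the paper's actual IR schema may differ.

```python
# A hypothetical IR for "collect 10 orbs, boss gets angry".
# Field names are invented for illustration, not the paper's schema.
ir = {
    "game_objects": [
        {"name": "Player", "components": ["CharacterController", "OrbCollector"]},
        {"name": "Orb", "components": ["Collider", "OrbPickup"], "count": 10},
        {"name": "Boss", "components": ["BossController"]},
    ],
    "scripts": [
        {"name": "OrbCollector", "fields": [{"name": "orbsCollected", "type": "int"}]},
        {"name": "OrbPickup", "fields": []},
        {"name": "BossController", "fields": [{"name": "isAngry", "type": "bool"}]},
    ],
    "connections": [
        # When the player has collected 10 orbs, trigger the boss's anger state.
        {"event": "OrbCollector.orbsCollected == 10",
         "action": "BossController.isAngry = true"},
    ],
}

def validate_ir(ir):
    """Return components referenced by game objects that are neither
    defined as scripts nor assumed built-in Unity components."""
    builtin = {"CharacterController", "Collider"}
    defined = {s["name"] for s in ir["scripts"]} | builtin
    return [c for obj in ir["game_objects"]
            for c in obj["components"] if c not in defined]

print(validate_ir(ir))  # → [] (every referenced component is accounted for)
```

The point of such a form is exactly the "grounding" problem discussed later: a machine-checkable list makes it possible to catch an invented component before any C# is written.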

They used two different AI models (DeepSeek and Qwen) and 26 different game patterns. They ran this 4,160 times.
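The scale of the experiment can be pictured as a fully crossed design. The breakdown below is an assumption based on the numbers reported here (2 models × 2 prompting methods × 26 patterns, 4,160 total runs); the paper may have allocated runs differently.

```python
from itertools import product

models = ["DeepSeek", "Qwen"]
methods = ["plain-English prompt", "IR-first prompt"]
patterns = [f"pattern_{i:02d}" for i in range(1, 27)]  # 26 Goal Playable Patterns

# Every (model, method, pattern) triple is one experimental condition.
conditions = list(product(models, methods, patterns))
print(len(conditions))  # 2 * 2 * 26 = 104 conditions

# If the 4,160 total runs were split evenly (an assumption), that gives:
runs_per_condition = 4160 // len(conditions)
print(runs_per_condition)  # 40 repeated generation attempts per condition
```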

3. The Result: The "Zero Success" Rate

The bad news: 0% of the generated games worked. Not a single one compiled successfully. If you opened them in Unity, the engine would flag compiler errors and refuse to run them.

The good news: The researchers didn't just say "It failed." They acted like forensic detectives to figure out exactly why it failed. They found two main types of mistakes:

Type 1: The "Hallucination" Failures (Grounding Errors)

The AI made up things that didn't exist.

  • The Metaphor: The AI is like a chef who is told to make a "Spicy Tofu Stir-fry." The chef confidently writes a recipe that includes "Tofu," "Chili," and "Unicorn Meat."
  • The Reality: The Unity game engine doesn't have "Unicorn Meat." The AI invented a game object or a code function that doesn't exist in the specific project it was working on. It knew the concept of a game, but it didn't know the specific ingredients available in the kitchen.

Type 2: The "Messy Kitchen" Failures (Hygiene Errors)

The AI got the ingredients right, but the kitchen was a disaster.

  • The Metaphor: The chef has the right ingredients, but they forgot to put the salt in the pot, they wrote the recipe on a napkin that was torn in half, or they tried to use a fork to chop vegetables.
  • The Reality: The code had syntax errors, missing semicolons, duplicate names, or formatting issues. The logic was there, but the "grammar" was broken.
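One way to picture the two failure families is as buckets for compiler diagnostics. The sketch below sorts a few real C# compiler error codes into "grounding" versus "hygiene" buckets; the codes are genuine `csc` diagnostics, but the grouping is my illustration, not the paper's actual taxonomy.

```python
# Hypothetical mapping from real C# compiler error codes to the two
# failure families described above. The grouping is illustrative only.
GROUNDING = {
    "CS0246",  # type or namespace not found -> AI invented a class/asset
    "CS0103",  # name does not exist in this context -> AI invented a variable
}
HYGIENE = {
    "CS1002",  # ';' expected -> missing semicolon
    "CS1513",  # '}' expected -> broken brace structure
    "CS0101",  # duplicate definition in namespace -> duplicate names
}

def classify(error_code: str) -> str:
    """Sort a compiler diagnostic into the paper's two failure families."""
    if error_code in GROUNDING:
        return "grounding (hallucinated something that doesn't exist)"
    if error_code in HYGIENE:
        return "hygiene (the code's grammar is broken)"
    return "unclassified"

print(classify("CS0246"))  # grounding (hallucinated something that doesn't exist)
print(classify("CS1002"))  # hygiene (the code's grammar is broken)
```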

4. The Surprising Twist: The "Strict Blueprint" Made It Worse

You might think that forcing the AI to fill out the strict JSON form (Method B) would help. Surprisingly, it made one kind of failure far more common.

  • Without the form: roughly 40-50% of the failed builds broke down because of "messy kitchen" (hygiene) errors.
  • With the form: hygiene errors accounted for 96-99% of the failures. (Remember, nothing compiled in any condition, so these percentages describe what kind of error caused the failure, not how often the AI failed.)

Why?
When the AI tried to follow the strict, complex JSON form, it got overwhelmed. It produced code so long and structurally convoluted that it could no longer keep the basics straight, and the Unity compiler rejected it. It was like asking a contractor to build a skyscraper from a 1,000-page manual: all their energy went into following the manual, and none was left for laying the bricks straight.

5. The Big Takeaway

The paper concludes that while AI is getting better at writing code, it is currently terrible at "grounding" that code into a specific, existing project.

  • The Problem: The AI knows how to write code generally, but it doesn't know the "rules of the house" (the specific files, variables, and structures already present in the Unity project).
  • The Bottleneck: The biggest hurdle isn't creativity; it's reliability. Until AI can perfectly understand the specific "furniture" already in the room before it tries to add new furniture, it cannot reliably build playable games on its own.

Summary Analogy

Imagine you are trying to teach a robot to fix a specific car (the Unity project).

  • The AI is a genius mechanic who knows how to fix any car in the world.
  • The Problem: The AI keeps trying to install a Ferrari engine into a Ford F-150 because it thinks that's what the car should have, or it forgets to tighten the bolts.
  • The Study: The researchers tried giving the AI a checklist (the IR) to stop it from guessing. The checklist helped the AI stop guessing the engine type, but the checklist was so long and complicated that the AI got confused and forgot how to turn the wrench.

Conclusion: We are close, but we aren't there yet. We need to teach the AI not just how to build, but exactly what is already in the toolbox before it starts building.