Grounding Machine Creativity in Game Design Knowledge Representations: Empirical Probing of LLM-Based Executable Synthesis of Goal Playable Patterns under Structural Constraints

This paper empirically investigates whether large language models can synthesize executable Unity game code from Goal Playable Patterns under strict structural constraints, revealing that while intermediate representations improve performance, project-level grounding and hygiene failures remain primary bottlenecks in achieving high compilation success rates.

Hugh Xuechen Liu, Kıvanç Tatar

Published 2026-03-10

Imagine you have a brilliant idea for a new video game mechanic. You can describe it perfectly in words: "The player collects glowing orbs, and when they get ten, the boss gets angry and the level changes."

Now, imagine you want a robot to take that description and instantly build a working, playable game file for you. That is the dream of Computational Game Creativity.

This paper is a report card on how well our current "super-smart robots" (called Large Language Models or LLMs) are doing at this specific task. The short answer? They are trying very hard, but they are failing to build a single working game.

Here is the breakdown of their experiment, explained with simple analogies.

1. The Goal: Turning Blueprints into Buildings

The researchers wanted to see if an AI could take a "Game Design Pattern" (a fancy blueprint for how a game mechanic works) and turn it into actual code that runs in Unity (a popular game engine).

Think of it like this:

  • The Blueprint: A human-written description of a game rule (e.g., "The player must hide from the guard").
  • The Builder: An AI (the LLM).
  • The Building: A working Unity game project.

The challenge is that Unity is a very strict architect. If you miss one tiny screw or use the wrong type of brick, the whole building collapses and won't open.

2. The Experiment: Two Ways to Build

The researchers tested two different ways to ask the AI to build these games:

  • Method A (The "Just Tell Me" Approach): They gave the AI the game description in plain English and said, "Write the code."
    • Analogy: You tell a contractor, "Build me a house with a red door," and they start hammering without looking at the blueprints or checking if they have the right nails.
  • Method B (The "Strict Blueprint" Approach): They forced the AI to first write a structured "Intermediate Representation" (IR)—a detailed, technical JSON list of every object, script, and connection needed—before writing the actual code.
    • Analogy: You force the contractor to first fill out a 50-page form listing every single brick, wire, and pipe, and only then are they allowed to start building.
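To make the "strict blueprint" idea concrete, here is a minimal sketch of what such an Intermediate Representation might look like for the orb-collecting mechanic from the introduction. All field names (`game_objects`, `scripts`, `connections`) and the validation check are invented for illustration; the paper's actual IR schema may differ.

```python
# A hypothetical IR for "collect 10 orbs, boss gets angry".
# Field names are invented for illustration, not the paper's schema.
ir = {
    "game_objects": [
        {"name": "Player", "components": ["CharacterController", "OrbCollector"]},
        {"name": "Orb", "components": ["Collider", "OrbPickup"], "count": 10},
        {"name": "Boss", "components": ["BossController"]},
    ],
    "scripts": [
        {"name": "OrbCollector", "fields": [{"name": "orbsCollected", "type": "int"}]},
        {"name": "OrbPickup", "fields": []},
        {"name": "BossController", "fields": [{"name": "isAngry", "type": "bool"}]},
    ],
    "connections": [
        # When the player has collected 10 orbs, trigger the boss's anger state.
        {"event": "OrbCollector.orbsCollected == 10",
         "action": "BossController.isAngry = true"},
    ],
}

def validate_ir(ir):
    """Return components referenced by game objects that are neither
    defined as scripts nor assumed built-in Unity components."""
    builtin = {"CharacterController", "Collider"}
    defined = {s["name"] for s in ir["scripts"]} | builtin
    return [c for obj in ir["game_objects"]
            for c in obj["components"] if c not in defined]

print(validate_ir(ir))  # → [] (every referenced component is accounted for)
```

The point of such a form is exactly the "grounding" problem discussed later: a machine-checkable list makes it possible to catch an invented component before any C# is written.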

They used two different AI models (DeepSeek and Qwen) and 26 different game patterns. They ran this 4,160 times.
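The scale of the experiment can be pictured as a fully crossed design. The breakdown below is an assumption based on the numbers reported here (2 models × 2 prompting methods × 26 patterns, 4,160 total runs); the paper may have allocated runs differently.

```python
from itertools import product

models = ["DeepSeek", "Qwen"]
methods = ["plain-English prompt", "IR-first prompt"]
patterns = [f"pattern_{i:02d}" for i in range(1, 27)]  # 26 Goal Playable Patterns

# Every (model, method, pattern) triple is one experimental condition.
conditions = list(product(models, methods, patterns))
print(len(conditions))  # 2 * 2 * 26 = 104 conditions

# If the 4,160 total runs were split evenly (an assumption), that gives:
runs_per_condition = 4160 // len(conditions)
print(runs_per_condition)  # 40 repeated generation attempts per condition
```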

3. The Result: The "Zero Success" Rate

The bad news: 0% of the generated games worked. Not a single one compiled successfully. If you opened them in Unity, the engine would flag compiler errors and refuse to run them.

The good news: The researchers didn't just say "It failed." They acted like forensic detectives to figure out exactly why it failed. They found two main types of mistakes:

Type 1: The "Hallucination" Failures (Grounding Errors)

The AI made up things that didn't exist.

  • The Metaphor: The AI is like a chef who is told to make a "Spicy Tofu Stir-fry." The chef confidently writes a recipe that includes "Tofu," "Chili," and "Unicorn Meat."
  • The Reality: The Unity game engine doesn't have "Unicorn Meat." The AI invented a game object or a code function that doesn't exist in the specific project it was working on. It knew the concept of a game, but it didn't know the specific ingredients available in the kitchen.

Type 2: The "Messy Kitchen" Failures (Hygiene Errors)

The AI got the ingredients right, but the kitchen was a disaster.

  • The Metaphor: The chef has the right ingredients, but they forgot to put the salt in the pot, they wrote the recipe on a napkin that was torn in half, or they tried to use a fork to chop vegetables.
  • The Reality: The code had syntax errors, missing semicolons, duplicate names, or formatting issues. The logic was there, but the "grammar" was broken.
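One way to picture the two failure families is as buckets for compiler diagnostics. The sketch below sorts a few real C# compiler error codes into "grounding" versus "hygiene" buckets; the codes are genuine `csc` diagnostics, but the grouping is my illustration, not the paper's actual taxonomy.

```python
# Hypothetical mapping from real C# compiler error codes to the two
# failure families described above. The grouping is illustrative only.
GROUNDING = {
    "CS0246",  # type or namespace not found -> AI invented a class/asset
    "CS0103",  # name does not exist in this context -> AI invented a variable
}
HYGIENE = {
    "CS1002",  # ';' expected -> missing semicolon
    "CS1513",  # '}' expected -> broken brace structure
    "CS0101",  # duplicate definition in namespace -> duplicate names
}

def classify(error_code: str) -> str:
    """Sort a compiler diagnostic into the paper's two failure families."""
    if error_code in GROUNDING:
        return "grounding (hallucinated something that doesn't exist)"
    if error_code in HYGIENE:
        return "hygiene (the code's grammar is broken)"
    return "unclassified"

print(classify("CS0246"))  # grounding (hallucinated something that doesn't exist)
print(classify("CS1002"))  # hygiene (the code's grammar is broken)
```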

4. The Surprising Twist: The "Strict Blueprint" Made It Worse

You might think that forcing the AI to fill out the strict JSON form (Method B) would help. Surprisingly, it made one kind of failure far more common.

  • Without the form: roughly 40-50% of the failed builds broke down because of "messy kitchen" (hygiene) errors.
  • With the form: hygiene errors accounted for 96-99% of the failures. (Remember, nothing compiled in any condition, so these percentages describe what kind of error caused the failure, not how often the AI failed.)

Why?
When the AI tried to follow the strict, complex JSON form, it got overwhelmed. It produced code so long and structurally convoluted that it could no longer keep the basics straight, and the Unity compiler rejected it. It was like asking a contractor to build a skyscraper from a 1,000-page manual: all their energy went into following the manual, and none was left for laying the bricks straight.

5. The Big Takeaway

The paper concludes that while AI is getting better at writing code, it is currently terrible at "grounding" that code into a specific, existing project.

  • The Problem: The AI knows how to write code generally, but it doesn't know the "rules of the house" (the specific files, variables, and structures already present in the Unity project).
  • The Bottleneck: The biggest hurdle isn't creativity; it's reliability. Until AI can perfectly understand the specific "furniture" already in the room before it tries to add new furniture, it cannot reliably build playable games on its own.

Summary Analogy

Imagine you are trying to teach a robot to fix a specific car (the Unity project).

  • The AI is a genius mechanic who knows how to fix any car in the world.
  • The Problem: The AI keeps trying to install a Ferrari engine into a Ford F-150 because it thinks that's what the car should have, or it forgets to tighten the bolts.
  • The Study: The researchers tried giving the AI a checklist (the IR) to stop it from guessing. The checklist helped the AI stop guessing the engine type, but the checklist was so long and complicated that the AI got confused and forgot how to turn the wrench.

Conclusion: We are close, but we aren't there yet. We need to teach the AI not just how to build, but exactly what is already in the toolbox before it starts building.