Imagine you are asking a very talented, but slightly quirky, robot chef to cook you a specific dish. You say, "Make me a spicy pasta." The robot makes a delicious pasta.
But then, you try again. This time, you accidentally type "spciy pasta" (a typo), or you say "make a hot noodle dish" (using synonyms), or you rephrase it as "I want a pasta that has a kick to it" (paraphrasing).
The big question this paper asks is: Will the robot chef still make you the same pasta, or will the tiny changes in your words cause it to make something completely different? Maybe it makes a salad, or a soup, or a pasta with no sauce at all?
This paper, titled "Code Roulette," investigates exactly this problem, but instead of a robot chef, it's testing Large Language Models (LLMs)—the AI brains behind tools like ChatGPT or Claude—that write computer code.
Here is the breakdown of their findings using simple analogies:
1. The Problem: The "Fragile" Robot
The authors noticed that while AI is great at writing code, it can be incredibly sensitive to how you ask for it.
- The Analogy: Think of the AI like a highly sensitive musical instrument. If you press a piano key slightly to the left (a typo), or use a slightly different word for the note (synonym), the instrument might play a completely different song.
- Why it matters: If a developer asks an AI to build a login system, and they type it slightly differently tomorrow, the AI might build a different login system. This makes software hard to trust, hard to fix, and hard to maintain.
2. The Experiment: The "Code Roulette" Wheel
To test this, the researchers built a special testing machine (an evaluation pipeline). They didn't just ask the AI once; they spun the "roulette wheel" of language changes:
- Keyboard Typos: They intentionally added random typos (like hitting the wrong key).
- Synonyms: They swapped words for their cousins (e.g., changing "fast" to "quick").
- Paraphrasing: They completely rewrote the sentence while keeping the same meaning (e.g., "Sort this list" became "I need these numbers in order").
They then asked four popular AI models (GPT-4o, Claude, Gemini, and Llama) to write code based on these slightly messed-up prompts.
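To make the three perturbation types concrete, here is a minimal sketch of what typo injection and synonym swapping might look like in code. This is an illustrative assumption, not the paper's actual pipeline: the keyboard-neighbor map and synonym table below are tiny stand-ins for the much larger resources a real evaluation would use.

```python
import random

# Illustrative keyboard-neighbor map (an assumption; a real pipeline
# would use a full QWERTY adjacency table).
NEIGHBORS = {"a": "sq", "s": "ad", "o": "ip", "t": "ry", "e": "wr"}

def add_typo(prompt: str, rng: random.Random) -> str:
    """Replace one character with an adjacent key, simulating a slip of the finger."""
    positions = [i for i, ch in enumerate(prompt) if ch.lower() in NEIGHBORS]
    if not positions:
        return prompt
    i = rng.choice(positions)
    wrong = rng.choice(NEIGHBORS[prompt[i].lower()])
    return prompt[:i] + wrong + prompt[i + 1:]

# Tiny synonym table (also an assumption, for illustration only).
SYNONYMS = {"fast": "quick", "sort": "order", "list": "sequence"}

def swap_synonyms(prompt: str) -> str:
    """Swap each known word for a synonym, keeping the meaning intact."""
    return " ".join(SYNONYMS.get(word, word) for word in prompt.split())

rng = random.Random(0)
original = "sort this list fast"
print(add_typo(original, rng))   # e.g. "sort this list fasr"
print(swap_synonyms(original))   # "order this sequence quick"
```

Each perturbed prompt is then sent to the model, and the returned code is compared against what the untouched prompt produces.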
3. The Results: How the AI Reacted
The results were like watching a car drive over different terrains:
- Typos are Rough Terrain (The Bad News): When the researchers added typos, the AI's code changed drastically. Even a few small spelling mistakes made the AI produce code that looked nothing like the original. It's like if you told the chef "spicy pasta" and they suddenly decided to make a pizza because they got confused.
- Synonyms and Paraphrasing are Smooth Roads (The Good News): When the researchers just swapped words or rephrased the sentence, the AI was much more stable. It understood the intent even if the words changed. It's like the chef understanding "hot noodle dish" is still pasta.
- The "Old vs. New" Puzzle:
- Old Problems: When they tested the AI on famous, old coding problems (like those from LeetCode that the AI has likely memorized), the AI was very stable. It didn't matter how they asked; the AI just recited what it already knew.
- New Problems: When they tested the AI on brand-new, unique problems it had never seen before, the AI became very unstable. Even tiny changes in the prompt caused the code to change wildly. This suggests that for new, creative tasks, the AI is still a bit of a gamble.
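One simple way to quantify "how much the code changed" between prompt variants is a textual similarity score. The sketch below uses Python's standard-library difflib as a stand-in; this is an assumption for illustration, since the paper's actual stability metrics may be more sophisticated (for example, structural or behavioral comparison), and the code snippets below are hypothetical model outputs.

```python
import difflib

def similarity(code_a: str, code_b: str) -> float:
    """Textual similarity in [0, 1]; 1.0 means the two snippets are identical."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()

# Hypothetical outputs for the original prompt vs. a perturbed prompt.
original_code = "def sort_list(xs):\n    return sorted(xs)\n"
stable_code   = "def sort_list(xs):\n    return sorted(xs)\n"
drifted_code  = "def order(seq):\n    seq.sort()\n    return seq\n"

print(similarity(original_code, stable_code))   # 1.0 (perfectly stable)
print(similarity(original_code, drifted_code))  # well below 1.0 (the code drifted)
```

In this framing, "stable" behavior (synonyms, memorized problems) shows scores near 1.0 across variants, while "unstable" behavior (typos, novel problems) shows scores dropping sharply.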
4. Why This Matters to You
You might think, "I'm not a programmer, why do I care?"
- Trust: If you use AI to help you write code for your business, you need to know that asking for the same thing twice, even in slightly different words, gets you the same answer. If the answer changes every time you rephrase the question, you can't trust the software.
- Maintenance: Imagine a team of developers working on a project. If one person asks the AI for a function and gets Version A, and another person asks the same thing (but with different words) and gets Version B, the code will be a mess. It's like building a house where one team uses red bricks and the other uses blue bricks because they asked the architect slightly different questions.
- The "Data Contamination" Warning: The paper warns that many AI tests are using old problems that the AI has already memorized. This is like taking a math test where the teacher accidentally gave you the answers beforehand. The AI looks smart, but it's just reciting. The authors created new problems to get a true picture of how smart the AI really is.
The Bottom Line
The paper concludes that asking an AI to generate code is, at present, a game of "Code Roulette."
If you are a beginner or a non-expert, you might not know the "perfect" way to ask the AI for code. If you make a small mistake or phrase it differently, you might get a completely different result. The authors hope that by measuring this sensitivity, we can build better AI tools that are more robust, reliable, and trustworthy—so that no matter how you ask, you get the right code.
In short: The AI is brilliant, but it's also a bit fragile. We need to teach it to listen to the meaning of our words, not just the exact spelling, so it can be a reliable partner in building software.