Imagine you have a giant, super-smart robot chef. For years, we've fed this chef every recipe ever written on the internet. Now, the chef can cook almost anything perfectly. But there's a problem: the chef has run out of new recipes to learn from. The internet is "saturated." If we just keep feeding the chef more of the same old recipes, they won't get any smarter; they'll just get better at memorizing.
To make the chef truly creative, we need to teach them how to invent new dishes, not just copy old ones. This is exactly what the paper "CreativeBench" is about. It's a new way to test if AI can be a true inventor, and a new trick to help them be more creative.
Here is the breakdown in simple terms:
1. The Problem: The "Copy-Paste" Trap
Right now, we test AI by asking it to solve standard puzzles (like "Write a function to sort a list"). The AI is great at this because it has seen millions of similar puzzles. But this doesn't test creativity. It's like testing a painter by asking them to copy a photo of a cat. If they copy it perfectly, they aren't creative; they're just a photocopier.
We need to know if the AI can:
- Mix things together in weird ways (like putting a pizza topping on a sushi roll).
- Explore new paths when the usual path is blocked (like finding a way to cross a river when the bridge is out).
2. The Solution: "CreativeBench" (The Creativity Gym)
The authors built a special gym called CreativeBench to train and test AI creativity. They split the gym into two rooms:
Room A: The Mix-Master (Combinatorial Creativity)
- The Analogy: Imagine you have a box of Lego bricks from a castle set and a box from a spaceship set. The challenge is to build a new vehicle that uses parts from both, but works perfectly.
- How it works: The AI is given code from two different fields (like music theory and graph algorithms) and asked to fuse them into one working program.
- The Test: The code must actually run. If it crashes, it's not creative; it's just a hallucination (a made-up idea that doesn't work).
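To make Room A concrete, here is a toy fusion in that spirit. The task itself is hypothetical, not one of the paper's actual benchmark problems: music theory supplies the data (which chords commonly follow which), and graph algorithms supply the method (breadth-first search for a shortest path between two chords).

```python
from collections import deque

# Hypothetical Room A-style fusion (illustrative, not from the paper):
# domain 1 (music theory) provides the chord-transition data,
# domain 2 (graph algorithms) provides BFS to search over it.
CHORD_GRAPH = {
    "C":  ["F", "G", "Am"],
    "F":  ["C", "G", "Dm"],
    "G":  ["C", "Am"],
    "Am": ["F", "Dm"],
    "Dm": ["G"],
}

def chord_progression(start, goal):
    """Return the shortest chord progression from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in CHORD_GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

print(chord_progression("C", "Dm"))  # ['C', 'F', 'Dm']
```

The point of the test harness is exactly what the bullet above says: the fused program must actually run and produce a valid result, or it counts as a hallucination.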
Room B: The Obstacle Course (Exploratory Creativity)
- The Analogy: Imagine you are driving to work, but the main road is closed. You can't use your GPS (the usual way). You have to find a new route through back alleys, fields, and parks to get to the same destination.
- How it works: The AI is given a problem, but with a "Negative Constraint" (e.g., "You cannot use loops" or "You cannot use the standard math formula"). It must find a completely different way to solve the problem.
- The Test: Did it solve the problem? Yes. Did it avoid the forbidden trick? Yes. Is the solution different from the standard one? Yes.
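A minimal sketch of what a Room B task looks like. The specific constraint here is illustrative, not taken from the paper: sum the integers 1 through n, but the obvious `for` loop (the "main road") is forbidden, so the solver must find a detour such as a closed-form formula or recursion.

```python
# Illustrative Room B-style task (not from the paper): sum integers 1..n.
# Negative constraint: no loops allowed.

def sum_standard(n):
    # The "main road": a plain loop. Under the constraint, this is forbidden.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n):
    # Detour 1: Gauss's closed-form formula, no iteration at all.
    return n * (n + 1) // 2

def sum_recursive(n):
    # Detour 2: recursion replaces the loop entirely.
    return 0 if n == 0 else n + sum_recursive(n - 1)

# All three routes reach the same destination.
assert sum_standard(100) == sum_closed_form(100) == sum_recursive(100) == 5050
```

The grader would accept the detours but reject `sum_standard`, even though all three are correct: correctness alone is not the test.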
3. The Scorecard: Quality × Novelty
How do you grade creativity? The authors created a simple formula:
Creativity Score = Quality × Novelty
- Quality: Does the code actually work? (If it's a weird new dish, does it taste good, or is it just a pile of dirt?)
- Novelty: Is it different from what everyone else does? (Is it a unique flavor, or just a copy of a McDonald's burger?)
If an AI writes a perfect, standard solution, its score is low because it's not novel. If it writes a wild, unique solution that crashes, the score is low because it's not quality. You need both.
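The multiplicative scorecard can be sketched in a few lines. Note the stand-ins: the paper's actual quality and novelty measures are not reproduced here; this sketch uses test pass rate for quality and textual dissimilarity from a reference solution (via `difflib`) for novelty, both of which are assumptions for illustration.

```python
from difflib import SequenceMatcher

def quality(solution_fn, tests):
    """Fraction of (input, expected) test cases the solution passes."""
    passed = sum(1 for arg, expected in tests if solution_fn(arg) == expected)
    return passed / len(tests)

def novelty(candidate_src, reference_src):
    """1 minus textual similarity to the standard solution (a crude proxy)."""
    return 1.0 - SequenceMatcher(None, candidate_src, reference_src).ratio()

def creativity_score(solution_fn, candidate_src, reference_src, tests):
    # The paper's multiplicative rule: a zero in either factor zeroes the score.
    return quality(solution_fn, tests) * novelty(candidate_src, reference_src)

# Toy example: the candidate solves the same problem a different way.
reference = "def f(n):\n    return n * (n + 1) // 2\n"
candidate = "def f(n):\n    return sum(range(n + 1))\n"
tests = [(0, 0), (3, 6), (100, 5050)]

score = creativity_score(lambda n: sum(range(n + 1)), candidate, reference, tests)
```

The multiplication captures the trade-off described above: a perfect copy of the reference scores zero on novelty, and a wildly original solution that fails the tests scores zero on quality.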
4. What They Discovered (The "Aha!" Moments)
When they tested the world's smartest AI models in this gym, they found some surprising things:
- Bigger isn't always more creative: Making the AI model bigger (adding more "brain power") makes it better at Room A (Mixing things). It gets really good at combining known ideas. But for Room B (Exploring new paths), bigger models actually get worse. They become too confident in their usual ways and refuse to try risky, new paths. They get "stuck" in their comfort zone.
- Reasoning helps, but only sometimes: When the AI is told to "think step-by-step" (Reasoning mode), it gets much better at navigating the Obstacle Course (Room B). But it doesn't help much with mixing things together (Room A).
- The "Convergence" Effect: As models get bigger, they all start sounding the same. They become very correct, but very boring. They converge on the "safe" answer.
5. The Magic Trick: "EvoRePE" (The Creativity Booster)
The authors didn't just stop at testing; they wanted to fix the problem. They noticed that when AI models try to solve these hard problems using "evolutionary" methods (trying many variations and keeping the best ones), they develop a specific "creative pattern" in their brain.
They created a tool called EvoRePE.
- The Analogy: Imagine you have a radio. Usually, it plays standard pop music. But the authors found a hidden frequency that plays "Jazz Improvisation." They built a little antenna (a vector) that, when plugged into the radio, forces it to tune into that Jazz frequency.
- How it works: They extracted a "Creativity Vector" from successful creative attempts and injected it into the AI while it was thinking.
- The Result: Suddenly, the AI started generating more creative solutions without needing to be retrained or run expensive evolutionary searches. It's like giving the AI a "creative mindset" switch.
Summary
This paper is a wake-up call. It tells us that simply making AI models bigger won't make them more creative. In fact, it might make them more rigid.
To get true machine creativity, we need:
- Better Tests: Like CreativeBench, which forces AI to mix ideas and navigate obstacles.
- Better Steering: Like EvoRePE, which acts as a "creative nudge" to help the AI break out of its safe, boring habits and try something new.
It's the difference between a robot that can recite the dictionary and a robot that can write a poem that makes you cry. CreativeBench is the tool to help us get there.