CREATE: Testing LLMs for Associative Creativity

Imagine you are playing a game of "Connect the Dots," but instead of drawing lines between numbers, you are connecting real-world people, places, and ideas with invisible threads of knowledge.

This paper introduces a new game called CREATE. It's designed to test how "creative" Artificial Intelligence (AI) really is. But here's the twist: it doesn't ask the AI to write a poem or paint a picture. Instead, it asks the AI to find hidden, interesting, and surprising connections between two things that seem unrelated.

The Core Idea: The "Mental Web"

Think of an AI's brain as a giant, messy spiderweb of facts.

The Old Way: Most AI tests ask the AI to find the nearest thread. "Who is the father of George Washington?" (Easy: Augustine Washington.)
The CREATE Way: This test asks the AI to find the weird threads. "How is Dakota Johnson connected to a character in Lord of the Rings?"

A normal answer might be: "They both live in the US." (Boring, weak thread).
A creative answer might be: "Dakota Johnson is the stepdaughter of Antonio Banderas. Antonio Banderas voiced Puss in Boots in the Shrek films. Shrek is a fantasy franchise, just like Lord of the Rings." (Surprising, strong, and fun thread).

The Rules of the Game

To win at CREATE, the AI has to follow two golden rules:

Be Specific (The "Strong Thread" Rule):
Imagine you are tying a knot. A weak knot is made of loose string (e.g., "They both exist"). A strong knot is made of steel cable (e.g., "They are step-relatives"). The AI gets points for finding "steel cable" connections, not "loose string" ones.
Be Diverse (The "New Path" Rule):
If the AI finds five ways to connect two people, but all five ways go through "Hollywood movies," it's not very creative. It's just repeating the same pattern. The AI needs to find paths through different worlds: maybe one through sports, one through family, and one through geography. It's like finding five different routes to the same city: one by train, one by boat, one on foot, and so on.

The Scoreboard: "Creative Utility"

How do we grade the AI? The authors invented a score called Creative Utility.
Think of it like a treasure hunt.

If the AI finds one amazing treasure (a very specific, surprising connection), that's good.
If it finds many treasures, but they are all the same kind of rock, that's okay.
If it finds many different kinds of treasures (gold, jewels, ancient maps) that are all high quality, that's a perfect score.

The paper also introduces a "Patience" meter. If a user is impatient, they only care about the best single answer. If they are patient, they want a whole list of great, different answers. The AI is scored based on how well it handles both.

What Did They Find? (The Plot Twist)

The researchers tested the smartest AI models available (like GPT-5, Claude, and Gemini). Here is what happened:

The "Thinking" Trap: You might think that if you tell an AI to "think harder" or give it more time to reason, it will get more creative. Not necessarily. The paper found that just giving the AI more "thinking tokens" (more time to chat with itself) didn't always make it find better connections. Sometimes, it just made the AI repeat the same ideas over and over.
The "Creative Prompt" Myth: Telling the AI "Be creative!" in the instructions didn't help much. It's like telling a person to "be funny" right before a joke; it doesn't guarantee a laugh. The AI needs better tools, not just better instructions.
The Factuality Trade-off: The AI models that found the most creative and diverse paths sometimes made up facts (hallucinations). The models that were most careful about being 100% true to facts sometimes played it too safe and found boring connections. The best models had to learn a delicate balance: Be bold, but be true.

Why Does This Matter?

We often ask AI to write stories or solve math problems. But true creativity is about seeing the world differently. It's about connecting a "what" with a "why" in a way no one expected.

This paper is like a gym for the AI's imagination. It shows us that while AI is getting better at finding these hidden connections, it still struggles to be truly "human" in its creativity. It often gets stuck in loops or plays it too safe.

In a nutshell: CREATE is a new test that asks AI, "Can you connect the dots in a way that surprises me, without making things up?" The answer is: "Getting there, but we still have a long way to go."

A detailed technical summary of the paper "CREATE: Testing LLMs for Associative Creativity" follows.

1. Problem Statement

The paper addresses a critical gap in evaluating Large Language Models (LLMs): how to objectively measure "associative creativity." While creativity is central to scientific discovery, hypothesis generation, and problem-solving, existing benchmarks often fail to capture it.

Limitations of Current Benchmarks: Real-world creative tasks are subjective and hard to evaluate at scale. Conversely, symbolic or abstract tasks (e.g., finding connections between "brick" and "bottle") are too easy for LLMs due to their vast training data and do not reflect the complexity of real-world reasoning.
The Core Challenge: The authors define the problem as the ability to generate novel yet meaningful connections between real-world concepts. This requires navigating a massive search space to find paths that are both high-quality (specific, factual, strong) and diverse (distinct from other generated paths).

2. Methodology: The CREATE Benchmark

The authors introduce CREATE (Connecting Real-world Entities via Associative Thinking), a benchmark designed to evaluate LLMs on open-ended associative reasoning using real-world knowledge graphs.

A. Task Formulation

Input: A natural language query asking for connections between a start entity and a target condition (e.g., "How is Dakota Johnson connected to people who starred in fantasy/sci-fi movies?").
Output: A set of paths (sequences of triples) connecting the start entity to the target condition.
Constraints: Paths must be structurally valid (continuous chains of entities) and factually correct.
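
The structural constraint can be illustrated with a minimal sketch. The `Triple` type, the `is_continuous` helper, and the relation labels are assumptions made for this example, not the paper's data format; the sketch also assumes every triple is written in the direction of travel, which a real knowledge graph does not guarantee. Factual correctness would need a separate lookup against the knowledge graph and is not checked here.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (head, relation, tail) step of a path. Hypothetical representation."""
    head: str
    relation: str
    tail: str


def is_continuous(path: list[Triple], start: str) -> bool:
    """Return True if the path begins at `start` and each triple's head
    equals the previous triple's tail (a continuous chain of entities)."""
    if not path or path[0].head != start:
        return False
    return all(prev.tail == nxt.head for prev, nxt in zip(path, path[1:]))


# Illustrative path; relation labels are informal, not Wikidata property names.
good_path = [
    Triple("Dakota Johnson", "stepfather", "Antonio Banderas"),
    Triple("Antonio Banderas", "voice actor in", "Shrek 2"),
]
# Same triples with a gap: the second hop does not start where the first ended.
broken_path = [
    Triple("Dakota Johnson", "stepfather", "Antonio Banderas"),
    Triple("Melanie Griffith", "spouse", "Antonio Banderas"),
]

print(is_continuous(good_path, "Dakota Johnson"))
print(is_continuous(broken_path, "Dakota Johnson"))
```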

B. Data Construction

Source: Wikidata.
Process: The authors manually selected 12 diverse relation-category pairs (e.g., cast member, Goodfellas). They generated queries by:
1. Selecting two entities from a class (e.g., two actors in Goodfellas).
2. Expanding one entity with an additional informative hop (e.g., Actor A $\to$ Occupation $\to$ Painter).
3. Formulating a query linking the first entity to the expanded entity's attribute.
Dataset Size: 931 natural language queries covering domains like movies, politics, genes, and chemistry.
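
The three construction steps above can be sketched on a toy graph. Everything here is hypothetical: the `CAST` and `ATTRIBUTES` dictionaries stand in for Wikidata lookups, the entity names are placeholders, and the query template is a guess at the general shape rather than the authors' wording.

```python
import itertools

# Toy stand-in for Wikidata; entity names and attributes are placeholders.
CAST = {"Goodfellas": ["Actor A", "Actor B", "Actor C"]}
ATTRIBUTES = {
    "Actor A": [("occupation", "painter")],
    "Actor B": [("occupation", "singer")],
    "Actor C": [],
}


def build_queries(category: str) -> list[str]:
    """Generate queries linking one entity to an attribute of another."""
    queries = []
    # Step 1: choose ordered pairs of entities from the same class.
    for source, other in itertools.permutations(CAST[category], 2):
        # Step 2: expand the second entity with one additional hop.
        for relation, value in ATTRIBUTES[other]:
            # Step 3: ask how the first entity reaches that attribute.
            queries.append(
                f"How is {source} connected to people whose {relation} is {value}?"
            )
    return queries


queries = build_queries("Goodfellas")
for q in queries:
    print(q)
```

Because the two seed entities share a class, at least one path (through the shared film) is known to exist, while the wording of the query leaves the model free to find other routes.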

C. Evaluation Metrics

The paper proposes a unified framework to measure creativity based on Quality and Diversity.

Quality ( $f(u)$ ):
- Based on Specificity: A path is strong if its relations are exclusive (e.g., "step-daughter" is more specific than "citizen of the same country").
- Calculated as the specificity of the weakest triple in the path.
- Factuality is a binary gate ( $q(u)=1$ if all triples are true, $0$ otherwise).
Distance ( $d(u_i, u_j)$ ):
- Measured via cosine distance of string embeddings of the paths.
- A transformation function rescales distances to penalize trivial paraphrasing while rewarding distinct conceptual jumps.
Creative Utility ( $s(U)$ ):
- A unified metric combining quality and diversity, inspired by submodular functions.
- Formula: $s(U) = \max_{\tau} \sum_{i=1}^{|U|} \gamma^{i-1} f(u_{\tau(i)}) \min_{j<i} d(u_{\tau(i)}, u_{\tau(j)})$, where $\tau$ ranges over orderings of the paths in $U$.
- Patience ( $\gamma$ ): A parameter controlling the trade-off between quantity and quality. Lower $\gamma$ favors high-quality single paths; higher $\gamma$ rewards larger, diverse sets.
Distinctiveness ( $\nu$ ): Measures how far a model's best output is from the "population" of all model outputs, capturing historical novelty.
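
The quality and Creative Utility definitions above can be made concrete with a small sketch. Several things are assumptions for illustration: per-triple specificity scores and the factuality flag are supplied by the caller, the distance matrix is taken as already transformed, the first path in an ordering (which has no predecessor) is given a novelty of 1, and the maximum over orderings is found by brute force, which is only practical for small sets and may not be how the authors compute it.

```python
import itertools
import math


def path_quality(specificities: list[float], all_true: bool) -> float:
    """f(u): specificity of the weakest triple, gated to 0 if any triple is false."""
    return min(specificities) if all_true else 0.0


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def creative_utility(qualities: list[float], dist: list[list[float]], gamma: float) -> float:
    """s(U): best discounted sum of quality times novelty over all orderings."""
    best = 0.0
    for order in itertools.permutations(range(len(qualities))):
        total = 0.0
        for rank, i in enumerate(order):
            # Novelty = distance to the closest earlier path.
            # Assumption: the first path has no predecessor, so novelty is 1.
            novelty = min((dist[i][j] for j in order[:rank]), default=1.0)
            total += gamma ** rank * qualities[i] * novelty
        best = max(best, total)
    return best


# Paths 0 and 1 are near-paraphrases; path 2 is weaker but very different.
qualities = [0.9, 0.8, 0.3]
dist = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.8],
    [0.8, 0.8, 0.0],
]
impatient = creative_utility(qualities, dist, gamma=0.0)
patient = creative_utility(qualities, dist, gamma=1.0)
print(impatient, patient)
```

With $\gamma = 0$ only the single best path counts. With $\gamma = 1$ the near-duplicate path 1 adds little despite its high quality, while the distinct path 2 contributes more than its raw quality ranking would suggest.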

3. Key Contributions

The CREATE Benchmark: A novel, knowledge-grounded benchmark that bridges the gap between abstract symbolic reasoning and subjective real-world creativity. It is verifiable (via knowledge graphs) yet complex enough to challenge frontier models.
Unified Metric for Creativity: The introduction of the Creative Utility metric, which mathematically balances the strength of a connection against its uniqueness relative to other generated ideas.
Empirical Analysis of "Thinking" Models: The paper rigorously tests whether "Chain-of-Thought" (CoT) or "Thinking" models (which spend more tokens on reasoning) actually improve creative performance.
Prompting Interventions: Evaluation of strategies like "Be Creative," "Verbalized Sampling," and "Iterative Regeneration" to see if they can unlock better associative reasoning.

4. Results

The authors evaluated frontier models (GPT-4.1, GPT-5, Claude-3/4, Gemini-3, Qwen, OLMo) and found:

Frontier Models Lead: The strongest models (GPT-5, Gemini-3-Pro) achieve the highest creative utility scores, generating both high-quality and diverse paths.
The "Thinking" Paradox: Increasing reasoning effort (token budgets) does not necessarily lead to higher creativity scores.
- "Thinking" models often explore similar entities and relations as non-thinking models but with more repetition.
- Simply spending more compute does not guarantee finding the "needle in the haystack" (highly specific, obscure connections).
Prompting Interventions:
- Iterative/Resampling: Asking the model to generate different answers after seeing previous ones (Iterative) or resampling (Resample) yields the best improvements in utility.
- "Be Creative" Instructions: Merely telling the model to "be creative" has negligible impact.
- Verbalized Sampling: Asking models to output probability distributions significantly reduced the number of valid paths generated.
Quality vs. Factuality Trade-off:
- Models like Gemini-3-Pro generate highly diverse paths but with lower factuality.
- GPT-5 maintains a better balance, achieving high utility even under strict factuality constraints, whereas open-source models drop significantly in performance when factuality is strictly enforced.
Distinctiveness: Even top models struggle to generate paths that are truly distinct from the "population" of other models, suggesting a ceiling in current LLM creativity.
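
The iterative strategy mentioned in the prompting results can be sketched as a simple loop. This is a rough illustration under stated assumptions, not the paper's protocol: `generate` is a hypothetical callable wrapping whatever model is being tested, and the prompt wording and the number of rounds are invented for the example.

```python
from typing import Callable


def iterative_regeneration(
    query: str,
    generate: Callable[[str], list[str]],  # hypothetical model wrapper
    rounds: int = 3,
) -> list[str]:
    """Collect paths over several rounds, showing the model what it already produced."""
    collected: list[str] = []
    for _ in range(rounds):
        prompt = query
        if collected:
            seen = "\n".join(f"- {p}" for p in collected)
            # Prompt wording is an assumption made for this sketch.
            prompt += f"\n\nPaths already found:\n{seen}\nGive different paths."
        for path in generate(prompt):
            if path not in collected:  # drop exact repeats only
                collected.append(path)
    return collected


def fake_model(prompt: str) -> list[str]:
    """Stand-in model: returns one new path per round plus a repeated one."""
    already_seen = prompt.count("\n- ")
    return [f"path {already_seen}", "path 0"]


result = iterative_regeneration("How is X connected to Y?", fake_model, rounds=3)
print(result)
```

Exact-match deduplication is the weakest possible filter; paraphrased repeats would pass through it, which is why the benchmark scores diversity with embedding distance rather than string equality.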

5. Significance

Beyond Retrieval: The paper demonstrates that associative creativity is not just a retrieval task but a structured search problem requiring divergent thinking (exploring many paths) and convergent thinking (pruning to find strong links).
Limitations of Current AI: The results suggest that test-time scaling (more reasoning tokens = better results) has diminishing returns for creative tasks. The "search space" of human knowledge is too vast for current models to navigate efficiently without specific architectural or training innovations.
Future Directions: The benchmark provides a "sandbox" for developing new methods to improve LLMs' capacity for associative creativity, which is essential for applications in scientific discovery, hypothesis generation, and automated research.
Responsible AI: The authors emphasize that while these tools can augment human creativity, fully automating creative processes carries risks regarding job displacement and the homogenization of creative output.

In summary, CREATE establishes a rigorous, objective standard for measuring LLM creativity, revealing that while frontier models are capable, they still lack the ability to consistently generate truly novel and distinct associations without significant human guidance or architectural shifts.