Beyond Semantic Similarity: Open Challenges for Embedding-Based Creative Process Analysis Across AI Design Tools

This paper argues that relying solely on fixed embedding similarity for analyzing creative processes in AI design tools is insufficient, because it fails to capture meaningful creative pivots. It outlines three key challenges: aligning metrics with creative significance, handling multimodal traces, and evaluating agentic systems. It then proposes context-aware LLM interventions to better capture session-specific dynamics.

Seung Won Lee, Semin Jin, Kyung Hoon Hyun

Published Tue, 10 Ma

Imagine you are trying to understand how different artists create their masterpieces. Some paint with watercolors, some sculpt with clay, and some design digital worlds.

Currently, when we try to study how these artists work, we usually look at the final picture or the finished statue. We ask, "Is this pretty?" or "Is this useful?" But this is like judging a movie only by the final frame; it tells us nothing about the story, the plot twists, or the journey the characters took to get there.

This paper argues that we need a new way to study the journey of creation, especially when AI is helping the artist. The authors suggest using a tool called "Embedding Analysis," which is like a super-smart translator that turns every word, sketch, or action an artist takes into a mathematical point on a map. If two actions are close together on the map, the computer thinks they are similar.

Here is the problem: The computer's map is too simple. It sees "similarity" only as "looking the same," but it misses the "spark" of creativity.
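The "map" metaphor can be made concrete in a few lines. This is a minimal sketch, not the paper's implementation: the three-dimensional vectors below are toy stand-ins for real embeddings (which have hundreds of dimensions), and only the cosine-similarity computation itself is standard.

```python
import math

def cosine_similarity(a, b):
    # Closeness on the "map": near 1.0 means same direction (similar),
    # near 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" of three creative actions.
sketch_v1 = [0.9, 0.1, 0.2]
sketch_v2 = [0.8, 0.2, 0.3]   # a small revision of the same idea
new_idea  = [0.1, 0.9, 0.1]   # a very different action

print(cosine_similarity(sketch_v1, sketch_v2))  # close to 1: "similar"
print(cosine_similarity(sketch_v1, new_idea))   # much lower: "far apart"
```

Everything the post critiques follows from this one number: the analysis only ever sees how close two points are, never why the creator moved between them.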

The authors use three main metaphors to explain why this current method fails and what we need to fix:

1. The "Same Word, Different World" Trap

Imagine a designer is building a tiny apartment.

  • Step A: They say, "I need stackable modules to save space." (They are thinking about furniture).
  • Step B: They say, "I need stackable modules to change the room layout." (They are now thinking about architecture).

To a human, this is a huge creative leap. The designer didn't just add more furniture; they completely changed the problem they were solving. They pivoted from "storage" to "spatial design."

But to the computer's "Similarity Map," these two steps look almost identical because they both use the words "stackable" and "modules." The computer thinks, "Oh, they are just continuing the same idea." It misses the plot twist. It's like a GPS that sees you turning a corner and thinks you are still driving straight because the road looks the same.
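The trap is easy to reproduce. The sketch below uses crude word-count cosine similarity as a stand-in for an embedding model (an assumption for illustration; real models are more sophisticated but share the failure mode): the two steps from the apartment example score as substantially similar despite the pivot.

```python
import math
from collections import Counter

def bag_of_words_similarity(a, b):
    """Cosine similarity over word counts -- a crude stand-in for
    embedding similarity that exhibits the same surface-level bias."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[w] * wb[w] for w in set(wa) & set(wb))
    return dot / (math.sqrt(sum(c * c for c in wa.values())) *
                  math.sqrt(sum(c * c for c in wb.values())))

step_a = "I need stackable modules to save space"
step_b = "I need stackable modules to change the room layout"

# Shared vocabulary yields a high score, even though the designer
# pivoted from storage (furniture) to spatial design (architecture).
print(round(bag_of_words_similarity(step_a, step_b), 2))  # prints 0.63
```

The score says "same idea, continuing"; a human reader sees a changed problem. That gap between lexical closeness and creative significance is exactly the first challenge.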

The Fix: We need to teach the computer to understand the context and the intent, not just the words. It's like hiring a human editor who knows when a character in a story has changed their mind, even if they are using the same vocabulary.

2. The "Mixed Media" Puzzle

Most creative tools today are messy. You might type a prompt, draw a rough sketch, ask the AI to generate an image, and then tweak a slider. It's a mix of text, pictures, and numbers.

Currently, the computer's map mostly understands text.

  • If you draw a rough sketch and then a polished final version, they might look totally different to a computer (low similarity). But to a human, they are clearly the same idea evolving (high continuity).
  • Conversely, two very different-looking sketches might actually be part of the same creative strategy.

The challenge is figuring out what counts as a single "move" in this messy mix. Is a single brushstroke a move? Or is the whole sketch a move? Without a clear rule, the computer's map gets scrambled, like trying to build a puzzle where some pieces are 3D and others are flat.
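One way to make "what counts as a move" explicit is to log every action as a typed event and then pick a segmentation rule over the log. The sketch below is one possible rule (grouping by pauses), offered as an assumption for illustration; the paper's point is precisely that no canonical rule exists yet.

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    timestamp: float   # seconds since the session started
    modality: str      # "text", "sketch", "image_gen", "parameter"
    payload: str       # prompt text, sketch id, slider name, etc.

def segment_into_moves(events, gap_seconds=30.0):
    """Group events into 'moves' by pauses: any gap longer than
    gap_seconds starts a new move. Whether a brushstroke, a burst,
    or a whole sketch is the right unit is the open question."""
    moves, current, last_t = [], [], None
    for e in sorted(events, key=lambda e: e.timestamp):
        if last_t is not None and e.timestamp - last_t > gap_seconds:
            moves.append(current)
            current = []
        current.append(e)
        last_t = e.timestamp
    if current:
        moves.append(current)
    return moves

session = [
    TraceEvent(0.0, "text", "stackable modules to save space"),
    TraceEvent(5.0, "sketch", "rough_layout_v1"),
    TraceEvent(90.0, "parameter", "wall_height=2.4"),
    TraceEvent(95.0, "image_gen", "render_v1"),
]
print(len(segment_into_moves(session)))  # prints 2: two bursts of activity
```

Change `gap_seconds` and the same session yields a different number of "moves", which is exactly why the map gets scrambled: the analysis depends on a segmentation choice the current methods leave implicit.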

3. The "Self-Driving Car" Problem

This is the trickiest part. In the future, AI won't just be a tool; it will be a partner (an "agent") that makes its own decisions. It will suggest ideas, pick the best ones, and steer the project.

If the AI uses this simple "Similarity Map" to decide what to do next, it might get stuck in a loop.

  • Imagine an AI agent that is told to "be diverse." It might generate 100 wildly different ideas just to look diverse, not because the human designer is exploring new ground.
  • The computer sees a "diverse" path and thinks, "Great, the process is working!"
  • But in reality, the AI is just spinning its wheels, creating noise instead of progress.

The danger is that the AI's own "personality" (how it is programmed) starts to look like the human's creativity. We need a way to tell the difference between "The human is exploring" and "The robot is just following its programming."
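The wheel-spinning failure is easy to simulate. Below, "diversity" is scored as mean pairwise distance between idea embeddings (a naive metric assumed here for illustration, not one the paper endorses). Pure random noise trivially beats a human's small, purposeful steps around a theme, so an agent optimizing this score looks creative while producing nothing.

```python
import math
import random

def mean_pairwise_distance(points):
    """A naive 'diversity' score: average Euclidean distance
    between every pair of idea embeddings."""
    dists = [math.dist(p, q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    return sum(dists) / len(dists)

random.seed(0)
dim = 8

# An agent gaming the metric: pure random noise in embedding space.
noise = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(20)]

# A human exploring: small purposeful variations around one theme.
theme = [0.5] * dim
exploration = [[t + random.uniform(-0.1, 0.1) for t in theme]
               for _ in range(20)]

# Noise wins on "diversity" despite carrying no creative progress.
print(mean_pairwise_distance(noise) > mean_pairwise_distance(exploration))
```

The metric cannot tell the agent's programmed scatter from the human's exploration, which is why the authors argue process metrics must separate the two.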

The Big Picture

The authors are saying: "We have a great new telescope (Embedding Analysis) to watch how people create, but the lens is blurry."

It sees the surface details but misses the deep meaning. To fix this, they propose using Large Language Models (LLMs) as a "smart guide" to help interpret the data. Instead of just measuring how close two things look, we need to ask, "Did the creator change their mind? Did they solve a new problem?"
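What an LLM "smart guide" might be asked could look like the sketch below. The prompt wording and function are hypothetical illustrations of the idea; the paper proposes context-aware LLM interpretation, not this exact prompt or any specific API.

```python
def build_pivot_prompt(step_a, step_b):
    """Builds a prompt asking an LLM to judge creative significance
    (did the problem framing change?) rather than surface similarity.
    Illustrative wording only."""
    return (
        "You are analyzing a design session.\n"
        f"Earlier step: {step_a}\n"
        f"Later step: {step_b}\n"
        "Did the designer's goal or problem framing change between "
        "these steps, even if the wording is similar? "
        "Answer PIVOT or CONTINUATION, then explain in one sentence."
    )

prompt = build_pivot_prompt(
    "I need stackable modules to save space",
    "I need stackable modules to change the room layout",
)
print(prompt)
```

Where a fixed similarity score would report "nearly identical", a judge with session context can label the apartment example a pivot, which is the kind of session-specific signal the authors want.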

By fixing these three issues, we can finally compare how a graphic designer works with an AI versus how a musician works with an AI, giving us a true understanding of human-AI creativity across all fields.