Conjuring Semantic Similarity

Imagine you have two friends, Alice and Bob. You want to know how similar their "personalities" are.

The Old Way (Text-Based):
Traditionally, to compare them, you'd ask them to describe themselves. If Alice says, "I love hiking and dogs," and Bob says, "I enjoy walking in nature and canines," a computer would look at the words. It sees "hiking" and "walking" are different, and "dogs" and "canines" are different. It tries to guess they are similar based on how often these words appear together in books. It's like comparing two people by reading their resumes.

The New Way (Image-Based):
This paper proposes a fun, magical alternative. Instead of asking them to describe themselves, you ask them to imagine a scene.

You tell Alice: "Imagine a dog."
You tell Bob: "Imagine a canine."

Then, instead of comparing their words, you look at the mental movies playing in their heads.

If Alice's mental movie shows a Golden Retriever playing in a park, and Bob's mental movie shows a Golden Retriever playing in a park, you know they are very similar.
If Alice imagines a Golden Retriever, but Bob imagines a Chihuahua on a skateboard, you know their "meanings" are quite different, even though they used similar words.

The Problem with Humans

The catch is that humans are bad at this. We can't easily project our mental movies onto a screen to compare them side-by-side. We just have to guess.

The Magic Trick (AI)

The authors of this paper say: "Let's use AI to do the projecting."

They use a type of AI called a Diffusion Model. Think of this AI as a magical artist who starts with a canvas full of static noise (like TV snow). When you give it a prompt (a text description), it slowly "denoises" the static, turning the chaos into a clear picture.

Prompt A: "Snow Leopard"
Prompt B: "Bengal Tiger"

The AI starts with the same random noise for both.

It slowly turns the noise into a Snow Leopard.
It slowly turns the same noise into a Bengal Tiger.

The "Conjuring" Method

The paper's big idea is to measure the distance between these two prompts not by looking at the final pictures, but by watching how the AI transforms the noise.

Imagine the AI is a chef.

Prompt A tells the chef to make a Spicy Curry.
Prompt B tells the chef to make a Sweet Cake.

The "Old Way" would be to taste the final dishes and say, "Well, one is spicy and one is sweet, so they are different."

The "New Way" (Conjuring Semantic Similarity) watches the entire cooking process.

At step 1, the chef adds salt.
At step 2, the chef adds sugar.
At step 3, the chef adds flour.

The paper calculates the "distance" by measuring how much the chef's actions differ at every single step of the process. If, at each step, the AI following the "Snow Leopard" recipe does something very different from what it would do for the "Bengal Tiger" recipe, the prompts are very different. If the differences are tiny, the prompts are very similar.

Why is this cool?

It's Visual: Instead of giving you only a number (like "75% similar"), it also gives you a visual explanation. You can literally see the AI morphing a Snow Leopard into a Tiger, showing you exactly where the differences lie (e.g., changing spots to stripes).
It Understands "Vibe": It captures the feeling of the image. Two words might be totally different textually, but if they conjure the same kind of image in the AI's mind, the method knows they are similar.
It Tests the AI: It helps us understand what the AI actually "thinks." For example, the paper found that while the AI is great at understanding nouns (like "dog" vs. "cat"), it gets a bit confused with verbs and adjectives. It's like the AI is a great painter but a poor poet.

The Catch

The main downside is speed. To do this, the AI has to "paint" each picture several times to get an accurate measurement. It's like asking the chef to cook both meals over and over just to measure the difference between two recipes. That costs more compute than comparing word embeddings, although the paper reports that a handful of repetitions is enough.

In summary: This paper teaches us how to measure how similar two ideas are by asking an AI to "dream" images of them and watching how the dream changes from start to finish. It turns abstract word comparisons into a visual, step-by-step movie.

1. Problem Statement

The paper addresses the challenge of defining and measuring semantic similarity for text-conditioned generative models, specifically diffusion models.

The Gap: While Large Language Models (LLMs) have established methods for measuring semantic similarity based on text continuations or embeddings, there is no standard framework for quantifying the semantic space of image generation models.
The Limitation of Current Approaches: Existing metrics for diffusion models (e.g., FID, CLIP score) focus on image quality or diversity but fail to measure the semantic alignment between the model's learned representations and human understanding. Furthermore, traditional semantic similarity relies on text-to-text comparison, which ignores the "visual grounding" inherent in image generation models.
The Core Question: How can we measure the distance between two textual prompts based on the imagery they evoke in a diffusion model, rather than how they can be rephrased?

2. Methodology

The authors propose a novel approach called "Conjuring Semantic Similarity," which defines semantic distance based on the divergence between the image distributions induced by two different text prompts.

Theoretical Framework

Diffusion as SDEs: The method treats text-conditioned diffusion models as Stochastic Differential Equations (SDEs). For a prompt $y$, the reverse-time SDE is defined as:
$dx = [f(x, t) - g(t)^2 s_\theta(x, t|y)]\,dt + g(t)\,d\bar{w}_t$
where $s_\theta$ is the score function conditioned on text $y$, $f$ and $g$ are the drift and diffusion coefficients of the forward process, and $\bar{w}_t$ is a reverse-time Wiener process.
Semantic Distance via Jeffreys Divergence: To compare two prompts $y_1$ and $y_2$, the authors compute the Jeffreys Divergence (symmetrized Kullback-Leibler divergence) between the path measures of the two induced SDEs.
Monte-Carlo Estimation: Using the Girsanov theorem and Novikov's condition, the authors derive that the KL divergence between the two SDEs simplifies to an expectation over the squared Euclidean distance of the score functions (denoising predictions) at each timestep:
$d(y_1, y_2) \propto \mathbb{E}_{t, x} \left[ g(t)^2 \| s_\theta(x, t|y_1) - s_\theta(x, t|y_2) \|_2^2 \right]$
This allows the semantic distance to be computed directly via Monte-Carlo sampling:
1. Sample a noise vector $x_T$ from the prior.
2. Run the denoising process for both $y_1$ and $y_2$ to generate sequences of intermediate states.
3. Calculate the squared difference between the model's predictions ($s_\theta$) for both prompts at every timestep.
4. Average these differences over multiple samples and timesteps.
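The step from the Jeffreys divergence to the squared score difference can be written out as a short derivation. This is a sketch under standard Girsanov assumptions, not the paper's own presentation; the path-measure symbols $P_1$ and $P_2$ and the time horizon $T$ are notation introduced here for illustration.

```latex
% P_i: path measure of the reverse-time SDE conditioned on y_i (notation assumed here)
d(y_1, y_2) = \mathrm{KL}(P_1 \,\|\, P_2) + \mathrm{KL}(P_2 \,\|\, P_1)

% Both SDEs share the diffusion coefficient g(t); their drifts differ by
% g(t)^2 [ s_\theta(x, t|y_1) - s_\theta(x, t|y_2) ].
% Girsanov: KL = 1/2 E \int \| drift difference \|^2 / g(t)^2 dt
\mathrm{KL}(P_1 \,\|\, P_2)
  = \frac{1}{2}\, \mathbb{E}_{x \sim P_1} \int_0^T
    \frac{\big\| g(t)^2 \big[ s_\theta(x_t, t|y_1) - s_\theta(x_t, t|y_2) \big] \big\|_2^2}{g(t)^2} \, dt
  = \frac{1}{2}\, \mathbb{E}_{x \sim P_1} \int_0^T
    g(t)^2 \, \big\| s_\theta(x_t, t|y_1) - s_\theta(x_t, t|y_2) \big\|_2^2 \, dt
```

The second KL term has the same integrand with trajectories drawn from $P_2$, which is why both denoising trajectories are needed. Replacing the time integral with an average over sampled timesteps gives the proportionality stated above.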

Algorithm

The process is summarized in Algorithm 1:

Initialize distance $d=0$.
For each of $k$ Monte-Carlo steps:
- Sample $x_T \sim \pi$ .
- Denoise $x_T$ with prompt $y_1$ to get sequence $\hat{x}$ .
- Denoise $x_T$ with prompt $y_2$ to get sequence $\tilde{x}$ .
- Accumulate the sum of squared differences between the score predictions of the two sequences.
Return the average distance.
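The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `toy_score` is a hypothetical stand-in for the diffusion model's conditional score $s_\theta(x, t|y)$ (a Gaussian score centred on a per-prompt mean), the forward process is assumed to have $f = 0$ and $g = 1$, and evaluating both scores along both trajectories is one reading of the accumulation step.

```python
import numpy as np

def toy_score(x, t, mean):
    """Hypothetical stand-in for s_theta(x, t | y): score of N(mean, (1 + t) I)."""
    return -(x - mean) / (1.0 + t)

def conjure_distance(mean_1, mean_2, k=5, num_steps=10, dim=4, seed=0):
    """Monte-Carlo estimate of the averaged squared score difference."""
    rng = np.random.default_rng(seed)
    timesteps = np.linspace(1.0, 0.1, num_steps)  # from noisy to clean
    dt = timesteps[0] - timesteps[1]
    total = 0.0
    for _ in range(k):
        x_T = rng.standard_normal(dim)  # sample x_T from the prior
        for guide in (mean_1, mean_2):  # one trajectory per prompt
            x = x_T.copy()
            for t in timesteps:
                s1 = toy_score(x, t, mean_1)
                s2 = toy_score(x, t, mean_2)
                total += np.sum((s1 - s2) ** 2)  # g(t) = 1 in this toy
                # Euler-Maruyama step of the reverse SDE (assumes f = 0, g = 1)
                noise = rng.standard_normal(dim)
                x = x + toy_score(x, t, guide) * dt + np.sqrt(dt) * noise
    return total / (2 * k * num_steps)

same = conjure_distance(np.zeros(4), np.zeros(4))
near = conjure_distance(np.zeros(4), np.ones(4))
far = conjure_distance(np.zeros(4), 3 * np.ones(4))
```

In this toy the score difference does not depend on $x$, so the estimate has no sampling variance; with a real denoiser the difference varies along the trajectory, which is why averaging over several Monte-Carlo steps matters.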

3. Key Contributions

Visual-Grounded Semantic Space: The paper introduces a definition of semantic similarity that is purely "visually-grounded," measuring similarity based on the divergence of generated image distributions rather than textual rephrasing.
Interpretability: Unlike black-box embedding distances, this method provides a visual "explanation" for semantic differences. By observing how the model transforms noisy images differently for two prompts (e.g., changing spots to stripes), one can visualize the semantic shift (see Figure 1 in the paper).
First Quantitative Alignment Metric for Diffusion Models: It is the first method to quantify how well the semantic representations learned by text-conditioned diffusion models align with human annotations.
Efficient Computation: The method is computationally feasible, requiring only a small number of Monte-Carlo steps ($k \approx 3$ to $5$) and a limited number of timesteps ($T=10$) to converge.

4. Experimental Results

The authors evaluated their method using the Semantic Textual Similarity (STS) and SICK-R datasets, comparing their scores against human annotations using Spearman Correlation.
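To make the evaluation protocol concrete, the sketch below computes a Spearman correlation between human similarity ratings and model distances. The numbers are invented for illustration and are not from the paper; the helper names `rank` and `spearman` are likewise hypothetical. Distances are negated before ranking because a larger distance should correspond to a lower similarity rating.

```python
import numpy as np

def rank(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

# Hypothetical data: human ratings on a 0-5 scale and model distances.
human_similarity = [4.8, 1.2, 3.5, 0.4, 2.9]
model_distance = [0.10, 0.75, 0.30, 0.90, 0.55]

# Negate distances so that "more similar" maps to a higher score.
rho = spearman(human_similarity, [-d for d in model_distance])
```

Because the invented distances decrease exactly as the ratings increase, `rho` here equals 1 up to floating-point error. Benchmark tables commonly report this value multiplied by 100, which is presumably the scale of the scores quoted below.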

Performance vs. Baselines:
- The proposed method achieved a Spearman correlation of ~65.4 on average across STS benchmarks.
- It outperformed zero-shot encoder-based models (e.g., BERT-mean: ~54.8) and came close to autoregressive LLMs (e.g., LLaMA-33B: ~66.6).
- While it did not surpass contrastive embedding models like CLIP (whose training objective directly shapes the embedding space), it demonstrated that semantic structures can be extracted effectively from diffusion models without specific semantic training.
Qualitative Analysis:
- Taxonomy Clustering: Pairwise distance matrices showed that the method correctly clusters words by hypernym classes (e.g., dog breeds cluster together; marine animals cluster together).
- Part-of-Speech Sensitivity: The method preserved semantic relations for nouns well but showed degradation for verbs and adjectives, suggesting that diffusion models (often bottlenecked by text encoders like CLIP) struggle to capture fine-grained semantic relations for non-nominal concepts during the reverse diffusion process.
Ablation Studies:
- Timesteps: A uniform prior over all timesteps yielded the best results.
- Iterations: The method converges quickly with few Monte-Carlo steps ($k=3$ to $5$).
- Model Robustness: Results were consistent across different Stable Diffusion versions (v1.4, SD3, SD-XL).
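The taxonomy clustering check described under Qualitative Analysis can be illustrated with a pairwise distance matrix. This is only a sketch: the 2-D vectors and `toy_distance` are hypothetical stand-ins for prompts and for the diffusion-based distance, and the nearest-neighbour test is one simple way to ask whether words from the same hypernym class end up close together.

```python
import numpy as np

# Hypothetical stand-ins: (hypernym class, 2-D vector) per word.
# A real run would replace toy_distance with the diffusion-based distance.
concepts = {
    "beagle": ("dog", np.array([0.0, 0.1])),
    "poodle": ("dog", np.array([0.2, 0.0])),
    "dolphin": ("marine", np.array([5.0, 5.1])),
    "orca": ("marine", np.array([5.2, 4.9])),
}

def toy_distance(a, b):
    """Squared Euclidean distance between two stand-in vectors."""
    return float(np.sum((a - b) ** 2))

names = list(concepts)
n = len(names)
matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        matrix[i, j] = toy_distance(concepts[names[i]][1], concepts[names[j]][1])

np.fill_diagonal(matrix, np.inf)  # ignore each word's distance to itself
nearest = {names[i]: names[int(np.argmin(matrix[i]))] for i in range(n)}

# True when every word's nearest neighbour belongs to the same class.
same_class = all(concepts[a][0] == concepts[b][0] for a, b in nearest.items())
```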

5. Significance and Limitations

Significance:

New Evaluation Paradigm: Shifts the evaluation of generative models from "image quality" to "semantic alignment," offering a way to audit what a model has actually learned about the world.
Interpretability: Provides a mechanism to "see" semantic differences, making the latent space of diffusion models more transparent.
Foundation for Future Work: Opens avenues for evaluating and improving text-conditioned generative models by quantifying their alignment with human concepts.

Limitations:

Scope: Not intended as a general-purpose retrieval metric; it is specific to evaluating the semantic space of image generators.
Abstract Concepts: Struggles with non-visual concepts (e.g., "imaginary numbers," "conscience") where imagery cannot fully capture meaning.
Bottleneck: The semantic quality is limited by the pre-trained text encoder (e.g., CLIP) used to condition the diffusion model.
Ambiguity: Does not resolve linguistic ambiguity; it simply reveals which visual interpretation the model has associated with a prompt.
Compute Cost: Requires multiple inference passes through the diffusion model, which is computationally heavier than simple embedding lookups, though ablation studies show it is manageable.

In conclusion, "Conjuring Semantic Similarity" offers a rigorous, mathematically grounded, and visually interpretable framework for understanding the semantic capabilities of diffusion models, bridging the gap between text-based semantics and visual generation.
