Morphological Addressing of Identity Basins in Text-to-Image Diffusion Models

This paper demonstrates that morphological structures in prompts, from explicit feature descriptions down to sub-lexical sound-symbolic patterns, create navigable gradients within text-to-image diffusion models. These gradients allow systematic navigation to specific identity basins and coherent visual concepts without requiring target names or training images.

Andrew Fraser

Published 2026-02-24

Imagine you have a giant, magical library where every book is a picture. This library was built by a robot that read billions of books and looked at billions of photos. The robot didn't just store the photos; it learned the vibe of everything. It knows what "Marilyn Monroe" looks like, but it also knows what "platinum blonde hair," "a beauty mark," and "1950s glamour" look like separately.

This paper is about two clever tricks to navigate this library without using the "name tags" (like "Marilyn Monroe" or "Crungus") that the robot might have been told to ignore. Instead, the researchers used Morphological Addressing—which is a fancy way of saying "using the building blocks of language to find specific places in the robot's mind."

Here is the story of their two main discoveries, explained simply:

1. The "Marilyn" Puzzle (Study 1)

The Problem: You can't just ask the robot to draw "Marilyn Monroe" because the library has rules against famous names. Even if you try to describe her, the robot might just draw a generic blonde woman.

The Solution: The researchers realized that Marilyn Monroe isn't just a name; she is a specific intersection of features. Think of it like a Venn diagram.

  • Circle A: Platinum blonde hair.
  • Circle B: A beauty mark on the cheek.
  • Circle C: 1950s Hollywood style.

Where these three circles overlap is "Marilyn." The researchers didn't use her name. Instead, they fed the robot a list of these overlapping features over and over again, teaching it a special "map" (called a LoRA) to find that specific intersection.
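The caption idea above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual training pipeline: the feature phrases come from the Venn-diagram example, and shuffling their order is one plausible way to make a LoRA learn the intersection rather than a fixed string.

```python
import random

# Feature phrases from the article's Venn-diagram example.
# The target name never appears anywhere in the captions.
FEATURES = [
    "platinum blonde hair",
    "a beauty mark on the cheek",
    "1950s Hollywood glamour styling",
]

def make_captions(n: int, seed: int = 0) -> list[str]:
    """Generate n name-free training captions, each listing the same
    features in a shuffled order so the model learns the intersection."""
    rng = random.Random(seed)
    captions = []
    for _ in range(n):
        order = FEATURES[:]
        rng.shuffle(order)
        captions.append("portrait of a woman with " + ", ".join(order))
    return captions
```

A LoRA trained on captions like these has no name to memorize; the only stable signal across the set is the feature intersection itself.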

The Magic Result:

  • The Magnet: Once they built this map, they could ask for a simple "portrait of a woman," and the robot would pull the image toward the Marilyn spot.
  • The Inverse: They also tested what happens if they push the robot away from Marilyn.
    • Without the map, the robot just makes weird, broken monsters (like a horror movie).
    • With the map, the robot makes something called the "Uncanny Valley." It looks like a human, but slightly wrong—like a doll with hollow eyes. The map was so strong it shaped not just the "good" version, but also the "weird" version. It's like having a magnet that pulls metal toward it, but also pushes other metal into a specific, strange shape.
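The "pull toward" and "push away" conditions map naturally onto the prompt / negative-prompt split found in common diffusion interfaces. The sketch below is an assumption about how such an experiment could be wired up (the `negative_prompt` name follows the Hugging Face diffusers convention; the paper's exact setup is not reproduced here):

```python
BASE_PROMPT = "portrait of a woman"
FEATURES = ("platinum blonde hair, beauty mark on cheek, "
            "1950s Hollywood glamour")

def make_condition(direction: str) -> dict:
    """Build a generation config that pulls toward or pushes away
    from the feature intersection the LoRA was trained on."""
    if direction == "toward":
        # With the LoRA loaded, even the bare prompt drifts toward
        # the learned identity basin (the "magnet" effect).
        return {"prompt": BASE_PROMPT, "negative_prompt": ""}
    if direction == "away":
        # Negating the features steers sampling out of the basin;
        # with the LoRA loaded this yields the uncanny-valley results.
        return {"prompt": BASE_PROMPT, "negative_prompt": FEATURES}
    raise ValueError(f"unknown direction: {direction!r}")
```

The interesting finding is that the *same* config behaves differently with and without the LoRA loaded: the map shapes both the attracted and the repelled images.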

2. The "Crungus" Hunt (Study 2)

The Problem: The internet had a mystery. People found that if you type a nonsense word like "Crungus" into the robot, it draws the exact same weird creature every time. But "Crungus" doesn't exist! How does the robot know what it is?

The Solution: The researchers looked at Sound Symbolism (Phonesthemes). This is the idea that certain sounds in English naturally feel like certain things.

  • "Cr-" sounds like crashing or breaking (Crash, Crush, Crumble).
  • "Sn-" sounds like sneaking or noses (Snout, Sniff, Sneak).
  • "-oid" sounds like a robot or a thing that resembles something (Android, Humanoid).

They made up 200 new nonsense words using these sound blocks. For example, they made "Snudgeoid" (Sn- + sludge + -oid).
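Crossing sound blocks like this is easy to sketch. The inventories below are illustrative stand-ins (the paper's actual phonestheme list is larger, and how it reaches 200 words is not shown here), but the construction is the same: onset + root + suffix.

```python
# Illustrative sound blocks; the paper's inventory is larger.
ONSETS = {"cr": "crashing/breaking", "sn": "sneaky/nose-like", "gl": "light/shine"}
ROOTS = ["udge", "ash", "oom"]
SUFFIXES = {"oid": "resembles / robot-like", "ix": "comic-book suffix", "ax": "tool/vehicle"}

def coin_words() -> list[str]:
    """Cross every onset, root, and suffix into a candidate nonsense word."""
    return sorted(onset + root + suffix
                  for onset in ONSETS
                  for root in ROOTS
                  for suffix in SUFFIXES)
```

With these three small inventories the cross product already yields 27 coinages, including "snudgeoid" (sn- + udge + -oid) and "crashax" (cr- + ash + -ax).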

The Magic Result:

  • When they asked the robot to draw "Snudgeoid," it didn't draw random noise. It drew a robot made of sludge.
  • When they asked for "Crashax" (Crash + Ax), it drew a rugged off-road vehicle.
  • When they asked for "Broomix" (Broom + the comic book suffix -ix), it drew a cartoon character that looks like it belongs in an Asterix comic.

Why this matters:
The robot wasn't remembering a picture of a "Snudgeoid" because no one ever took a photo of one. Instead, the robot was building the picture from the sounds. It heard "Sn-" and thought "slimy/metal," heard "-oid" and thought "robot," and glued them together.
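The "glue the sounds together" step can be made concrete: decompose a coinage into its sound blocks and look up the visual association of each. The mapping table below is a hypothetical reconstruction built only from the examples in this article:

```python
# Illustrative phonestheme-to-visual-association table, based on the
# examples above. Leading "x-" entries match word onsets; "-x" entries
# match word endings.
PHONESTHEMES = {
    "sn-": "slimy, nose-like, sneaking",
    "cr-": "crashing, crushing, rugged",
    "-oid": "robot, resembles-a-thing",
    "-ax": "tool or vehicle",
    "-ix": "comic-book character",
}

def visual_hints(word: str) -> list[str]:
    """Return the visual associations of every sound block found in word."""
    w = word.lower()
    hints = []
    for block, meaning in PHONESTHEMES.items():
        if block.endswith("-") and w.startswith(block[:-1]):
            hints.append(meaning)
        elif block.startswith("-") and w.endswith(block[1:]):
            hints.append(meaning)
    return hints
```

Running this on "Snudgeoid" yields the slimy/sneaking association from "Sn-" plus the robot association from "-oid", which is exactly the composite the model drew.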

The Big Picture: The Library is Organized

The main takeaway is that the robot's brain (its "latent space") isn't a chaotic mess. It's actually very structured, like a city with neighborhoods.

  1. You can find things without names: You can navigate to a specific "neighborhood" (like Marilyn Monroe) just by describing the street signs (features) that lead there.
  2. Sounds have maps: The way a word sounds gives the robot a map to a specific visual neighborhood. If you use the right sound blocks, you can invent a new creature that the robot will draw consistently, even if that creature has never existed before.

In short: The researchers proved that you don't need to know the "secret password" (the name) to find a specific place in the robot's imagination. You just need to know the grammar of the sounds and features that build that place. They turned the robot's brain from a black box into a map we can actually read.
