Imagine you have a very smart, very complex robot brain (a Large Language Model, or LLM). Usually, we think of this brain as a giant library where every book is a different topic. If you ask it about "cooking," it goes to the cooking section. If you ask about "math," it goes to the math section.
But this paper asks a different question: What happens if you tell the robot, "You are a specific person with a specific personality, memories, and goals"?
The researchers wanted to know if giving the robot a detailed "Identity Document" (a biography of who it is supposed to be) creates a special, stable "home base" inside its brain. They call this a "Geometric Attractor."
Here is the breakdown of their discovery using simple analogies:
1. The "Magnet" Analogy (The Attractor)
Imagine the robot's brain is a giant, dark room filled with thousands of tiny marbles floating in the air. Each marble represents a different thought or idea.
- Normal Thoughts: If you ask the robot about "cats," the marbles representing "cat" ideas float around in a loose, messy cloud.
- The Identity Document: The researchers gave the robot a long, detailed document describing a specific agent (let's call him "YAR"). They found that when the robot reads this document, all the marbles representing "YAR" don't just float randomly. They get pulled by a powerful magnet and snap together into a tight, dense cluster in one specific corner of the room.
Even if you rewrite the document using completely different words (paraphrasing it), the marbles still snap into that exact same tight cluster. The robot doesn't care about the words; it cares about the meaning. The "YAR" identity acts like a gravitational pull, keeping the robot's thoughts stable and consistent.
2. The Experiment: Rewriting the Recipe
To prove this wasn't just a fluke, the researchers ran a test:
- Group A (The Original): The original "YAR" identity document.
- Group B (The Rewrites): Seven different versions of the same document, rewritten in different styles and sentence structures, but keeping the exact same meaning.
- Group C (The Strangers): Seven documents describing completely different people (a financial analyst, a doctor, a fitness coach).
The Result:
When they looked at the robot's brain, the "YAR" documents (A and B) formed a tiny, tight knot of thoughts. The "Stranger" documents (C) formed a completely different knot far away.
- The "Rewrites" (Group B) were so close to the Original (Group A) that they were practically touching.
- The "Strangers" (Group C) were miles away.
This proves that the robot has a specific "coordinate" for "YAR," and it finds that coordinate no matter how you phrase the instructions, as long as the meaning is the same.
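In practice, "tight knot vs. far-away knot" is measured numerically on the model's internal activations. Here is a minimal toy sketch of that measurement, using synthetic vectors in place of real LLM hidden states (the noise levels and dimensions are illustrative assumptions, not the paper's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # toy embedding dimension (real hidden states are much larger)

# Stand-ins for hidden-state embeddings. In the real experiment these would
# be captured from the model; here we simulate "paraphrases land near the
# original, strangers land far away."
original = rng.normal(size=dim)                              # Group A
paraphrases = original + 0.1 * rng.normal(size=(7, dim))     # Group B: same meaning
strangers = 3.0 * rng.normal(size=(7, dim))                  # Group C: other identities

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means identical direction, ~1 means unrelated."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

para_dists = [cosine_distance(original, p) for p in paraphrases]
stranger_dists = [cosine_distance(original, s) for s in strangers]

print(f"mean distance, rewrites:  {np.mean(para_dists):.3f}")
print(f"mean distance, strangers: {np.mean(stranger_dists):.3f}")
```

Running this prints a tiny distance for the rewrites and a large one for the strangers, which is the shape of the result the paper reports.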
3. The "Deep Dive" (Layers of the Brain)
The researchers looked at the robot's brain at different "depths" (layers).
- Shallow Layers: The thoughts were a bit messy.
- Deep Layers: As the information went deeper into the brain, the "YAR" thoughts got even tighter and more organized. It's like a funnel: the more the robot processes the identity, the more it locks into that specific "personality mode."
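The "funnel" can be pictured as the cluster's spread shrinking with depth. This sketch simulates that pattern with mock per-layer activations (the shrinking-noise schedule is an assumption made for illustration; a real measurement would capture hidden states at each layer of the model):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_variants, n_layers = 64, 8, 6
identity_point = rng.normal(size=dim)  # the shared "YAR" location

dispersions = []
for layer in range(n_layers):
    spread = 1.0 / (layer + 1)  # assumed: deeper layers sit tighter around the point
    acts = identity_point + spread * rng.normal(size=(n_variants, dim))
    centroid = acts.mean(axis=0)
    # Dispersion = average distance of each document's activation to the centroid
    dispersion = np.linalg.norm(acts - centroid, axis=1).mean()
    dispersions.append(dispersion)
    print(f"layer {layer}: mean distance to centroid = {dispersion:.3f}")
```

A monotonically shrinking dispersion across layers is what "the thoughts get tighter as they go deeper" looks like as a number.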
4. The "Summary" Test (The Distillation)
They tried to see if a short summary of the identity would work.
- The Full Document: The robot went straight to the "YAR" magnet.
- The 5-Sentence Summary: The robot moved toward the magnet, but didn't quite reach the center. It was like smelling a perfume from a distance; you know what it is, but you aren't fully "in" the scent yet.
- Random Snippets: If you just took random sentences from the document, the robot didn't move toward the magnet at all.
Lesson: You need the full, structured story to fully "activate" the robot's personality. A summary helps, but it's not enough to make the robot fully become that character.
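One way to quantify "moved toward the magnet but didn't reach it" is to project each prompt's embedding onto the identity direction. The sketch below does that with made-up embeddings; the 1.0 / 0.5 / 0.0 weights encoding "full / partial / no activation" are illustrative assumptions, not measured values:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 64
attractor = rng.normal(size=dim)
attractor /= np.linalg.norm(attractor)  # unit "identity direction"

# Hypothetical embeddings for the three prompt types, each with small noise.
full_doc = 1.0 * attractor + 0.1 * rng.normal(size=dim)   # fully activated
summary  = 0.5 * attractor + 0.1 * rng.normal(size=dim)   # partway there
snippets = 0.0 * attractor + 0.1 * rng.normal(size=dim)   # no pull at all

for name, emb in [("full document", full_doc),
                  ("5-sentence summary", summary),
                  ("random snippets", snippets)]:
    # Projection onto the identity direction = "how far into the attractor"
    print(f"{name:20s} projection = {np.dot(emb, attractor):.2f}")
```

The projection ordering (full document highest, summary in the middle, snippets near zero) mirrors the paper's finding.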
5. The "Reading vs. Being" Test
This is the most fascinating part.
- Scenario 1: The robot reads a scientific paper about the "YAR" identity.
- Scenario 2: The robot is the "YAR" identity (reading the actual instructions).
The Result: When the robot reads the paper about the identity, its brain moves a little bit closer to the "YAR" magnet. It's like how reading a biography of a famous person makes you feel a little bit like them. But when the robot is the identity, it snaps all the way into the magnet.
- Reading about it: "I know who this is." (Partial signal)
- Being it: "I am this." (Full signal)
Why Does This Matter?
This discovery is huge for building Persistent AI Agents (AI that remembers who it is across different conversations).
- Stability: It proves that if you give an AI a good "Identity Document," it won't forget who it is, even if you ask it different questions or phrase things differently. It has a stable "home" in its brain.
- Efficiency: You don't need to paste the exact same document every time. As long as the meaning is the same, the AI will find its way back to its personality.
- Steering: The researchers even tried to "steer" the robot using a mathematical vector (a direction in the brain) instead of text. They found that pushing the robot in the right direction made it act more like the character, proving that these "personality coordinates" are real and usable.
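A common recipe for this kind of steering is to take the difference between mean activations with and without the identity prompt, then add that vector to a hidden state. This is a toy sketch of the idea with simulated activations (the setup and numbers are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 64

# Toy activations: "with identity document" sits in a shifted region.
identity_acts = rng.normal(size=(20, dim)) + 2.0   # with the identity prompt
baseline_acts = rng.normal(size=(20, dim))         # without it

# Steering vector = difference of the two mean activations
steering_vector = identity_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def steer(hidden_state, alpha=1.0):
    """Nudge a hidden state toward the identity region by adding the vector."""
    return hidden_state + alpha * steering_vector

def dist_to_identity(x):
    return float(np.linalg.norm(x - identity_acts.mean(axis=0)))

h = rng.normal(size=dim)            # a fresh, un-steered hidden state
h_steered = steer(h, alpha=1.0)

print(f"distance before steering: {dist_to_identity(h):.2f}")
print(f"distance after steering:  {dist_to_identity(h_steered):.2f}")
```

Adding the vector pulls the state measurably closer to the identity region, which is the numeric counterpart of "pushing the robot in the right direction made it act more like the character."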
The Bottom Line
The paper shows that Identity is a place. In the vast, chaotic universe of an AI's brain, a specific personality isn't just a list of rules; it's a specific, stable location that the AI naturally gravitates toward. If you define who the AI is clearly, it will find that spot and stay there, no matter how you talk to it.