Trained Persistent Memory for Frozen Encoder–Decoder LLMs: Six Architectural Methods

This paper presents a proof-of-concept pilot study demonstrating that six differentiable architectural methods can successfully equip frozen encoder-decoder LLMs with persistent, trainable continuous latent memory to enable conversational learning, while highlighting memory capacity as a critical factor for performance.

Hong Jeong

Published 2026-03-18

The Big Problem: The "Goldfish" AI

Imagine you have a very smart friend (an AI) who is incredibly knowledgeable about the world. However, this friend has a terrible memory: they forget everything the moment you stop talking to them.

If you tell them, "I love pizza," and then ask them 10 minutes later, "What do I like?", they will have no idea. They are "stateless." Every time you start a new conversation, they are a blank slate.

Current AI assistants try to solve this by keeping a text notebook outside the brain. They write down your secrets in a document, search that document when you ask a question, and then read the answer back to you. This works, but it's clunky. It's like asking a chef to stop cooking, run to a library to find a recipe book, read a page, and then come back to the kitchen.

The Paper's Solution: A "Brain Implant"

This paper proposes a different idea. Instead of writing things down on paper (text), what if we could install a tiny, permanent memory chip directly inside the AI's brain?

The researchers took a frozen, pre-trained AI (like a high-performance engine that they weren't allowed to rebuild) and attached a small, trainable "adapter" (a memory chip) to it. This chip lives in the latent space—which is just a fancy way of saying the "mathematical thoughts" inside the AI, rather than the words it speaks.

The Analogy:
Think of the AI's brain as a massive, frozen library. You can't change the books on the shelves (the frozen weights). But, you can add a smart librarian (the adapter) who sits in the lobby.

  • The Old Way: The librarian writes your request on a sticky note, runs to the archives, finds the note, and brings it back.
  • This Paper's Way: The librarian has a special, glowing mental notepad that updates instantly. When you speak, the librarian writes the thought directly onto this notepad in a secret code the library understands. When you ask a question later, the librarian instantly checks the notepad and whispers the answer to the library.

How They Tested It: The "Six Architectures"

The researchers didn't just guess one way to do this. They built six different types of memory chips to see which one worked best. They tested them on a "frozen" AI (Flan-T5-XL) using a single dataset of long conversations.
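If you like code, here is a tiny sketch of what "frozen brain plus trainable memory chip" means in practice. This is not the paper's code: a toy linear layer stands in for Flan-T5-XL, and the names are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the frozen, pre-trained LLM: its weights are never updated.
base = nn.Linear(8, 8)
for p in base.parameters():
    p.requires_grad_(False)

# The trainable "memory chip": a small bank of latent slot vectors.
memory = nn.Parameter(torch.randn(4, 8))

x = torch.randn(3, 8)                        # current-turn hidden states
out = base(torch.cat([memory, x], dim=0))    # frozen model reads memory + input
loss = out.pow(2).mean()
loss.backward()                              # gradients flow only into memory

print(base.weight.grad)    # None: the frozen "library" got no update
print(memory.grad.shape)   # torch.Size([4, 8]): only the adapter learns
```

The key trick is visible in the two printouts: training signal reaches the memory bank while the base model stays untouched.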

Here are the six methods, simplified:

  1. The Prefix (M.1): Like sticking a sticky note on the front of the AI's input before it even reads it.
  2. The Parallel Stream (M.2): Like giving the AI a second pair of eyes that looks at the memory while the main eyes look at the current question.
  3. The Extended Key (M.3): Like adding extra pages to the back of the current book so the AI can read them while it's reading the main story.
  4. The Associative Net (M.4): Like a spiderweb. When a new thought comes in, it connects to old thoughts based on how similar they are (like how your brain connects "Paris" to "Eiffel Tower").
  5. The Gated Stream (M.5): Like a bouncer at a club. It lets memories into the AI's brain only when they are relevant.
  6. The Slot Machine (M.6): Like a filing cabinet with 64 drawers. The AI picks the best drawer to write in and overwrites old stuff if the cabinet is full.
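To make a couple of these concrete, here is an illustrative NumPy sketch of the "Prefix" (M.1) and "Slot Machine" (M.6) ideas. The function names, sizes, and the similarity-based write rule are assumptions for teaching purposes, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8    # hidden size (tiny, for illustration)
n_slots = 4    # memory slots ("drawers" in the cabinet)

# The persistent memory bank: a small matrix of latent vectors.
memory = rng.normal(size=(n_slots, d_model))

# --- M.1, "The Prefix": stick the memory in front of the input ---
def prefix_read(memory, token_states):
    """Concatenate memory slots before the token states, so the frozen
    model attends over [memory; tokens] as one sequence."""
    return np.concatenate([memory, token_states], axis=0)

# --- M.6, "The Slot Machine": write into the best-matching drawer ---
def slot_write(memory, thought, lr=0.5):
    """Pick the slot most similar to the new latent 'thought' and blend
    the thought into it, gradually overwriting the old contents."""
    scores = memory @ thought              # similarity to each slot
    idx = int(np.argmax(scores))           # the best drawer
    memory[idx] = (1 - lr) * memory[idx] + lr * thought
    return idx

tokens = rng.normal(size=(5, d_model))     # 5 "current question" states
seq = prefix_read(memory, tokens)
print(seq.shape)                           # (9, 8): 4 memory + 5 tokens

thought = rng.normal(size=d_model)
written = slot_write(memory, thought)
print("wrote to slot", written)
```

The other four methods differ mainly in *where* the memory plugs in (extra attention keys, a second cross-attention stream, a similarity web, a relevance gate), but they all read and write the same kind of latent vectors.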

The Results: Size Matters

The researchers tested these six methods with two different sizes of "memory cabinets":

  • Small Cabinet (1x): Only 64 slots.
  • Big Cabinet (10x): 640 slots.

The Shocking Discovery:

  • At the Small Size: Three of the six methods completely failed. They were like trying to hold water in a sieve; the memory just leaked out. The AI forgot everything almost immediately.
  • At the Big Size: All six methods worked! The AI could remember facts from 300 turns ago.

The Winners:

  • At Small Size: The "Parallel Stream" (M.2) and "Slot Machine" (M.6) were the champions. They were efficient enough to work with limited space.
  • At Big Size: The "Associative Net" (M.4) became the strongest. When given enough room, the method that connects ideas like a spiderweb was the best at remembering.

Why This Is a Big Deal

  1. It's "Conversational Learning": Usually, AI needs to be retrained from scratch to learn new things. Here, the AI learns while you talk to it. You tell it your name in Session 1, and it remembers it in Session 10 without needing a massive context window (a huge text box).
  2. It's Efficient: The "brain" (the main AI) stays frozen and unchanged. Only the tiny memory adapter is trained. This means you can take any existing AI and give it a memory upgrade without rebuilding the whole thing.
  3. It's Scalable: Because the memory is just a small array of numbers (not a giant text file), you can make the memory bank huge (millions of slots) without slowing down the AI.
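Here is a rough NumPy sketch of why that scaling claim holds: reading an associative memory (in the spirit of M.4) is a single matrix multiply, no matter how many slots the bank has. The sizes and the top-k softmax read are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_slots = 64, 100_000   # the memory is just a big float array

memory = rng.normal(size=(n_slots, d_model)).astype(np.float32)

def associative_read(memory, query, k=4):
    """Soft lookup: score every slot against the query in one matrix
    multiply, then return a weighted mix of the top-k matches."""
    scores = memory @ query                   # (n_slots,) similarities
    top = np.argpartition(scores, -k)[-k:]    # indices of the best k slots
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                  # softmax over the top-k
    return weights @ memory[top]              # (d_model,) read-out vector

query = rng.normal(size=d_model).astype(np.float32)
readout = associative_read(memory, query)
print(readout.shape)   # (64,)
```

Compare this with text-based retrieval, which has to store, index, and search documents: here the "search" is plain linear algebra over an array that lives inside the model's forward pass.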

The Bottom Line

This paper is a proof-of-concept. It's like building a prototype car engine in a garage to prove that a new type of fuel works. The results aren't perfect yet (the AI only remembered about 10-12% of the facts perfectly), but it proves the concept is possible.

The authors argue that if we take this same idea, use a much bigger AI, and give it a memory bank the size of a library, we could create AI assistants that truly "learn" from every conversation they have, just like humans do.

In short: They figured out how to give a forgetful AI a permanent, internal memory chip that updates itself in real-time, proving that even a "frozen" brain can learn to remember.
