The Big Question
Imagine you are trying to teach a robot to understand a story. The most advanced robots today (like Transformers) use a super-powerful "flashback" ability: they can look back at any specific word in the story and decide exactly how important it is for the current sentence.
But what if we gave the robot a much simpler, cheaper tool? What if we just told it to remember the "average" feeling of the last few words, fading out as they get older? This is called an Exponential Moving Average (EMA). It's like a blurry memory that forgets the details but keeps the general vibe.
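To make the "blurry memory" concrete, here is a minimal sketch of an EMA over word vectors. The toy 2-d embeddings and the `decay` value are made up for illustration; they are not the paper's actual setup.

```python
import numpy as np

def ema_memory(embeddings, decay=0.5):
    """Blend each new word vector into a running average.

    Older words fade geometrically: a word seen t steps ago
    contributes with weight (1 - decay) * decay**t.
    """
    state = np.zeros_like(embeddings[0], dtype=float)
    for x in embeddings:
        state = decay * state + (1 - decay) * x
    return state

# Toy 2-d "word embeddings" (invented for illustration).
words = [np.array([1.0, 0.0]),   # "the"
         np.array([0.0, 1.0]),   # "cat"
         np.array([1.0, 1.0])]   # "runs"
print(ema_memory(words))  # -> [0.625 0.75]
```

Notice the state is a single vector no matter how long the story gets: cheap to store, but every word is squashed into the same blend.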
The authors of this paper asked: "Is this simple, blurry memory good enough? Or do we absolutely need the super-powerful flashback?"
To find out, they built two different robots to test the limits of this "blurry memory."
Experiment 1: The Grammar Detective (The Small Robot)
The Setup:
They built a small robot (called SPCN) that only used this blurry memory. It had no ability to look up specific words. It just kept a running average of what it had seen.
The Test:
They gave it sentences like:
- "The big cat chases the small dog."
- "The big car chases the small bus."
They asked the robot: "Who is doing the chasing?" (The Agent) and "Who is being chased?" (The Patient).
The Result:
Surprisingly, the robot was amazing at this. It got 96% of the answers right, even though it had never been taught grammar rules and didn't know the specific words "cat" or "dog."
The Analogy:
Think of the blurry memory like a foggy window.
- If you look through a foggy window, you can't see the specific license plate of a car passing by (the word identity).
- But, you can clearly see the shape of the car, the direction it's moving, and the pattern of traffic (the structure).
- Because grammar is about patterns (Noun -> Verb -> Noun), the foggy window was perfect. It preserved the rhythm of the sentence while washing out the specific words.
Takeaway: Simple averaging is great for understanding structure (the skeleton of a sentence).
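The "foggy window" intuition can be checked with a toy simulation. Here every word vector is a shared part-of-speech direction plus a small word-specific wobble (all made up for illustration). Swapping "cat" for "car" barely moves the EMA trajectory, while scrambling the word order moves it a lot:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: a part-of-speech direction
# plus a small word-specific component.
NOUN, VERB, DET = np.eye(3)

def word(pos_vec):
    return pos_vec + 0.1 * rng.standard_normal(3)

def ema_trajectory(seq, decay=0.7):
    state, traj = np.zeros(3), []
    for x in seq:
        state = decay * state + (1 - decay) * x
        traj.append(state.copy())
    return np.array(traj)

# "The cat chases the dog" vs "The car chases the bus":
# same Det-Noun-Verb-Det-Noun skeleton, different nouns.
s1 = [word(DET), word(NOUN), word(VERB), word(DET), word(NOUN)]
s2 = [word(DET), word(NOUN), word(VERB), word(DET), word(NOUN)]
# Same words as s1, scrambled order (broken structure).
s3 = [s1[1], s1[4], s1[0], s1[2], s1[3]]

same_structure = np.linalg.norm(ema_trajectory(s1) - ema_trajectory(s2))
diff_structure = np.linalg.norm(ema_trajectory(s1) - ema_trajectory(s3))
print(same_structure < diff_structure)  # the skeleton dominates the trajectory
```

The average washes out the 0.1-scale word wobble but faithfully tracks the large part-of-speech pattern, which is exactly what a grammar detective needs.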
Experiment 2: The Storyteller (The Big Robot)
The Setup:
Next, they built a much bigger robot (called SPEN) with 130 million parameters (similar in size to GPT-2). This robot also only used the blurry memory. They removed all the "flashback" powers.
The Test:
They asked it to play a game of "Next Word Prediction."
- Input: "The elephant walked into the..."
- Target: "jungle" (or "room," "zoo," etc.)
The Result:
The robot failed miserably. It was 8 times worse than a standard model. It couldn't guess the next word accurately.
The Analogy:
Imagine you are trying to guess the ending of a mystery novel, but your memory is a smoothie.
- You put "elephant," "walked," "into," and "the" into a blender.
- The blender mixes them into a single, brown liquid.
- Now, you have to guess the next word based only on that brown liquid.
- The liquid tells you something happened, but it has destroyed the fact that an elephant was the subject. It's indistinguishable from a "dog" or a "car" in the mix.
- Without knowing the specific word "elephant," you can't guess "jungle." You might guess "bathroom" or "ocean" just as easily.
Takeaway: Simple averaging destroys content (the specific details). You cannot predict the next word if you've lost the identity of the previous words.
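The smoothie effect is easy to quantify in a toy example (one-hot vectors over an invented five-word vocabulary, not the paper's model). After four more words, "elephant" and "dog" leave almost identical fingerprints in the EMA state:

```python
import numpy as np

# Hypothetical one-hot "embeddings" over a tiny vocabulary.
vocab = {w: i for i, w in enumerate(
    ["the", "elephant", "dog", "walked", "into"])}

def embed(w):
    v = np.zeros(len(vocab))
    v[vocab[w]] = 1.0
    return v

def ema(seq, decay=0.5):
    state = np.zeros(len(vocab))
    for w in seq:
        state = decay * state + (1 - decay) * embed(w)
    return state

a = ema("the elephant walked into the".split())
b = ema("the dog walked into the".split())

# The subject appeared 4 words ago, so its surviving weight is
# (1 - decay) * decay**3 = 0.0625 -- nearly washed out.
print(np.linalg.norm(a - b))  # -> ~0.088
```

The two states differ by less than a tenth of a unit, so any predictor reading them can barely tell an elephant story from a dog story.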
The "Smoking Gun" Test
To prove that the "blurry memory" was the problem and not the robot's brain, they did a clever experiment:
- They took the blurry memory from the failing robot.
- They plugged it into a super-intelligent brain (a standard Transformer with full "flashback" powers).
- Result: The super-brain still failed.
Why?
It's like giving a master chef (the super-brain) a bowl of mush (the blurry memory). No matter how talented the chef is, they cannot turn mush back into a whole apple. The information was lost before the chef ever saw it.
The paper calls this the "Data Processing Inequality": If you throw away the details early on, no amount of smart thinking later can get them back.
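A toy illustration of the inequality (an extreme, made-up compressor, much cruder than an EMA): once two different inputs collapse to the same summary, no downstream function, however clever, can tell them apart.

```python
def lossy(words):
    """Extreme 'blender': keep only the word count."""
    return len(words)

def clever_decoder(summary):
    # Any function of the summary must answer identically for
    # inputs the summary has already merged together.
    return f"a {summary}-word sentence about something"

s1 = ["the", "elephant", "walked"]
s2 = ["the", "dog", "napped"]

assert lossy(s1) == lossy(s2)                        # details gone
assert clever_decoder(lossy(s1)) == clever_decoder(lossy(s2))
```

This is why plugging a Transformer in after the EMA cannot help: the chef only ever sees the mush.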
The Final Verdict: Structure vs. Content
The paper draws a sharp line in the sand:
- Structure (The Skeleton): If you just need to know the order of things (e.g., "A verb usually comes after a noun"), a simple, blurry average works great. It's cheap, efficient, and biologically plausible (our brains do something similar).
- Content (The Flesh): If you need to know exactly what happened (e.g., "The elephant walked"), you cannot use a simple average. You need a mechanism that can grab specific details and hold onto them.
The "Depth" Connection
The authors also noticed something cool: This problem isn't just about time (remembering the past). It's also about depth (how deep a neural network goes).
- If you just stack layers on top of each other without a smart way to pass information, the early layers get "diluted" and forgotten, just like the blurry memory.
- The solution in both cases is the same: Don't just average; select. You need a mechanism that says, "This specific piece of information is important, keep it!" (This is what "Gating" or "Attention" does).
Summary
- Simple Averaging (EMA) is like a foggy window: Great for seeing shapes and patterns, terrible for reading text.
- Advanced Attention is like a high-definition camera: Essential for reading, remembering details, and predicting the future.
- The Lesson: You can't build a smart language model with just a foggy window. You need the camera to see the details, even if the window helps you understand the big picture.