When Less is More: The LLM Scaling Paradox in Context Compression

This paper identifies a "Size-Fidelity Paradox" in context compression: scaling up model parameters paradoxically reduces reconstruction faithfulness, through knowledge overwriting and semantic drift. The culprit is not parameter count itself but the excess generative capacity and entropy that come with it, which undermine the accurate preservation of compressed contexts.

Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi

Published 2026-02-27
📖 5 min read · 🧠 Deep dive

The Big Idea: Bigger Isn't Always Better

For a long time, the rule of thumb in Artificial Intelligence has been: "Bigger is better." The assumption is that if you make a brain (a Large Language Model) bigger, it will be smarter, remember more, and do everything perfectly.

But this paper discovers a weird glitch in that rule, specifically when we try to compress information.

Imagine you have a long, detailed story about a blue-banded bee. You want to shrink this story down into a tiny "summary token" so a computer can remember it later and tell the story again.

  • The Expectation: You'd think a super-smart, giant AI (90 Billion parameters) would be the best at shrinking the story without losing details.
  • The Reality: The giant AI actually messes up more than a smaller, "lite" AI. It forgets the specific details and replaces them with what it thinks should be there.

The authors call this the Size-Fidelity Paradox: As the model gets bigger, it gets less faithful to the original text.
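
To make "faithfulness" concrete, here is a tiny, self-contained Python sketch of how one could score a reconstruction against the original text using token overlap. This only illustrates the idea of a fidelity score; the paper's actual metrics are not reproduced here, and the function name and example strings below are made up.

```python
def token_f1(original: str, reconstruction: str) -> float:
    """Toy faithfulness score: F1 over word overlap between the
    original text and the model's reconstruction."""
    ref, hyp = original.lower().split(), reconstruction.lower().split()
    common = sum(min(ref.count(t), hyp.count(t)) for t in set(hyp))
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)

original = "the blue-banded bee vibrated to shake pollen off the flower"
small_model = "the blue-banded bee vibrated to shake pollen off the flower"  # copies verbatim
big_model = "the honey bee collected pollen from the flower"                 # "corrects" the details

print(token_f1(original, small_model))  # 1.0   -> perfectly faithful
print(token_f1(original, big_model))    # ~0.56 -> specific details overwritten
```

A higher score means the reconstruction preserved more of the original wording; the paradox is that the bigger model tends to land lower on scores like this, not higher.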


The Two Ways Big Models Fail

The paper identifies two specific ways these giant models ruin the compression. Let's use the analogy of an art student trying to copy a painting.

1. Knowledge Overwriting (The "Autocorrect" Problem)

  • What happens: The big model sees a specific fact in the text (e.g., "The blue-banded bee") but decides, "Wait, I know bees are usually honey bees. I'll fix that for you."
  • The Analogy: Imagine you ask a very confident art student to copy a painting of a purple horse. Because the student has studied so many pictures of real horses, they subconsciously think, "Horses aren't purple; they are brown." So, they paint a brown horse. In their mind they didn't make a mistake; they just "corrected" your input based on their own memory.
  • The Result: The big model overwrites your specific facts with its own general knowledge.

2. Semantic Drift (The "Paraphrase" Problem)

  • What happens: The model keeps the general vibe of the story but changes the relationships between things.
  • The Analogy: You tell the student, "Alice hit Bob." A harmless paraphrase would be "Bob got hit by Alice," which keeps the meaning intact. Semantic drift is when the student instead writes "Bob hit Alice," quietly swapping who did what to whom. Or the student might say, "The flower vibrated to shake pollen onto the bee," when the original text said, "The bee vibrated to shake pollen."
  • The Result: The big model gets so good at writing fluent sentences that it starts rewriting the story in its own style, accidentally swapping who did what to whom.

Why Does This Happen? (The "Why" Behind the Glitch)

The paper digs into the "brain" of the model to find the cause. It turns out the problem isn't the size of the brain, but how that size changes the brain's behavior.

1. Too Much "Creative Space" (Semantic Capacity)

  • The Concept: Small models are like a tight, narrow hallway. They have to stick to the path. Big models have a massive, open warehouse.
  • The Analogy: If you are in a narrow hallway, you have to walk exactly where the walls tell you. If you are in a giant warehouse, you have so much freedom to move around that you might accidentally wander off the path and pick up random items (your own prior knowledge) instead of carrying the specific item you were told to carry.
  • The Science: The paper quantifies this with Effective Rank. Bigger models spread information out over a huge space, making it easier for their own internal knowledge to "invade" and replace the specific facts you gave them; a small sketch of how effective rank is computed follows below.
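
For the curious, effective rank has a standard mathematical definition: take the singular values of a matrix of hidden states, normalize them into a probability distribution, and exponentiate its Shannon entropy. Here is a minimal sketch on toy data; which hidden-state matrices the paper actually analyzes is not reproduced here.

```python
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """Effective rank of a hidden-state matrix H (tokens x dimensions):
    exp of the Shannon entropy of the normalized singular-value spectrum."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # ignore zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# "Narrow hallway": activity confined to a few directions (low-rank by construction).
narrow = rng.normal(size=(128, 4)) @ rng.normal(size=(4, 512))
# "Giant warehouse": activity spread across many directions.
warehouse = rng.normal(size=(128, 512))

print(effective_rank(narrow))     # close to 4: information packed into few directions
print(effective_rank(warehouse))  # much larger: information smeared over many directions
```

The intuition matches the analogy: the larger the effective rank, the more "room" the representation has for the model's own knowledge to leak in alongside the facts it was asked to carry.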

2. The "Confidence Trap" (Generative Uncertainty)

  • The Concept: Big models are so good at predicting the next word that they get too confident in their own ideas.
  • The Analogy: Imagine a small model is a nervous scribe who is afraid to make a mistake, so they copy every letter exactly. A giant model is a confident author who thinks, "I know how this story goes. I'll just write it my way."
  • The Science: This is measured by Entropy (uncertainty). Surprisingly, the biggest models have higher uncertainty when trying to copy text exactly. They see many "plausible" ways to say something, so they pick the one that sounds best to them, rather than the one that matches the original text. They prefer creativity over accuracy; a small sketch of the entropy calculation follows below.
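
Entropy here also has a precise meaning: the Shannon entropy of the model's next-token probability distribution. A minimal sketch, using made-up logits rather than anything from the paper, shows why "many plausible continuations" translates into a high number:

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the next-token distribution
    obtained by softmaxing a model's logits."""
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]                       # guard against log(0)
    return float(-(p * np.log(p)).sum())

# "Nervous scribe": one token dominates -> entropy near zero.
scribe_logits = np.array([10.0, 0.0, 0.0, 0.0])
# "Confident author": several paraphrases look almost equally good -> high entropy.
author_logits = np.array([3.0, 2.8, 2.7, 2.6])

print(token_entropy(scribe_logits))  # ~0.0 nats: essentially one choice
print(token_entropy(author_logits))  # ~1.38 nats: close to the maximum of ln(4)
```

The paper's observation is that, when asked to reproduce a compressed context, the larger models sit closer to the "confident author" end of this scale, which is exactly what lets them wander away from the original wording.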

The Solution?

The paper suggests that for tasks where exactness matters (like compressing legal documents, medical records, or specific facts), we shouldn't just keep making models bigger.

  • Small models are actually better at being "photocopiers" because they are forced to stick to the source text.
  • Big models are better at being "artists" because they have the capacity to create new things, but that makes them bad at copying things exactly.

The Takeaway

If you want a machine to remember your story exactly as you told it, don't give it the biggest brain possible. Sometimes, a smaller, more focused brain is the only one that will listen to you without trying to "fix" your story.

In short: In the world of context compression, Less (parameters) is More (faithfulness).
