When Less is More: The LLM Scaling Paradox in Context Compression

This paper identifies a "Size-Fidelity Paradox" in context compression: scaling up model parameters paradoxically reduces reconstruction faithfulness, through knowledge overwriting and semantic drift. The culprit is not parameter count itself but the excess generative capacity and entropy that come with it, which undermine the accurate preservation of compressed contexts.

Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi

Published 2026-02-27
📖 5 min read · 🧠 Deep dive

The Big Idea: Bigger Isn't Always Better

For a long time, the rule of thumb in Artificial Intelligence has been: "Bigger is better." The assumption is that if you make a brain (a Large Language Model) bigger, it will be smarter, remember more, and do everything perfectly.

But this paper discovers a weird glitch in that rule, specifically when we try to compress information.

Imagine you have a long, detailed story about a blue-banded bee. You want to shrink this story down into a tiny "summary token" so a computer can remember it later and tell the story again.

  • The Expectation: You'd think a super-smart, giant AI (90 Billion parameters) would be the best at shrinking the story without losing details.
  • The Reality: The giant AI actually messes up more than a smaller, "lite" AI. It forgets the specific details and replaces them with what it thinks should be there.

The authors call this the Size-Fidelity Paradox: As the model gets bigger, it gets less faithful to the original text.
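
To make "faithfulness" concrete, here is a tiny, self-contained Python sketch of how one could score a reconstruction against the original text using token overlap. This only illustrates the idea of a fidelity score; the paper's actual metrics are not reproduced here, and the function name and example strings below are made up.

```python
def token_f1(original: str, reconstruction: str) -> float:
    """Toy faithfulness score: F1 over word overlap between the
    original text and the model's reconstruction."""
    ref, hyp = original.lower().split(), reconstruction.lower().split()
    common = sum(min(ref.count(t), hyp.count(t)) for t in set(hyp))
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)

original = "the blue-banded bee vibrated to shake pollen off the flower"
small_model = "the blue-banded bee vibrated to shake pollen off the flower"  # copies verbatim
big_model = "the honey bee collected pollen from the flower"                 # "corrects" the details

print(token_f1(original, small_model))  # 1.0   -> perfectly faithful
print(token_f1(original, big_model))    # ~0.56 -> specific details overwritten
```

A higher score means the reconstruction preserved more of the original wording; the paradox is that the bigger model tends to land lower on scores like this, not higher.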


The Two Ways Big Models Fail

The paper identifies two specific ways these giant models ruin the compression. Let's use the analogy of an art student trying to copy a painting.

1. Knowledge Overwriting (The "Autocorrect" Problem)

  • What happens: The big model sees a specific fact in the text (e.g., "The blue-banded bee") but decides, "Wait, I know bees are usually honey bees. I'll fix that for you."
  • The Analogy: Imagine you ask a very confident art student to copy a painting of a purple horse. Because the student has studied so many pictures of real horses, they subconsciously think, "Horses aren't purple; they are brown." So, they paint a brown horse. In their mind they didn't make a mistake; they just "corrected" your input based on their own memory.
  • The Result: The big model overwrites your specific facts with its own general knowledge.

2. Semantic Drift (The "Paraphrase" Problem)

  • What happens: The model keeps the general vibe of the story but changes the relationships between things.
  • The Analogy: You tell the student, "Alice hit Bob." A harmless paraphrase would be "Bob got hit by Alice," which keeps the meaning intact. Semantic drift is when the student instead writes "Bob hit Alice," quietly swapping who did what to whom. Or the student might say, "The flower vibrated to shake pollen onto the bee," when the original text said, "The bee vibrated to shake pollen."
  • The Result: The big model gets so good at writing fluent sentences that it starts rewriting the story in its own style, accidentally swapping who did what to whom.

Why Does This Happen? (The "Why" Behind the Glitch)

The paper digs into the "brain" of the model to find the cause. It turns out the problem isn't the size of the brain, but how that size changes the brain's behavior.

1. Too Much "Creative Space" (Semantic Capacity)

  • The Concept: Small models are like a tight, narrow hallway. They have to stick to the path. Big models have a massive, open warehouse.
  • The Analogy: If you are in a narrow hallway, you have to walk exactly where the walls tell you. If you are in a giant warehouse, you have so much freedom to move around that you might accidentally wander off the path and pick up random items (your own prior knowledge) instead of carrying the specific item you were told to carry.
  • The Science: The paper quantifies this with Effective Rank. Bigger models spread information out over a huge space, making it easier for their own internal knowledge to "invade" and replace the specific facts you gave them; a small sketch of how effective rank is computed follows below.
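
For the curious, effective rank has a standard mathematical definition: take the singular values of a matrix of hidden states, normalize them into a probability distribution, and exponentiate its Shannon entropy. Here is a minimal sketch on toy data; which hidden-state matrices the paper actually analyzes is not reproduced here.

```python
import numpy as np

def effective_rank(H: np.ndarray) -> float:
    """Effective rank of a hidden-state matrix H (tokens x dimensions):
    exp of the Shannon entropy of the normalized singular-value spectrum."""
    s = np.linalg.svd(H, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # ignore zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# "Narrow hallway": activity confined to a few directions (low-rank by construction).
narrow = rng.normal(size=(128, 4)) @ rng.normal(size=(4, 512))
# "Giant warehouse": activity spread across many directions.
warehouse = rng.normal(size=(128, 512))

print(effective_rank(narrow))     # close to 4: information packed into few directions
print(effective_rank(warehouse))  # much larger: information smeared over many directions
```

The intuition matches the analogy: the larger the effective rank, the more "room" the representation has for the model's own knowledge to leak in alongside the facts it was asked to carry.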

2. The "Confidence Trap" (Generative Uncertainty)

  • The Concept: Big models are so good at predicting the next word that they get too confident in their own ideas.
  • The Analogy: Imagine a small model is a nervous scribe who is afraid to make a mistake, so they copy every letter exactly. A giant model is a confident author who thinks, "I know how this story goes. I'll just write it my way."
  • The Science: This is measured by Entropy (uncertainty). Surprisingly, the biggest models have higher uncertainty when trying to copy text exactly. They see many "plausible" ways to say something, so they pick the one that sounds best to them, rather than the one that matches the original text. They prefer creativity over accuracy; a small sketch of the entropy calculation follows below.
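
Entropy here also has a precise meaning: the Shannon entropy of the model's next-token probability distribution. A minimal sketch, using made-up logits rather than anything from the paper, shows why "many plausible continuations" translates into a high number:

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the next-token distribution
    obtained by softmaxing a model's logits."""
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    p = p[p > 0]                       # guard against log(0)
    return float(-(p * np.log(p)).sum())

# "Nervous scribe": one token dominates -> entropy near zero.
scribe_logits = np.array([10.0, 0.0, 0.0, 0.0])
# "Confident author": several paraphrases look almost equally good -> high entropy.
author_logits = np.array([3.0, 2.8, 2.7, 2.6])

print(token_entropy(scribe_logits))  # ~0.0 nats: essentially one choice
print(token_entropy(author_logits))  # ~1.38 nats: close to the maximum of ln(4)
```

The paper's observation is that, when asked to reproduce a compressed context, the larger models sit closer to the "confident author" end of this scale, which is exactly what lets them wander away from the original wording.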

The Solution?

The paper suggests that for tasks where exactness matters (like compressing legal documents, medical records, or specific facts), we shouldn't just keep making models bigger.

  • Small models are actually better at being "photocopiers" because they are forced to stick to the source text.
  • Big models are better at being "artists" because they have the capacity to create new things, but that makes them bad at copying things exactly.

The Takeaway

If you want a machine to remember your story exactly as you told it, don't give it the biggest brain possible. Sometimes, a smaller, more focused brain is the only one that will listen to you without trying to "fix" your story.

In short: In the world of context compression, Less (parameters) is More (faithfulness).
