Length Generalization Bounds for Transformers

This paper resolves the open problem of computable length generalization bounds for transformers by proving that such bounds are non-computable for C-RASP (and thus for general transformers) with just two layers, while establishing optimal exponential bounds for the positive fragment of C-RASP and for fixed-precision transformers.

Andy Yang, Pascal Bergsträßer, Georg Zetzsche, David Chiang, Anthony W. Lin

Published 2026-03-04

The Big Question: Can AI "Learn to Swim" in Deep Water?

Imagine you are teaching a child to swim. You start them in a kiddie pool (short sentences). They learn to float and kick perfectly. The big question is: If you take them to the deep end of the ocean (long, complex sentences), will they still know how to swim?

In the world of Artificial Intelligence (specifically "Transformers," the brains behind models like ChatGPT), this is called Length Generalization. Can a model trained on short stories understand a novel?
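To make the idea concrete, here is a toy sketch (our own illustration, not a construction from the paper): a "model" that simply memorizes the correct answer for every short training string on the parity task, and therefore has nothing but a blind guess to offer on any longer input.

```python
# Toy illustration of length generalization failure (not the paper's model):
# the task is parity -- does a binary string contain an odd number of 1s?
from itertools import product

MAX_TRAIN_LEN = 4

# "Training": memorize the correct answer for every string up to length 4.
train_table = {
    "".join(bits): bits.count("1") % 2 == 1
    for n in range(MAX_TRAIN_LEN + 1)
    for bits in product("01", repeat=n)
}

def toy_model(s):
    # Perfect on every length it has seen; a fixed guess beyond that.
    return train_table.get(s, False)  # defaults to "even" when unseen

print(toy_model("111"))      # correct: three 1s is odd
print(toy_model("1111111"))  # wrong: seven 1s is odd, but this length was never trained
```

The real question the paper asks is whether a formula can tell us how large `MAX_TRAIN_LEN` must be so that nothing like the second failure can ever happen.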

For a long time, researchers hoped there was a simple rule (a "formula") that could tell us exactly how much training data we need to guarantee the AI will work on any length. If you train on 100 words, will it work on 1,000? On a million?

This paper says: "No. There is no such formula."


The Main Discovery: The "Uncomputable" Wall

The authors looked at a specific type of AI logic (called C-RASP) that acts like a blueprint for how Transformers think. They asked: Is there a mathematical limit we can calculate that guarantees the AI will never fail, no matter how long the input gets?

The Answer: No. It is mathematically impossible to calculate this limit.

The Analogy: The Infinite Maze
Imagine you are trying to find a specific exit in a maze.

  • The Good News: If the maze is small (simple logic), you can easily draw a map and say, "If you walk 50 steps, you will definitely find the exit."
  • The Bad News: The authors proved that for complex Transformers, the maze is like a hall of mirrors that keeps getting bigger the more you look at it.
    • To be 100% sure the AI understands a sentence, you might need to show it a sentence longer than the number of atoms in the universe.
  • Worse, there is no algorithm (no computer program) that can calculate how long that sentence needs to be. It's like the famous Halting Problem: the answer exists, but no program can ever compute it.

Why does this matter?
It means that even if you have a perfect AI, there is no "magic number" of training examples that guarantees it will work on long inputs. Sometimes, no matter how much you train it, it might just fail when the story gets too long.


The Silver Lining: The "Simple" Transformers

The paper isn't entirely bad news. The authors found a specific, simpler version of these AI models (called Fixed-Precision Transformers) where we can find a limit.

The Analogy: The Ruler vs. The Tape Measure

  • Standard Transformers are like a magical tape measure that can stretch infinitely but is made of a material that sometimes snaps unpredictably. You can't predict when it will break.
  • Fixed-Precision Transformers are like a rigid ruler. It has a limit to how long it can measure, but you know exactly where that limit is.

For these simpler models, the authors found the limit.

  • The Limit: To learn a rule, you need to see examples that are exponentially long.
  • What does "Exponential" mean? Each step up in the complexity of the rule multiplies the length of the examples you need. If a simple rule needs a 10-word example, the next level might need 100 words, then 1,000, then 10,000, and so on.
  • The Catch: While we can calculate this limit, the number gets so huge so fast that it's practically impossible to train the AI on data that long. It's like saying, "To learn this, you need to read every book in the library, plus every book that will ever be written."
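As a back-of-the-envelope sketch (the growth factor of 10 here is illustrative; the paper's actual bound has a different form), exponential growth in required example length looks like this:

```python
# Illustrative only: suppose each extra unit of rule complexity
# multiplies the required example length by 10 (an assumed factor,
# not the paper's exact bound).
def required_length(complexity, base=10, start=10):
    # start words for the simplest rule, times base per complexity step
    return start * base ** complexity

for k in range(6):
    print(f"complexity {k}: examples of about {required_length(k):,} words")
```

Even at modest complexity the numbers outrun any realistic training set, which is the "catch" described above.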

Why Do Transformers Struggle with Long Texts?

The paper offers a fascinating explanation for why real-world AI models often fail at long tasks (like summarizing a 100-page book).

The "Needle in a Haystack" Problem
The authors suggest that the problem isn't that the AI is "dumb." It's that the AI needs to see a "needle" (a specific pattern) in a "haystack" (a long string of text) to learn the rule.

  • If the haystack is too big, the AI might never see the needle during training.
  • Because the "safe zone" for learning is so vast (exponentially large), the AI is essentially guessing in the dark when it encounters long inputs it hasn't seen before.
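A quick simulation (again an illustrative setup of ours, not an experiment from the paper) shows why short training strings rarely contain the critical "needle":

```python
# Estimate how often a fixed needle pattern shows up in random binary
# strings of a given length. Short strings almost never contain it.
import random

random.seed(0)

NEEDLE = "1101"  # the rare pattern the model must witness to learn the rule

def needle_hit_rate(length, trials=10_000):
    # Fraction of random binary strings of this length containing NEEDLE.
    hits = 0
    for _ in range(trials):
        s = "".join(random.choice("01") for _ in range(length))
        hits += NEEDLE in s
    return hits / trials

for length in (4, 8, 32, 128):
    print(f"length {length:>3}: needle seen in {needle_hit_rate(length):.1%} of samples")
```

If training data is capped at short lengths, the model may simply never observe the pattern that distinguishes the true rule from a lookalike.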

The "Goldilocks" Zone
This explains why AI sometimes works great on short texts, okay on medium texts, and fails on long ones. It's not a bug; it's a fundamental mathematical limitation. The AI hasn't seen enough "long" examples to be sure the rules still apply.


Summary: The Takeaway

  1. No Magic Bullet: There is no simple formula to tell us how much data is enough to make an AI work on any length of text. For complex models, this limit is mathematically "uncomputable."
  2. The Cost of Safety: For simpler, more predictable models, we can calculate the limit, but it requires training on data so massive (exponentially large) that it's practically impossible to achieve.
  3. Real-World Impact: This explains why AI models are so sensitive to how they are trained. Small changes in settings (like the learning rate, or the tokenization scheme that splits text into pieces) can make the difference between an AI that understands a novel and one that gets lost after the first paragraph.

In a nutshell: We can't promise that an AI trained on short stories will automatically understand a novel. The math says the "safety net" is either non-existent or so huge it doesn't exist in practice. We have to be very careful when asking AI to handle long, complex tasks.
