Length Generalization Bounds for Transformers

This paper resolves the open problem of computable length generalization bounds for transformers by proving that such bounds are non-computable for C-RASP (and thus for general transformers) with just two layers, while establishing optimal exponential bounds for the positive fragment of C-RASP and for fixed-precision transformers.

Andy Yang, Pascal Bergsträßer, Georg Zetzsche, David Chiang, Anthony W. Lin

Published 2026-03-04

The Big Question: Can AI "Learn to Swim" in Deep Water?

Imagine you are teaching a child to swim. You start them in a kiddie pool (short sentences). They learn to float and kick perfectly. The big question is: If you take them to the deep end of the ocean (long, complex sentences), will they still know how to swim?

In the world of Artificial Intelligence (specifically "Transformers," the brains behind models like ChatGPT), this is called Length Generalization. Can a model trained on short stories understand a novel?
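To make the idea concrete, here is a toy sketch (our own illustration, not a construction from the paper): a "model" that simply memorizes the correct answer for every short training string on the parity task, and therefore has nothing but a blind guess to offer on any longer input.

```python
# Toy illustration of length generalization failure (not the paper's model):
# the task is parity -- does a binary string contain an odd number of 1s?
from itertools import product

MAX_TRAIN_LEN = 4

# "Training": memorize the correct answer for every string up to length 4.
train_table = {
    "".join(bits): bits.count("1") % 2 == 1
    for n in range(MAX_TRAIN_LEN + 1)
    for bits in product("01", repeat=n)
}

def toy_model(s):
    # Perfect on every length it has seen; a fixed guess beyond that.
    return train_table.get(s, False)  # defaults to "even" when unseen

print(toy_model("111"))      # correct: three 1s is odd
print(toy_model("1111111"))  # wrong: seven 1s is odd, but this length was never trained
```

The real question the paper asks is whether a formula can tell us how large `MAX_TRAIN_LEN` must be so that nothing like the second failure can ever happen.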

For a long time, researchers hoped there was a simple rule (a "formula") that could tell us exactly how much training data we need to guarantee the AI will work on any length. If you train on 100 words, will it work on 1,000? On a million?

This paper says: "No. There is no such formula."


The Main Discovery: The "Uncomputable" Wall

The authors looked at a specific type of AI logic (called C-RASP) that acts like a blueprint for how Transformers think. They asked: Is there a mathematical limit we can calculate that guarantees the AI will never fail, no matter how long the input gets?

The Answer: No. It is mathematically impossible to calculate this limit.

The Analogy: The Infinite Maze
Imagine you are trying to find a specific exit in a maze.

  • The Good News: If the maze is small (simple logic), you can easily draw a map and say, "If you walk 50 steps, you will definitely find the exit."
  • The Bad News: The authors proved that for complex Transformers, the maze is like a hall of mirrors that keeps getting bigger the more you look at it.
    • To be 100% sure the AI understands a sentence, you might need to show it a sentence longer than the number of atoms in the universe.
  • Worse, there is no algorithm (no computer program) that can calculate how long that sentence needs to be. It's like the famous Halting Problem: the answer exists, but no program can ever compute it.

Why does this matter?
It means that even if you have a perfect AI, there is no "magic number" of training examples that guarantees it will work on long inputs. Sometimes, no matter how much you train it, it might just fail when the story gets too long.


The Silver Lining: The "Simple" Transformers

The paper isn't entirely bad news. The authors found a specific, simpler version of these AI models (called Fixed-Precision Transformers) where we can find a limit.

The Analogy: The Ruler vs. The Tape Measure

  • Standard Transformers are like a magical tape measure that can stretch infinitely but is made of a material that sometimes snaps unpredictably. You can't predict when it will break.
  • Fixed-Precision Transformers are like a rigid ruler. It has a limit to how long it can measure, but you know exactly where that limit is.

For these simpler models, the authors found the limit.

  • The Limit: To learn a rule, you need to see examples that are exponentially long.
  • What does "Exponential" mean? Each step up in the complexity of the rule multiplies the length of the examples you need. If a simple rule needs a 10-word example, the next level might need 100 words, then 1,000, then 10,000, and so on.
  • The Catch: While we can calculate this limit, the number gets so huge so fast that it's practically impossible to train the AI on data that long. It's like saying, "To learn this, you need to read every book in the library, plus every book that will ever be written."
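As a back-of-the-envelope sketch (the growth factor of 10 here is illustrative; the paper's actual bound has a different form), exponential growth in required example length looks like this:

```python
# Illustrative only: suppose each extra unit of rule complexity
# multiplies the required example length by 10 (an assumed factor,
# not the paper's exact bound).
def required_length(complexity, base=10, start=10):
    # start words for the simplest rule, times base per complexity step
    return start * base ** complexity

for k in range(6):
    print(f"complexity {k}: examples of about {required_length(k):,} words")
```

Even at modest complexity the numbers outrun any realistic training set, which is the "catch" described above.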

Why Do Transformers Struggle with Long Texts?

The paper offers a fascinating explanation for why real-world AI models often fail at long tasks (like summarizing a 100-page book).

The "Needle in a Haystack" Problem
The authors suggest that the problem isn't that the AI is "dumb." It's that the AI needs to see a "needle" (a specific pattern) in a "haystack" (a long string of text) to learn the rule.

  • If the haystack is too big, the AI might never see the needle during training.
  • Because the "safe zone" for learning is so vast (exponentially large), the AI is essentially guessing in the dark when it encounters long inputs it hasn't seen before.
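A quick simulation (again an illustrative setup of ours, not an experiment from the paper) shows why short training strings rarely contain the critical "needle":

```python
# Estimate how often a fixed needle pattern shows up in random binary
# strings of a given length. Short strings almost never contain it.
import random

random.seed(0)

NEEDLE = "1101"  # the rare pattern the model must witness to learn the rule

def needle_hit_rate(length, trials=10_000):
    # Fraction of random binary strings of this length containing NEEDLE.
    hits = 0
    for _ in range(trials):
        s = "".join(random.choice("01") for _ in range(length))
        hits += NEEDLE in s
    return hits / trials

for length in (4, 8, 32, 128):
    print(f"length {length:>3}: needle seen in {needle_hit_rate(length):.1%} of samples")
```

If training data is capped at short lengths, the model may simply never observe the pattern that distinguishes the true rule from a lookalike.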

The "Goldilocks" Zone
This explains why AI sometimes works great on short texts, okay on medium texts, and fails on long ones. It's not a bug; it's a fundamental mathematical limitation. The AI hasn't seen enough "long" examples to be sure the rules still apply.


Summary: The Takeaway

  1. No Magic Bullet: There is no simple formula to tell us how much data is enough to make an AI work on any length of text. For complex models, this limit is mathematically "uncomputable."
  2. The Cost of Safety: For simpler, more predictable models, we can calculate the limit, but it requires training on data so massive (exponentially large) that it's practically impossible to achieve.
  3. Real-World Impact: This explains why AI models are so sensitive to how they are trained. Small changes in settings (like the learning rate, or the tokenization scheme that splits text into pieces) can make the difference between an AI that understands a novel and one that gets lost after the first paragraph.

In a nutshell: We can't promise that an AI trained on short stories will automatically understand a novel. The math says the "safety net" is either non-existent or so huge it doesn't exist in practice. We have to be very careful when asking AI to handle long, complex tasks.
