Algorithmic Capture, Computational Complexity, and Inductive Bias of Infinite Transformers

This paper formally defines algorithmic capture as the ability of infinite-width transformers to generalize to arbitrary problem sizes and demonstrates that, despite their universal expressivity, their inductive bias restricts them to learning only low-complexity algorithms within the Efficient Polynomial Time Heuristic Scheme (EPTHS) class.

Orit Davidovich, Zohar Ringel

Published Fri, 13 Ma

Imagine you have a super-smart student named Transformer. This student is famous for reading millions of books and memorizing patterns. If you ask them to finish a sentence, they are usually spot on. But here's the big question: Does Transformer actually understand how to solve a problem, or are they just really good at guessing the next word based on what they've seen before?

This paper tries to answer that by testing if Transformer can truly "learn" an algorithm (like a recipe for sorting numbers or finding the shortest path) and apply it to problems of any size, even ones it has never seen before.

Here is the breakdown of their findings using some everyday analogies:

1. The "Grokking" Test: Memorization vs. Understanding

The authors define a strict test called "Algorithmic Capture."

  • The Scenario: Imagine you teach Transformer to sort a list of 10 numbers.
  • The Real Test: Can it then sort a list of 1,000 numbers, or even 1,000,000 numbers, with almost no extra practice?
  • The Result:
    • True Understanding: If it learns the logic of sorting, it can handle the huge list easily. It's like learning the rules of chess: once you know how the pieces move, the length of the game doesn't matter.
    • Statistical Guessing: If it just memorized patterns from the small list, it will get confused and fail when the list gets huge. It's like a student who memorized the answers to a 10-question quiz but fails the 1,000-question exam because they didn't learn the math.
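The test above can be sketched as a length-generalization check: fit on short inputs, then evaluate on much longer ones. In this illustrative sketch (not the paper's actual models), a correct sorting routine stands in for "true understanding" and a lookup table of memorized training examples stands in for "statistical guessing":

```python
import random

def true_understanding(xs):
    """A model that learned the sorting algorithm itself."""
    return sorted(xs)

def make_memorizer(train_set):
    """A model that only memorized (input -> output) pairs it saw."""
    table = {tuple(xs): sorted(xs) for xs in train_set}
    return lambda xs: table.get(tuple(xs))  # None = "confused" on unseen input

def length_generalizes(model, size, trials=20):
    """Check the model on inputs far larger than anything in training."""
    for _ in range(trials):
        xs = [random.randrange(10_000) for _ in range(size)]
        if model(xs) != sorted(xs):
            return False
    return True

# Train only on lists of 10 numbers, then test on lists of 1,000.
train = [[random.randrange(100) for _ in range(10)] for _ in range(1_000)]
memorizer = make_memorizer(train)

print(length_generalizes(true_understanding, size=1_000))  # True
print(length_generalizes(memorizer, size=1_000))           # False
```

The memorizer passes any quiz drawn from its table but fails the moment the list outgrows what it has seen, which is exactly the failure mode the authors' definition is built to detect.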

2. The "Brain Size" Experiment

To be fair, the researchers didn't just use a normal computer. They imagined a Transformer with infinite brain power (infinite width).

  • The Analogy: Think of a normal Transformer as a regular human brain. An "infinite" Transformer is like a supercomputer that can hold every possible thought in the universe at once.
  • The Surprise: Even with this god-like brain power, the Transformer still couldn't learn everything.

3. The "Speed Limit" on Thinking

The paper discovered that Transformers have a hidden speed limit on how complex a problem they can solve in real-time.

  • The Metaphor: Imagine a librarian (the Transformer) trying to find a book in a library.
    • Easy Tasks (Induction & Sorting): If the librarian needs to find a book based on a simple pattern (like "find the book that comes after 'A'"), they can do it quickly, even if the library grows to the size of a city. This is like O(T²) complexity (quadratic). It gets harder as the library grows, but it's manageable.
    • Hard Tasks (Shortest Path & Max Flow): If the librarian needs to find the absolute shortest path through a maze that changes every time, or calculate the maximum water flow through a complex pipe network, the time it takes explodes.
  • The Finding: The Transformer's "thinking speed" is capped. It can handle tasks that scale like a square (O(T²)) or a cube (O(T³)), but it hits a wall with anything more complex. It's like trying to run a marathon; you can run fast for a while, but eventually, your legs give out. No matter how much you "understand" the concept of running, you physically cannot run faster than your body allows.
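The gap between the two regimes can be made concrete by counting basic operations. A single attention-style pass compares every position with every other (roughly T² steps), while an all-pairs shortest-path routine in the style of Floyd-Warshall loops over every (via, src, dst) triple (roughly T³). This is just an operation-count sketch, not the paper's formal complexity argument:

```python
def attention_ops(T):
    """One attention pass: every query position attends to every key position."""
    return sum(1 for q in range(T) for k in range(T))  # ~T^2 comparisons

def shortest_path_ops(T):
    """All-pairs shortest path on T nodes (Floyd-Warshall style):
    a triple loop over (intermediate node, source, destination)."""
    return sum(1 for via in range(T) for src in range(T) for dst in range(T))  # ~T^3

for T in (10, 100):
    print(T, attention_ops(T), shortest_path_ops(T))
```

At T = 10 the counts are 100 vs 1,000; at T = 100 they are 10,000 vs 1,000,000, and the gap keeps widening. A task whose cheapest known algorithm grows faster than the model's per-pass budget is exactly where the "wall" appears.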

4. What Can It Actually Do?

The researchers tested Transformer on specific puzzles:

  • ✅ It Succeeded at:
    • Induction Heads: Finding a pattern like "If I see 'A', the next letter is 'B'." (Like a detective spotting a clue).
    • Sorting: Arranging numbers from smallest to largest.
    • Why? These tasks are like simple recipes. The Transformer can follow the steps efficiently.
  • ❌ It Failed at:
    • Shortest Path: Finding the quickest route between two points in a massive, random map.
    • Min-Cut/Max-Flow: Finding the smallest set of connections to cut so that no traffic can get through a network (which, by a classic theorem, equals the maximum traffic the network can carry).
    • Why? These tasks require a level of "global planning" that is too computationally heavy for the Transformer's architecture. It's like asking a person to solve a Rubik's cube while blindfolded; they might get lucky on a small cube, but a giant one is impossible for their specific method.
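The induction-head task the model succeeds at can be written out directly: look back for the most recent earlier occurrence of the current token and predict whatever followed it. This is a plain-Python rendering of the pattern, not the transformer's learned circuit:

```python
def induction_head_predict(tokens):
    """Predict the next token: find the most recent earlier occurrence of the
    last token, and copy the token that came right after it."""
    current = tokens[-1]
    # Scan earlier positions from newest to oldest. A transformer does this
    # in one attention pass, comparing the current token against all previous ones.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # pattern never seen before in this sequence

print(induction_head_predict(["A", "B", "C", "A"]))  # B  ("A" was followed by "B")
```

Because the lookup is a single pass of pairwise comparisons, it fits comfortably inside the O(T²) budget from the previous section.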

5. The "Lazy" vs. "Rich" Learning

The paper also looked at how the Transformer learns:

  • Lazy Learning: The Transformer just tweaks its existing knowledge slightly (like adjusting the volume on a radio). It's fast but limited.
  • Rich Learning: The Transformer rewires its brain to learn new features (like learning to play a new instrument).
  • The Conclusion: Even when the Transformer "rewires" its brain to be smarter, it still hits the same speed limit. It can't magically become a super-computer that solves impossible math problems instantly.
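One common way to formalize the lazy regime (an assumption here; the paper's setup may differ in detail) is that the weights barely move from initialization, so the network behaves like its first-order Taylor expansion around the starting weights. "Rich" learning moves the weights far enough that this linearization breaks down. A toy one-parameter sketch:

```python
import math

def f(w, x):
    """A toy one-parameter 'network'."""
    return math.tanh(w * x)

def f_lazy(w, w0, x):
    """Lazy regime: linearize f around the initial weight w0.
    Only the coefficient on a fixed feature (the gradient at init) changes."""
    grad = (1 - math.tanh(w0 * x) ** 2) * x  # df/dw evaluated at w0
    return f(w0, x) + grad * (w - w0)

w0, x = 0.5, 1.0
small_step, big_step = w0 + 0.01, w0 + 2.0

# Tiny weight change: the linearization is nearly exact (lazy learning).
print(abs(f(small_step, x) - f_lazy(small_step, w0, x)))

# Large weight change: the linearization breaks down (rich / feature learning).
print(abs(f(big_step, x) - f_lazy(big_step, w0, x)))
```

The first gap is vanishingly small and the second is large, which is the "tweaking the volume" versus "rewiring" distinction. The paper's point is that even the rewired (rich) regime cannot escape the complexity cap above.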

The Big Takeaway

This paper tells us that Large Language Models (LLMs) are not magic.

They are incredibly good at pattern matching and can learn simple algorithms (like sorting or copying). However, they have a fundamental inductive bias (a built-in preference) for simple, low-complexity solutions. They are not designed to be universal problem solvers for every type of math or logic puzzle.

In short: Transformer is a brilliant librarian who can organize books and find simple patterns, but if you ask it to solve a complex, multi-layered maze in real-time, it will hit a wall. It's not that it doesn't "try"; it's that its brain architecture has a built-in speed limit.