Why Are Linear RNNs More Parallelizable?

This paper establishes a theoretical foundation for the superior parallelizability of linear RNNs by demonstrating that they correspond to log-depth arithmetic circuits ($\mathsf{NC}^1$-complete), whereas nonlinear RNNs are inherently sequential because they can solve $\mathsf{L}$- and $\mathsf{P}$-complete problems, thereby explaining why linear variants can be efficiently parallelized like transformers while traditional nonlinear RNNs cannot.

William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal

Published 2026-03-06

Imagine you are trying to build a super-smart robot that can read a book, understand the story, and answer questions about it. To do this, the robot needs a "brain" that processes information in two main ways:

  1. Sequential Thinking: Reading one word at a time, remembering what came before, and updating its understanding step-by-step. (This is like reading a book aloud to yourself).
  2. Parallel Thinking: Looking at the whole page at once, spotting patterns instantly, and processing everything simultaneously. (This is like scanning a page with your eyes to find a specific word immediately).

For a long time, the best robot brains (called Transformers) were great at parallel thinking but struggled with deep, sequential logic. The older robot brains (called Nonlinear RNNs) were great at sequential logic but terrible at parallel thinking—they were too slow because they had to do everything one step at a time.

Recently, a new type of brain called the Linear RNN (LRNN) has emerged. It promises the best of both worlds: it can think sequentially and run in parallel. But why? And are all Linear RNNs created equal?

This paper answers those questions by looking at the "mathematical DNA" of these brains. Here is the breakdown in simple terms.

1. The Core Problem: The Speed vs. Power Trade-off

Think of Nonlinear RNNs (the old-school brains) as a master chef cooking a complex stew.

  • How they work: They taste the stew, add a spice, taste it again, add another spice, and repeat. They can make incredibly complex flavors (they are very powerful).
  • The problem: They must do this one step at a time. You cannot have 100 chefs cooking the same stew simultaneously because the second chef needs to know what the first chef tasted. This makes them slow on modern computers that love to do many things at once.
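The one-step-at-a-time constraint is easy to see in code. Here is a toy one-dimensional nonlinear RNN (an illustrative sketch, not any specific architecture from the paper): the tanh at each step needs the finished value of the previous step, so the loop cannot be split across workers.

```python
import math

def nonlinear_rnn(xs, w=0.5, u=1.0, h0=0.0):
    """Toy 1-D nonlinear RNN. The nonlinearity ties every state to the
    fully computed previous state, forcing strictly sequential evaluation."""
    h = h0
    states = []
    for x in xs:  # an inherently sequential loop: step t waits on step t-1
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states
```

Because tanh is not affine, there is no way to "pre-combine" later steps before earlier ones finish — that is the chef who must taste before adding the next spice.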

Think of Transformers (the current standard) as a team of 100 chefs, each looking at a different part of the recipe at the same time.

  • How they work: They are incredibly fast.
  • The problem: They struggle with tasks that require strict, step-by-step logic (like counting how many times a specific letter appears in a very long sentence). They are "shallow" thinkers.

Linear RNNs are the new hybrid. They try to be the master chef who can somehow get 100 chefs to help without breaking the recipe.

2. The Big Discovery: Why Linear RNNs are Fast

The authors used a branch of math called Complexity Theory (which classifies how hard problems are to solve) to prove exactly why Linear RNNs are faster.

  • The Nonlinear RNN Barrier: The paper proves that nonlinear RNNs are so powerful they can solve problems that are mathematically "impossible" to parallelize efficiently.

    • Analogy: Imagine a puzzle where every piece depends on the piece before it in a way that creates a chain reaction. If you try to solve it with 1,000 people, they all have to wait for the person before them to finish. The paper shows that nonlinear RNNs are like this puzzle. To make them fast, you'd have to settle a famous unsolved problem in computer science (P vs. NC) in a way almost nobody believes is true.
  • The Linear RNN Advantage: Linear RNNs, however, are mathematically "shallow."

    • Analogy: Imagine a line of people passing a bucket of water. In a nonlinear RNN, the person at the end has to wait for the first person to pour, then the second, then the third... one by one. In a Linear RNN, the water flows through a system of pipes that allows the whole line to move almost instantly.
    • The Result: Linear RNNs can be simulated by a computer circuit that is only slightly deeper (slightly more "steps") than a Transformer. This means they can run on parallel hardware almost as fast as Transformers, while still keeping their sequential memory.
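The "pipes" analogy can be made concrete. A linear recurrence like h_t = a_t · h_{t-1} + b_t is a chain of affine maps, and composing affine maps is associative — so a parallel prefix scan can evaluate the whole chain in logarithmic depth. A minimal sketch with toy 1-D states (illustrative names, not the paper's notation):

```python
def combine(f, g):
    """Compose two affine maps h -> a*h + b. This composition is
    associative, which is what makes the parallel scan legal."""
    (a1, b1), (a2, b2) = f, g
    return (a2 * a1, a2 * b1 + b2)

def sequential(pairs, h0=0.0):
    """The bucket brigade: one step at a time."""
    h = h0
    for a, b in pairs:
        h = a * h + b
    return h

def scan(pairs, h0=0.0):
    """Tree reduction: adjacent maps combine independently at each level,
    so the depth is O(log n) on parallel hardware."""
    while len(pairs) > 1:
        nxt = [combine(pairs[i], pairs[i + 1])
               for i in range(0, len(pairs) - 1, 2)]
        if len(pairs) % 2:
            nxt.append(pairs[-1])
        pairs = nxt
    a, b = pairs[0]
    return a * h0 + b
```

Both functions compute the same final state; the difference is that `scan` needs only logarithmically many rounds of waiting, which is exactly the "slightly deeper than a Transformer" circuit the paper describes.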

3. Not All Linear RNNs Are the Same

The paper also discovered that not all Linear RNNs are equally powerful. They split them into two camps:

  • The "Simple" Linear RNNs (PD LRNNs):

    • Analogy: These are like a train. It moves in a straight line, switching tracks occasionally. It's fast and parallelizable, but it can only solve problems that fit on a straight track.
    • Math: They are limited to a complexity class called NC1. They are great, but they hit a ceiling.
  • The "Advanced" Linear RNNs (DPLR LRNNs):

    • Analogy: These are like a high-speed maglev train with a switching network. They can still move fast in parallel, but they have a slightly more complex internal structure that allows them to handle much harder puzzles.
    • Math: They reach a higher complexity class called PNC1. They can solve problems that the "Simple" ones cannot, including complex math tasks like multiplying a long chain of matrices.
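One simplified way to see the gap (an illustration, not the paper's actual construction): a linear RNN's state is built from a product of its transition matrices. Diagonal matrices all commute, so a "Simple" diagonal-style LRNN cannot distinguish the order of its inputs, while diagonal-plus-low-rank matrices generally do not commute — the extra structure that lets DPLR models track order-dependent state.

```python
def matmul(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Diagonal transitions: order never matters.
D1, D2 = [[2, 0], [0, 3]], [[5, 0], [0, 7]]
assert matmul(D1, D2) == matmul(D2, D1)

# Diagonal-plus-low-rank transitions (here a Householder-style reflection,
# i.e. identity minus a rank-1 term, and a plain diagonal): order matters.
H1, H2 = [[0, 1], [1, 0]], [[1, 0], [0, -1]]
assert matmul(H1, H2) != matmul(H2, H1)
```

Non-commuting products are what hard state-tracking problems (like composing permutations) require, which is why the "Advanced" family can reach them.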

4. The Experiments: Theory Meets Reality

The authors didn't just do math; they built these robots and tested them on two specific puzzles:

  1. The Graph Puzzle: "If I start at point A, can I get to point Z?" (This is a classic "hard" sequential problem).
    • Result: The old Nonlinear RNNs solved it perfectly. The "Simple" Linear RNNs and Transformers failed. The "Advanced" Linear RNNs (like RWKV-7 and DeltaNet) solved it perfectly.
  2. The Matrix Puzzle: "Multiply a long list of matrices together."
    • Result: Again, the Advanced Linear RNNs and Nonlinear RNNs succeeded. The Transformers and "Simple" Linear RNNs failed.
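For intuition about what the models were actually asked to do, here are brute-force reference solvers for the two puzzles (a sketch of the tasks' ground truth, not the models or the paper's data pipeline):

```python
from collections import deque

def reachable(edges, start, goal):
    """The graph puzzle: starting at `start`, can we reach `goal`
    by following directed edges? Solved here by breadth-first search."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, frontier = {start}, deque([start])
    while frontier:
        u = frontier.popleft()
        if u == goal:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return False

def chain_product(mats):
    """The matrix puzzle: multiply a long list of 2x2 matrices in order."""
    out = [[1, 0], [0, 1]]  # start from the identity
    for M in mats:
        out = [[sum(out[i][k] * M[k][j] for k in range(2)) for j in range(2)]
               for i in range(2)]
    return out
```

Both tasks share the same flavor: the answer depends on composing a long chain of steps in order, which is precisely where parallel-but-shallow models run out of depth.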

The Takeaway

This paper gives us a roadmap for building the next generation of AI.

  • If you want speed: Use Linear RNNs. They are nearly as parallelizable as Transformers.
  • If you want smarts: You need the "Advanced" Linear RNNs (like DeltaNet or RWKV-7). They are powerful enough to solve hard logical puzzles that Transformers can't, without sacrificing the speed.
  • The Warning: If you try to make a Linear RNN too complex (add too many "nonlinear" twists), you break its ability to run in parallel, and it becomes slow again.

In short: Linear RNNs are the "Goldilocks" of AI architecture. They are just complex enough to be smart, but just simple enough to be fast. And within that family, the "Advanced" versions are the ones that can do the heavy lifting.