Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

This paper uses a Lie-algebraic control perspective to show that increasing the depth of parallelizable sequence models reduces approximation error exponentially, by expanding their expressivity through a tower of Lie algebra extensions; the finding is validated by experiments on symbolic and continuous state-tracking tasks.

Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo Sabatini

Published Mon, 09 Ma

Here is an explanation of the paper "Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View" using simple language and creative analogies.

The Big Picture: The "Order" Problem

Imagine you are giving a robot a set of instructions to build a Lego castle.

  • The Robot's Superpower: This robot is incredibly fast because it can look at all the Lego bricks at once and decide what to do with them simultaneously (this is called parallelism).
  • The Robot's Weakness: The robot has a strange rule: it doesn't care about the order in which you hand it the bricks. If you say "Put the red brick on top, then the blue one," it treats it the same as "Put the blue one on top, then the red one."

In the real world, order matters. In math, language, and physics, doing Action A then Action B is often very different from doing B then A.

  • Example: Putting on your socks then your shoes works. Putting on your shoes then your socks is a disaster.

The paper asks: If a robot ignores order, how bad does it get at tasks where order is crucial? And can we fix it by making the robot "deeper" (giving it more layers of thinking)?
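The socks-and-shoes point can be made concrete with plain function composition: applying the same two operations in different orders produces different results. This is an illustrative sketch only; the paper's actual setting uses matrix-valued state updates, not lists.

```python
# Order-sensitivity in plain function composition: the same two steps,
# applied in different orders, give different outcomes.

def put_on_socks(state):
    return state + ["socks"]

def put_on_shoes(state):
    return state + ["shoes"]

a = put_on_shoes(put_on_socks([]))  # socks first, then shoes
b = put_on_socks(put_on_shoes([]))  # shoes first, then socks

print(a)  # ['socks', 'shoes']
print(b)  # ['shoes', 'socks']
assert a != b  # the two orders do not commute
```

A model that treats the two orders as interchangeable is forced to give the same answer for both, which is exactly the failure mode the paper studies.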


The Core Idea: The "Lie Algebra" Compass

The authors use a branch of math called Lie algebra theory to measure exactly how much order matters.

The Analogy: The Hiking Trip
Imagine you are hiking in a forest.

  1. Action A: Walk 1 mile North.
  2. Action B: Walk 1 mile East.
  • If the world is flat and simple (Abelian/Commutative): It doesn't matter if you go North then East, or East then North. You end up at the exact same spot.
  • If the world is curved or complex (Non-Abelian): If you go North then East, you end up at Spot X. If you go East then North, you end up at Spot Y. The difference between Spot X and Spot Y is the "Order Error."

The paper uses Lie Algebra to measure this "Spot X vs. Spot Y" gap. It turns out that for complex tasks (like solving a Rubik's cube or understanding a sentence), this gap is huge.
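In matrix language, the "Spot X vs. Spot Y" gap has a standard name: the commutator [A, B] = AB − BA, which is zero exactly when order doesn't matter. The matrices below are my own toy examples, not the paper's; they just show a non-Abelian pair next to an Abelian one.

```python
import numpy as np

# The commutator [A, B] = AB - BA measures how much order matters.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0, 0.0],
              [1.0, 0.0]])

commutator = A @ B - B @ A
print(commutator)          # nonzero => non-Abelian: order matters

# A commutative ("Abelian") pair: two diagonal matrices.
D1 = np.diag([1.0, 2.0])
D2 = np.diag([3.0, 4.0])
print(D1 @ D2 - D2 @ D1)   # all zeros => order doesn't matter
```

The larger the commutator, the bigger the gap between "North then East" and "East then North."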


The Solution: Depth is the "Undo Button"

The paper's main discovery is that Depth (adding more layers to the neural network) acts like a way to fix this order error.

The Analogy: The Tower of Translators
Imagine you need to translate a complex sentence from a language where word order changes the meaning (like Japanese) into English.

  • Shallow Model (1 Layer): It tries to translate the whole sentence at once, ignoring order. It gets the meaning wrong.
  • Deep Model (Many Layers): Instead of doing it all at once, it breaks the sentence down.
    • Layer 1 handles simple, order-independent chunks.
    • Layer 2 looks at how Layer 1's chunks interact.
    • Layer 3 fixes the mistakes Layer 2 made regarding order.

The authors prove mathematically that every time you add a layer, you reduce the "Order Error" exponentially.

  • 1 layer = Big error.
  • 2 layers = Tiny error.
  • 4 layers = Almost perfect.

It's like climbing a tower. The higher you go (the deeper the model), the better you can see the "shape" of the problem and correct the mistakes caused by ignoring the order of events.
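A numerical sketch of the intuition behind this (via the Baker-Campbell-Hausdorff expansion, which underlies this style of argument): exp(A)exp(B) equals exp(A + B + [A,B]/2 + …), so each extra correction term, loosely analogous to an extra layer of structure, shrinks the order error. The helper `expm` and the matrices are my own illustration, not the paper's construction.

```python
import numpy as np

def expm(M, terms=30):
    # Matrix exponential via truncated power series (fine for small M).
    out = np.eye(len(M))
    acc = np.eye(len(M))
    for k in range(1, terms):
        acc = acc @ M / k
        out = out + acc
    return out

A = 0.1 * np.array([[0.0, 1.0], [0.0, 0.0]])
B = 0.1 * np.array([[0.0, 0.0], [1.0, 0.0]])

target = expm(A) @ expm(B)                       # true ordered composition
naive  = expm(A + B)                             # order ignored entirely
bch1   = expm(A + B + 0.5 * (A @ B - B @ A))     # first bracket correction

err_naive = np.abs(target - naive).max()
err_bch1  = np.abs(target - bch1).max()
print(err_naive, err_bch1)
assert err_bch1 < err_naive  # adding the bracket term cuts the error
```

Each further bracket term knocks the residual error down by another order in the step size, mirroring the exponential-in-depth error reduction the paper proves.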


The Experiments: Testing the Theory

The researchers tested this on two types of problems:

  1. Symbolic Puzzles (Word Problems):
    • They gave models puzzles involving groups of symbols (like a digital Rubik's cube).
    • Result: Shallow models failed miserably on complex puzzles. But as they added layers, the models suddenly got much better, exactly as the math predicted.
  2. Physical Rotation (3D Space):
    • They asked models to predict how a 3D object rotates when you spin it in different orders.
    • Result: Again, shallow models were confused. Deep models learned to track the rotation, showing that depth allows the model to understand the physics of order.
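The rotation task probes exactly the non-commutativity a shallow model cannot represent: rotating 90° about the x-axis then the z-axis lands an object somewhere different than z then x. The angles, axes, and test vector below are my own illustrative choices.

```python
import numpy as np

def rot_x(t):
    # Rotation by angle t about the x-axis.
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_z(t):
    # Rotation by angle t about the z-axis.
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

t = np.pi / 2
v = np.array([1.0, 0.0, 0.0])

xz = rot_z(t) @ rot_x(t) @ v   # rotate about x first, then z
zx = rot_x(t) @ rot_z(t) @ v   # rotate about z first, then x
print(xz, zx)                  # two different final orientations
assert not np.allclose(xz, zx)
```

A model that ignores the order of the two spins must output the same prediction for both sequences, and so must be wrong on at least one of them.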

The Catch: "Learnability" vs. "Expressivity"

There is a small twist. The math says deep models can solve these problems perfectly. But in practice, training these deep models is hard.

  • The Analogy: It's like having a brilliant student (the deep model) who knows the answer, but is so nervous during the exam (training) that they freeze up.
  • The paper found that while deep models have the potential (expressivity) to solve these hard tasks, they sometimes struggle to learn how to do it without getting stuck.

Summary in One Sentence

Parallelizable models (like Transformers) are great at speed but bad at understanding order; however, by stacking them into deep towers, we can mathematically prove that they gain the expressive power to overcome this order-blindness and solve complex, order-sensitive problems with near-perfect accuracy.

Why This Matters

This explains why modern AI (like LLMs) works so well despite having structural limitations. It tells us that if we want to solve harder problems, we shouldn't just make the model wider (more neurons); we should make it deeper (more layers), because depth is the key to unlocking the ability to understand the sequence of events.