On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

This paper theoretically establishes that gradient descent achieves a linear convergence rate for both single-layer and multi-layer Transformers with residual connections, demonstrating that these connections mitigate the ill-conditioning caused by the softmax-induced low-rank structure, thereby enhancing optimization stability.

Original authors: Zhen Qin, Jinxin Zhou, Jiachen Jiang, Zhihui Zhu

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a very smart, but slightly clumsy, robot how to write a story. This robot is built using a special architecture called a Transformer.

In the world of AI, Transformers are like the ultimate Swiss Army knives. They power everything from chatbots (like the one you're talking to) to image generators. They are incredibly good at their job, but for a long time, scientists didn't fully understand how they learn so well. It was like watching a magician pull a rabbit out of a hat and saying, "It works, but we don't know the trick."

This paper is about uncovering that trick, specifically focusing on how the robot learns (its "training dynamics") and why a specific feature called Residual Connections is the secret sauce that keeps it from falling apart.

Here is the breakdown in simple terms:

1. The Robot's Brain: The Transformer

Think of the Transformer as a factory assembly line.

  • The Input: Raw data (like words in a sentence) comes in.
  • The Attention Mechanism: This is the "focus" module. It looks at all the words and decides which ones are important to each other. (e.g., in "The cat sat on the mat," it links "cat" to "sat").
  • The Feedforward Network: This is the "thinking" module. It processes the focused information to make sense of it.
  • The Problem: In a deep factory line, if you just pass the message from one station to the next, the message can get distorted, lost, or garbled by the time it reaches the end. During training, the same thing happens in reverse: the learning signal that flows backward through the layers can shrink to almost nothing. This is called the "vanishing gradient" problem. The robot forgets what it was supposed to learn.
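If you like seeing things in code, here is the assembly line in a few lines of NumPy. This is a toy single-head layer for illustration only (all names and sizes are made up, and it is not the exact model the paper analyzes):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """The 'focus' module: each token takes a weighted average of the others."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # who attends to whom
    return weights @ V

def feedforward(H, W1, W2):
    """The 'thinking' module: a small two-layer network applied per token."""
    return np.maximum(H @ W1, 0.0) @ W2  # ReLU hidden layer

rng = np.random.default_rng(0)
seq_len, d = 5, 8
X = rng.normal(size=(seq_len, d))  # 5 tokens, 8 features each
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(2 * d, d))

out = feedforward(attention(X, Wq, Wk, Wv), W1, W2)
print(out.shape)  # (5, 8): same shape in, same shape out
```

Because the output has the same shape as the input, many of these blocks can be stacked one after another, which is exactly where the next ingredient comes in.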

2. The Secret Weapon: Residual Connections

To fix the garbled message, engineers invented Residual Connections.

  • The Analogy: Imagine you are passing a note down a long line of people. Without a residual connection, you have to whisper the note from person to person. By the time it gets to the end, it's barely recognizable.
  • With a Residual Connection: It's like giving every person in the line a direct phone line back to the start. Even if the whisper gets messed up, they can also hear the original voice clearly. They can say, "Okay, I heard the whisper, but I'll just add the original message back in to make sure it's correct."
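The note-passing analogy can be made concrete with a quick numerical sketch (purely illustrative, not taken from the paper): each "person in the line" is a small random linear layer, and we watch what happens to the message after 20 of them.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth = 8, 20
layers = [0.1 * rng.normal(size=(d, d)) for _ in range(depth)]  # small random layers
x = np.ones(d)  # the original message

# Whisper chain: each layer REPLACES the signal with its own output.
plain = x.copy()
for W in layers:
    plain = W @ plain

# Phone line back to the start: each layer only ADDS a correction
# on top of the signal it received (x + f(x)).
residual = x.copy()
for W in layers:
    residual = residual + W @ residual

print(np.linalg.norm(plain))     # collapses toward 0: the whisper is gone
print(np.linalg.norm(residual))  # stays on the order of the original message
```

The whisper chain multiplies the signal by a small matrix 20 times, so it vanishes; the residual chain multiplies by (I + W), which stays close to the identity, so the original message survives to the end.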

This paper proves mathematically that this "direct phone line" isn't just a nice-to-have; it's essential for the robot to learn quickly and stably.

3. The Main Discovery: Why It Converges (Succeeds)

The authors used advanced math to prove two main things:

A. The Robot Learns Fast (Linear Convergence)
They showed that if you set up the robot's brain correctly at the start (proper initialization), it learns at a linear rate: the error shrinks by roughly a constant factor at every step. It's not a slow, frustrating slog; it's a smooth, predictable slide down a hill toward the solution.
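Here is what "linear convergence" looks like numerically, on a generic well-conditioned least-squares problem rather than the paper's Transformer loss (a minimal sketch, with made-up numbers):

```python
import numpy as np

# "Linear convergence" means the distance to the solution shrinks by a
# constant factor every step, so the error decays geometrically.
rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.normal(size=(4, 4))  # a well-conditioned matrix
b = rng.normal(size=4)
w_star = np.linalg.solve(A, b)  # the exact solution, used to measure error

w = np.zeros(4)
errors = []
for _ in range(30):
    grad = A.T @ (A @ w - b)  # gradient of 0.5 * ||A w - b||^2
    w = w - 0.5 * grad        # one gradient-descent step
    errors.append(np.linalg.norm(w - w_star))

ratios = [errors[i + 1] / errors[i] for i in range(29)]
print(ratios[:3], ratios[-1])  # every ratio is below 1 and settles to a constant
```

A constant error ratio per step is exactly the "steady, predictable speed" described above: on a log scale, the error traces a straight line down.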

B. The "Rank Collapse" Problem
Here is the tricky part. The "Attention" part of the robot (the focus module) has a weird quirk. Because of how it calculates importance (using something called softmax), it sometimes squashes all the information into a single, flat line.

  • The Metaphor: Imagine a colorful, 3D sculpture. The Attention mechanism sometimes accidentally flattens it into a 2D drawing. Once it's flat, it's hard to see the details, and the robot gets confused. This is called Rank Collapse. When this happens, the robot's learning grinds to a halt because the math becomes "ill-conditioned": tiny changes in the input cause wild swings in the output, so gradient descent is forced into small, unstable steps.
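You can watch the flattening happen in a toy experiment (a sketch under simplified assumptions, not the paper's setup): stack pure attention layers with no residual connection and measure how different the token rows still are from one another.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def row_spread(X):
    """How far the token rows are from all being the same row.
    Near 1 = rows point in different directions; near 0 = flattened."""
    return np.linalg.norm(X - X.mean(axis=0)) / np.linalg.norm(X)

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(6, d))  # 6 tokens
before = row_spread(X)

for _ in range(10):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # row-stochastic weights
    X = A @ (X @ Wv)             # pure attention: rows become weighted AVERAGES
    X = X / np.linalg.norm(X)    # rescale so the demo stays numerically tame
                                 # (rescaling does not change the rank)
after = row_spread(X)
print(before, after)  # the spread collapses toward 0: the sculpture went flat
```

Because each softmax row is a weighted average, every layer pulls the token rows toward each other; with nothing to push back, they pile onto (nearly) a single line.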

C. How Residuals Save the Day
The paper proves that the Residual Connection acts like a stabilizer. Even if the Attention mechanism tries to flatten the sculpture, the Residual Connection adds the original 3D shape back in.

  • The Result: The math stays "well-conditioned." The numbers stay healthy. The robot doesn't get stuck. It keeps moving forward.
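A tiny linear-algebra sketch shows why adding the original back in helps (toy numbers, not the paper's analysis): even if the attention update is completely "flat" (rank 1), adding the identity from the residual path restores full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Worst case: the layer's update is a rank-1 (fully flattened) matrix.
u = rng.normal(size=(n, 1))
v = rng.normal(size=(1, n))
A = 0.3 * (u / np.linalg.norm(u)) @ (v / np.linalg.norm(v))  # rank 1, norm 0.3

print(np.linalg.matrix_rank(A))              # 1: the update alone is flat
print(np.linalg.matrix_rank(np.eye(n) + A))  # 6: residual restores full rank
print(np.linalg.cond(np.eye(n) + A))         # small: the map stays well-conditioned
```

As long as the update is not too large, the eigenvalues of (I + A) stay near 1, which is the "healthy numbers" behavior the paper attributes to residual connections.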

4. The Experiment

The researchers didn't just do math on paper; they tested it.

  • They built a simple version of the Transformer.
  • They trained one version with the "direct phone line" (Residuals) and one without.
  • The Result: The version with Residuals learned much faster and didn't get stuck. The version without them struggled, especially as the model got more complex.
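The spirit of that comparison can be reproduced in miniature. The sketch below is NOT the authors' experiment; it trains a stack of plain linear layers (no attention) to copy its input, with and without skip connections, using hand-written gradient descent:

```python
import numpy as np

d, depth, steps, lr = 4, 6, 200, 0.05
rng = np.random.default_rng(0)
X = rng.normal(size=(32, d))
Y = X.copy()  # target: reproduce the input

def train(residual):
    rng = np.random.default_rng(1)  # same initialization for a fair comparison
    Ws = [0.1 * rng.normal(size=(d, d)) for _ in range(depth)]
    for _ in range(steps):
        Hs = [X]  # forward pass, remembering each layer's input for backprop
        for W in Ws:
            Hs.append(Hs[-1] @ W + (Hs[-1] if residual else 0))
        G = (Hs[-1] - Y) / len(X)  # gradient of the squared loss w.r.t. output
        for k in reversed(range(depth)):  # backward pass
            gW = Hs[k].T @ G
            G = G @ Ws[k].T + (G if residual else 0)  # skip path carries gradient too
            Ws[k] = Ws[k] - lr * gW
    H = X  # final forward pass to report the loss
    for W in Ws:
        H = H @ W + (H if residual else 0)
    return np.mean((H - Y) ** 2)

loss_plain = train(residual=False)
loss_res = train(residual=True)
print(loss_plain, loss_res)  # the residual stack ends at a much lower loss here
```

With small initial weights, the plain stack's gradients have to pass through five near-zero matrices and barely move; the residual stack's gradients ride the skip path and training makes steady progress.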

The Big Takeaway

This paper is like a mechanic explaining why your car has a specific suspension system.

  • Before: We knew the car drove well, but we didn't know exactly how the suspension kept the wheels on the road during a storm.
  • Now: We know that the suspension (Residual Connections) prevents the car from losing traction (Rank Collapse) when the road gets bumpy (complex data). It ensures the car (the AI) reaches its destination (the solution) quickly and safely.

In short: Residual connections are the safety net that keeps Transformers from falling into a mathematical trap, ensuring they learn efficiently and reliably.
