Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

This paper finds that conventional Layer Normalization causes feature divergence and entropy collapse in Image Restoration Transformers, because its per-token statistics conflict with the spatial structure these models must preserve. The authors propose a tailored i-LN module that normalizes features holistically across the spatial and channel dimensions and adaptively rescales them, improving both training dynamics and restoration performance.

MinKyu Lee, Sangeek Hyun, Woojin Jun, Hyunjun Kim, Jiwoo Chung, Jae-Pil Heo

Published 2026-02-23

Imagine you are trying to restore an old, damaged photograph. You have a team of expert artists (the Transformer model) working together to fix the scratches, remove the rain, or make the blurry image sharp again.

For a long time, these artists have been using a specific rulebook called Layer Normalization (LN). The rulebook says: "Before you look at the whole picture, you must look at every single pixel individually, calculate its average brightness, and force it to fit into a tiny, standardized box."

The authors of this paper discovered a serious problem with this rulebook. It's like forcing every artist to work in a tiny, cramped closet. Trapped like this, the artists panic: they try to break out by shouting, so their voices (the feature magnitudes) grow explosively, orders of magnitude louder than necessary. At the same time, they all start singing the exact same note, losing their unique variety (this is the entropy collapse).

The result? The restoration gets messy, unstable, and the artists are working in a chaotic, screaming environment.

The Problem: The "Per-Token" Trap

The paper explains that the standard rulebook looks at each pixel (or "token") in isolation.

  • The Analogy: Imagine a choir where every singer is told to adjust their volume based only on their own voice, ignoring everyone else. If one singer gets a little louder, they might scream to compensate, and the whole choir becomes a cacophony.
  • The Consequence: In image restoration, this isolation breaks the natural relationship between neighboring pixels. It's like trying to understand a sentence by looking at one letter at a time without seeing the words. The "spatial correlation" (how pixels relate to their neighbors) gets destroyed.
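The per-token behavior described above can be seen in a minimal pure-Python sketch. The function name `per_token_layernorm` is ours, but the math is standard LN: each token (one pixel's channel vector) is normalized using only its own mean and variance, so the relative brightness between neighboring pixels is erased.

```python
import math

def per_token_layernorm(tokens, eps=1e-6):
    """Standard LN: each token (channel vector) is normalized
    using ONLY its own mean and variance, ignoring its neighbors."""
    out = []
    for t in tokens:
        mean = sum(t) / len(t)
        var = sum((x - mean) ** 2 for x in t) / len(t)
        out.append([(x - mean) / math.sqrt(var + eps) for x in t])
    return out

# Two neighboring "pixels": one bright, one dark.
bright = [10.0, 12.0]
dark = [1.0, 1.2]
normed = per_token_layernorm([bright, dark])
# Both tokens come out as roughly [-1, 1]: the 10x brightness gap
# between the neighbors is gone after normalization.
```

This is the "spatial correlation" loss in miniature: after normalization, the bright pixel and the dark pixel are statistically indistinguishable.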

The Solution: i-LN (The "Team Huddle" Approach)

The authors propose a new rulebook called i-LN (Image Restoration Tailored Layer Normalization). It fixes the problem with two simple, clever changes:

1. The "Group Hug" (Spatial Holisticness)

Instead of looking at one pixel at a time, i-LN looks at the entire image patch as a single group.

  • The Analogy: Instead of isolating each singer, the choir director gathers the whole group. They calculate the average volume for the entire choir and adjust everyone together.
  • The Benefit: This preserves the "shape" of the image. If one pixel is bright and its neighbor is dark, i-LN keeps that relationship intact. It stops the artists from screaming at each other and keeps the team working in harmony.
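Here is a minimal sketch of the "group hug" idea, assuming the holistic variant shares one mean and variance across all tokens and channels in the patch (the function name `holistic_norm` is ours, not the paper's):

```python
import math

def holistic_norm(tokens, eps=1e-6):
    """Sketch of spatially holistic normalization: ONE shared mean and
    variance are computed over the entire patch (all tokens, all
    channels), so relative differences between pixels survive."""
    flat = [x for t in tokens for x in t]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(x - mean) * scale for x in t] for t in tokens]

bright = [10.0, 12.0]
dark = [1.0, 1.2]
normed = holistic_norm([bright, dark])
# The bright token stays well above the dark one after normalization:
# the spatial relationship between neighbors is preserved.
```

Contrast this with per-token LN, which would map both tokens to roughly the same values; here the bright/dark ordering is kept intact.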

2. The "Dynamic Volume Knob" (Input-Adaptive Rescaling)

The old rulebook forced everyone into a rigid, pre-set box. The new rulebook says, "Hey, this specific photo is very dark, and that one is very bright. Let's adjust the volume knob for the whole team based on what we are looking at right now."

  • The Analogy: Imagine a sound engineer who doesn't just set a static volume limit. Instead, they listen to the specific song playing and dynamically adjust the master volume so the music sounds perfect for that specific track.
  • The Benefit: This allows the network to keep the unique "personality" of the image (its specific statistics) rather than flattening it into a boring, uniform shape.
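A hypothetical sketch of the "dynamic volume knob": instead of rescaling by a fixed constant, the normalized output is rescaled using statistics of the input itself. The function name `iln_sketch` and the scalar gain `gamma` are our illustrative stand-ins (the paper would use learned per-channel parameters); the point is only that the scale depends on the input.

```python
import math

def iln_sketch(tokens, gamma=1.0, eps=1e-6):
    """Hypothetical sketch of input-adaptive rescaling: normalize
    holistically, then rescale by the INPUT's own standard deviation
    (times a gain `gamma`) rather than by a fixed constant."""
    flat = [x for t in tokens for x in t]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat) + eps)
    # (x - mean) / std normalizes; multiplying by std restores the
    # input's dynamic range, so each image keeps its own "personality".
    return [[gamma * std * ((x - mean) / std) for x in t] for t in tokens]

dark_patch = [[0.10, 0.12], [0.05, 0.06]]
bright_patch = [[10.0, 12.0], [5.0, 6.0]]
# Each patch is rescaled by its own statistics: the bright patch
# retains a far larger dynamic range than the dark one.
```

A static gain would squeeze both patches toward the same range; the input-adaptive version keeps their per-image statistics distinct.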

Why Does This Matter?

The paper shows that with this new approach:

  1. No More Screaming: The "shouting" (feature divergence) stops. The numbers stay calm and manageable, even when the model gets very large.
  2. Better Restoration: The restored images are sharper, have fewer artifacts (weird glitches), and look more natural.
  3. Works on Cheap Hardware: Because the feature values stay in a sane range, the model runs reliably in low-precision arithmetic (e.g., FP16 on mobile devices) without overflowing, crashing, or producing black screens.

The Bottom Line

The authors realized that the standard way of training these AI models was like trying to fix a delicate painting while wearing heavy, restrictive gloves. They simply took the gloves off and gave the artists a better set of tools (i-LN) that let them see the whole picture and adjust their work dynamically.

The result? A much more stable, efficient, and high-quality image restoration system that works better across the board, from removing rain to fixing old photos.
