Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

This paper finds that conventional Layer Normalization causes feature divergence and entropy collapse in Image Restoration Transformers, because its per-token statistics conflict with the spatial structure these models must preserve. The authors propose a tailored i-LN module that normalizes features holistically across the spatial and channel dimensions and adaptively rescales them, improving both training dynamics and restoration performance.

MinKyu Lee, Sangeek Hyun, Woojin Jun, Hyunjun Kim, Jiwoo Chung, Jae-Pil Heo

Published 2026-02-23

Imagine you are trying to restore an old, damaged photograph. You have a team of expert artists (the Transformer model) working together to fix the scratches, remove the rain, or make the blurry image sharp again.

For a long time, these artists have been using a specific rulebook called Layer Normalization (LN). The rulebook says: "Before you look at the whole picture, you must look at every single pixel individually, calculate its average brightness, and force it to fit into a tiny, standardized box."

The authors of this paper discovered a serious problem with this rulebook. It's like forcing every artist to work in a tiny, cramped closet. Trapped like this, the artists panic: they try to break out by shouting, so their voices (the feature magnitudes) grow explosively, orders of magnitude louder than necessary. At the same time, they all start singing the exact same note, losing their unique variety (this is the entropy collapse).

The result? The restoration gets messy, unstable, and the artists are working in a chaotic, screaming environment.

The Problem: The "Per-Token" Trap

The paper explains that the standard rulebook looks at each pixel (or "token") in isolation.

  • The Analogy: Imagine a choir where every singer is told to adjust their volume based only on their own voice, ignoring everyone else. If one singer gets a little louder, they might scream to compensate, and the whole choir becomes a cacophony.
  • The Consequence: In image restoration, this isolation breaks the natural relationship between neighboring pixels. It's like trying to understand a sentence by looking at one letter at a time without seeing the words. The "spatial correlation" (how pixels relate to their neighbors) gets destroyed.
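The per-token behavior described above can be seen in a minimal pure-Python sketch. The function name `per_token_layernorm` is ours, but the math is standard LN: each token (one pixel's channel vector) is normalized using only its own mean and variance, so the relative brightness between neighboring pixels is erased.

```python
import math

def per_token_layernorm(tokens, eps=1e-6):
    """Standard LN: each token (channel vector) is normalized
    using ONLY its own mean and variance, ignoring its neighbors."""
    out = []
    for t in tokens:
        mean = sum(t) / len(t)
        var = sum((x - mean) ** 2 for x in t) / len(t)
        out.append([(x - mean) / math.sqrt(var + eps) for x in t])
    return out

# Two neighboring "pixels": one bright, one dark.
bright = [10.0, 12.0]
dark = [1.0, 1.2]
normed = per_token_layernorm([bright, dark])
# Both tokens come out as roughly [-1, 1]: the 10x brightness gap
# between the neighbors is gone after normalization.
```

This is the "spatial correlation" loss in miniature: after normalization, the bright pixel and the dark pixel are statistically indistinguishable.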

The Solution: i-LN (The "Team Huddle" Approach)

The authors propose a new rulebook called i-LN (Image Restoration Tailored Layer Normalization). It fixes the problem with two simple, clever changes:

1. The "Group Hug" (Spatial Holisticness)

Instead of looking at one pixel at a time, i-LN looks at the entire image patch as a single group.

  • The Analogy: Instead of isolating each singer, the choir director gathers the whole group. They calculate the average volume for the entire choir and adjust everyone together.
  • The Benefit: This preserves the "shape" of the image. If one pixel is bright and its neighbor is dark, i-LN keeps that relationship intact. It stops the artists from screaming at each other and keeps the team working in harmony.
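Here is a minimal sketch of the "group hug" idea, assuming the holistic variant shares one mean and variance across all tokens and channels in the patch (the function name `holistic_norm` is ours, not the paper's):

```python
import math

def holistic_norm(tokens, eps=1e-6):
    """Sketch of spatially holistic normalization: ONE shared mean and
    variance are computed over the entire patch (all tokens, all
    channels), so relative differences between pixels survive."""
    flat = [x for t in tokens for x in t]
    mean = sum(flat) / len(flat)
    var = sum((x - mean) ** 2 for x in flat) / len(flat)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(x - mean) * scale for x in t] for t in tokens]

bright = [10.0, 12.0]
dark = [1.0, 1.2]
normed = holistic_norm([bright, dark])
# The bright token stays well above the dark one after normalization:
# the spatial relationship between neighbors is preserved.
```

Contrast this with per-token LN, which would map both tokens to roughly the same values; here the bright/dark ordering is kept intact.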

2. The "Dynamic Volume Knob" (Input-Adaptive Rescaling)

The old rulebook forced everyone into a rigid, pre-set box. The new rulebook says, "Hey, this specific photo is very dark, and that one is very bright. Let's adjust the volume knob for the whole team based on what we are looking at right now."

  • The Analogy: Imagine a sound engineer who doesn't just set a static volume limit. Instead, they listen to the specific song playing and dynamically adjust the master volume so the music sounds perfect for that specific track.
  • The Benefit: This allows the network to keep the unique "personality" of the image (its specific statistics) rather than flattening it into a boring, uniform shape.
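A hypothetical sketch of the "dynamic volume knob": instead of rescaling by a fixed constant, the normalized output is rescaled using statistics of the input itself. The function name `iln_sketch` and the scalar gain `gamma` are our illustrative stand-ins (the paper would use learned per-channel parameters); the point is only that the scale depends on the input.

```python
import math

def iln_sketch(tokens, gamma=1.0, eps=1e-6):
    """Hypothetical sketch of input-adaptive rescaling: normalize
    holistically, then rescale by the INPUT's own standard deviation
    (times a gain `gamma`) rather than by a fixed constant."""
    flat = [x for t in tokens for x in t]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((x - mean) ** 2 for x in flat) / len(flat) + eps)
    # (x - mean) / std normalizes; multiplying by std restores the
    # input's dynamic range, so each image keeps its own "personality".
    return [[gamma * std * ((x - mean) / std) for x in t] for t in tokens]

dark_patch = [[0.10, 0.12], [0.05, 0.06]]
bright_patch = [[10.0, 12.0], [5.0, 6.0]]
# Each patch is rescaled by its own statistics: the bright patch
# retains a far larger dynamic range than the dark one.
```

A static gain would squeeze both patches toward the same range; the input-adaptive version keeps their per-image statistics distinct.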

Why Does This Matter?

The paper shows that with this new approach:

  1. No More Screaming: The "shouting" (feature divergence) stops. The numbers stay calm and manageable, even when the model gets very large.
  2. Better Restoration: The restored images are sharper, have fewer artifacts (weird glitches), and look more natural.
  3. Works on Cheap Hardware: Because the feature values stay in a sane range, the model runs reliably in low-precision arithmetic (e.g., FP16 on mobile devices) without overflowing, crashing, or producing black screens.

The Bottom Line

The authors realized that the standard way of training these AI models was like trying to fix a delicate painting while wearing heavy, restrictive gloves. They simply took the gloves off and gave the artists a better set of tools (i-LN) that let them see the whole picture and adjust their work dynamically.

The result? A much more stable, efficient, and high-quality image restoration system that works better across the board, from removing rain to fixing old photos.
