ATD: Improved Transformer with Adaptive Token Dictionary for Image Restoration

Imagine you have a blurry, scratched, or pixelated photo that you desperately want to fix. Maybe it's an old family picture, a low-resolution screenshot, or a noisy selfie. This is the world of Image Restoration.

For a long time, computers tried to fix these photos using "local" thinking. They looked at a tiny neighborhood of pixels and asked, "What does the pixel next to me look like?" This is like trying to solve a jigsaw puzzle by only looking at the pieces immediately touching the one you're holding. It works okay for smooth areas, but if you need to fix a complex pattern (like a brick wall or a tree branch) that appears in different parts of the image, this local approach fails.

Enter Transformers. These are powerful AI models that can look at the entire image at once to find patterns. However, looking at every single pixel against every other pixel is like trying to introduce every person in a stadium to every other person. It's incredibly slow and computationally expensive (it takes too much energy and time). To speed things up, most current models only look at small "windows" or neighborhoods again, losing the big picture.

ATD (Adaptive Token Dictionary) is the new hero of this paper. It solves the problem of "How do we see the whole picture without getting tired?" by using a clever mix of a Dictionary, a Categorization System, and a Smart Assistant.

Here is how it works, using simple analogies:

1. The "Master Dictionary" (The External Brain)

Imagine you are trying to fix a broken vase. Instead of just guessing what the missing piece looks like based on the shards next to it, you have a Master Dictionary of every possible vase pattern in the world.

How ATD does it: The AI learns a "Token Dictionary" during training. This is a library of "typical image structures" (like edges, textures, or repeating patterns) it has seen in thousands of training photos.
The Magic: When the AI sees a blurry patch, it doesn't just guess. It asks its Dictionary: "Hey, does this blurry patch look like the 'brick wall' pattern in entry #42, or the 'leaf' pattern in entry #89?" It pulls in this external knowledge to help fill in the gaps.

2. The "Smart Sorter" (Adaptive Categorization)

Usually, AI models chop an image into a grid (like a spreadsheet) and only talk to neighbors in the same square. This is rigid.

The ATD Innovation: Instead of sorting pixels by where they are (top-left, bottom-right), ATD sorts them by what they are.
The Analogy: Imagine a massive library. Instead of organizing books by their shelf location, you organize them by "Genre." All the "Sci-Fi" books are grouped together, even if they are on different floors.
Why it helps: If the AI is trying to fix a window in a building, it groups that window with all other windows in the image, even if they are on opposite sides of the photo. This allows the AI to say, "I know what a window looks like because I just looked at the window on the other side of the photo." The result is attention that reaches across the whole image without the heavy cost of comparing every pixel with every other pixel.

3. The "Specialized Assistant" (Category-Aware FFN)

Once the AI has grouped similar things together, it needs to process them.

The Innovation: The paper introduces a "Category-Aware Feed-Forward Network." Think of this as a specialized assistant who knows exactly which "Genre" of book you are reading.
How it works: If the AI is processing a group of "sky" pixels, this assistant knows to apply "sky-like" rules (smooth gradients, blue tones). If it's processing "fur," it applies "fur-like" rules (texture, noise). It adapts its processing based on the category it just sorted the pixels into.

The Result: Why is this better?

Speed vs. Quality: Old methods were either fast but blurry (local windows) or sharp but slow (global attention). ATD is like an express train network that still reaches every part of the city: it gets the global view, but its cost grows only linearly with image size because each token talks only to its relevant group.
Real-World Impact: The authors tested this on:
- Super-Resolution: Making small, blurry images huge and sharp.
- Denoising: Removing grainy static from photos.
- JPEG Artifact Removal: Fixing the blocky artifacts from compressed images.

In a nutshell:
ATD is like a master restorer who carries a library of perfect patterns (the Dictionary), groups similar items together regardless of where they are in the room (Adaptive Categorization), and uses a specialized tool for each group (Category-Aware Assistant). This allows it to fix damaged photos better than the earlier methods it was compared against, often with less time or memory, while seeing the "big picture" without getting overwhelmed.

1. Problem Statement

Image restoration (IR) tasks, such as super-resolution (SR), denoising, and JPEG artifact removal, aim to reconstruct high-quality images from degraded inputs. While Transformers have surpassed Convolutional Neural Networks (CNNs) in IR due to their ability to model long-range dependencies via self-attention, they face a critical bottleneck: computational complexity.

Quadratic Complexity: Standard self-attention mechanisms scale quadratically with the number of tokens ( $O(N^2)$ ), making them computationally prohibitive for high-resolution images.
Local Window Limitations: To mitigate this, existing methods (e.g., SwinIR, HAT) restrict attention to local windows. However, this limits the receptive field, preventing the model from capturing global self-similarities and long-range dependencies essential for complex restoration tasks.
Sparse Attention Trade-offs: Alternative sparse attention methods often suffer from poor relevance preservation or fail to balance efficiency with performance effectively.

The core challenge is to achieve global dependency modeling with linear computational complexity relative to the image size.
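To make the gap between quadratic and linear cost concrete, the short sketch below counts how many query-key pairs one attention layer scores. The function name `attention_pairs` and the group size of 256 are illustrative assumptions, not values taken from the paper.

```python
from typing import Optional


def attention_pairs(num_tokens: int, group_size: Optional[int] = None) -> int:
    """Count query-key pairs scored by one attention layer.

    group_size=None models full self-attention over all tokens.
    Otherwise tokens attend only inside groups of `group_size` tokens
    (a local window or a fixed-size sub-category).
    """
    if group_size is None:
        return num_tokens * num_tokens
    num_groups = -(-num_tokens // group_size)  # ceiling division
    return num_groups * group_size * group_size


# Hypothetical feature-map sizes; one token per spatial position.
for side in (64, 128, 256):
    n = side * side
    full = attention_pairs(n)
    grouped = attention_pairs(n, group_size=256)
    print(f"{side}x{side}: full={full:,} grouped={grouped:,}")
```

Doubling the side length multiplies the token count by 4, so the full-attention count grows 16 times while the grouped count grows only 4 times.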

2. Methodology

The authors propose Adaptive Token Dictionary (ATD), a novel Transformer architecture that integrates external priors (learned from data) and adaptive grouping strategies to overcome the limitations of window-based attention. The framework consists of three core components:

A. Token Dictionary Cross-Attention (TDCA)

Inspired by traditional dictionary learning, ATD introduces a learnable token dictionary ( $D$ ) that summarizes typical image structures (external priors) during training.

Mechanism: Instead of computing attention only between input tokens, TDCA computes cross-attention between input tokens (Query) and the learned dictionary (Key/Value).
Logarithmic Scaling: To address the issue of attention weight dilution as the dictionary size increases, the authors propose a reparameterized scaling factor: $\tau' = 1 + \tau \ln(M)$ , where $M$ is the dictionary size. This logarithmic scaling enhances the contrast between relevant and irrelevant dictionary entries, enforcing sparsity and ensuring the model focuses on the most informative structural patterns.
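The cross-attention and the scaling factor described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the projection matrices `w_q`, `w_k`, `w_v`, the use of cosine similarity, and applying $\tau'$ as a multiplier on the similarity scores are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def token_dictionary_cross_attention(x, dictionary, w_q, w_k, w_v, tau):
    """x: (N, C) input tokens; dictionary: (M, C) learned tokens.

    Returns the attended features (N, d) and the attention map (N, M).
    """
    m = dictionary.shape[0]
    q = l2_normalize(x @ w_q)            # queries come from the image tokens
    k = l2_normalize(dictionary @ w_k)   # keys come from the dictionary
    v = dictionary @ w_v                 # values come from the dictionary
    scale = 1.0 + tau * np.log(m)        # tau' = 1 + tau * ln(M)
    # Assumption: tau' multiplies the cosine similarities before softmax.
    attn = softmax(scale * (q @ k.T), axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
n, m, c = 6, 32, 8                       # hypothetical sizes
x = rng.normal(size=(n, c))
dictionary = rng.normal(size=(m, c))
w_q, w_k, w_v = (rng.normal(size=(c, c)) for _ in range(3))
out, attn = token_dictionary_cross_attention(x, dictionary, w_q, w_k, w_v, tau=1.0)
print(out.shape, attn.shape)             # (6, 8) (6, 32)
```

Because the scale grows with $\ln(M)$, a larger dictionary gets a sharper softmax, which counteracts the tendency of the weights to spread thinly across many entries.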

B. Adaptive Category-Based Self-Attention (AC-MSA)

Rather than partitioning images based on fixed spatial coordinates (windows), ATD partitions tokens based on semantic similarity derived from the TDCA attention maps.

Categorization: Each input token is assigned to a category based on the dictionary atom it most closely resembles (i.e., the index of the maximum attention value in the TDCA map).
Global Grouping: Tokens with similar structural features are grouped together, regardless of their spatial distance. This allows the model to connect distant but similar regions (e.g., repeating textures in different parts of an image).
Efficiency: To maintain linear complexity and ensure parallelism, these large categories are further divided into fixed-size sub-categories. Self-attention is then computed within these sub-categories. This approach enables global self-similarity mining without the quadratic cost of full self-attention.
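The three steps above (categorize, group, attend within fixed-size sub-categories) can be sketched as follows. This is a simplified illustration: it omits the query/key/value projections and multiple heads, and it assumes sub-categories are formed by sorting tokens by category and cutting the sorted sequence into equal chunks, so a chunk may straddle two neighboring categories. The function name and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adaptive_category_attention(x, tdca_attn, sub_size):
    """x: (N, d) tokens; tdca_attn: (N, M) attention map from TDCA."""
    n, d = x.shape
    # 1. Categorize: index of the dictionary entry each token attends to most.
    categories = tdca_attn.argmax(axis=-1)
    # 2. Group: a stable sort places tokens of the same category next to
    #    each other, regardless of where they sit in the image.
    order = np.argsort(categories, kind="stable")
    out = np.empty_like(x)
    # 3. Attend inside fixed-size sub-categories of the sorted sequence.
    for start in range(0, n, sub_size):
        idx = order[start:start + sub_size]
        group = x[idx]
        weights = softmax(group @ group.T / np.sqrt(d))
        out[idx] = weights @ group       # scatter results back to original positions
    return out, categories


rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))
tdca_attn = rng.random(size=(12, 3))     # stand-in for a real TDCA attention map
out, categories = adaptive_category_attention(x, tdca_attn, sub_size=4)
print(categories)
print(out.shape)                         # (12, 4)
```

Each chunk costs on the order of `sub_size ** 2` operations and there are about `N / sub_size` chunks, so the total grows linearly with `N` for a fixed `sub_size`.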

C. Category-Aware Feed-Forward Network (CFFN)

The authors enhance the standard Feed-Forward Network (FFN) by injecting category information.

Mechanism: The most relevant dictionary token (representing the token's category) is concatenated with the intermediate features before the depth-wise convolution.
Benefit: This allows the FFN to adaptively fuse local features with global structural priors, improving feature representation and restoration quality.
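A rough sketch of this mechanism is shown below. It assumes a ReLU activation, a 3×3 depth-wise kernel with zero padding, and plain weight matrices `w_in` and `w_out`; none of these details are given in the text above, and all names and sizes are illustrative.

```python
import numpy as np


def depthwise_conv3x3(feat, kernel):
    """feat: (H, W, C); kernel: (3, 3, C). Zero padding, stride 1.

    Each channel is filtered with its own 3x3 kernel (no channel mixing).
    """
    h, w, _ = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feat)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out


def category_aware_ffn(x, dictionary, tdca_attn, w_in, dw_kernel, w_out, height, width):
    """x: (H*W, C) tokens; dictionary: (M, C); tdca_attn: (H*W, M)."""
    categories = tdca_attn.argmax(axis=-1)
    prior = dictionary[categories]                     # most relevant dictionary token per input token
    hidden = np.maximum(x @ w_in, 0.0)                 # intermediate features (ReLU is an assumption)
    fused = np.concatenate([hidden, prior], axis=-1)   # inject category information
    fused = fused.reshape(height, width, -1)           # restore the spatial layout
    mixed = depthwise_conv3x3(fused, dw_kernel)        # local mixing after the concatenation
    return mixed.reshape(height * width, -1) @ w_out


rng = np.random.default_rng(0)
height, width, c, hidden_dim, m = 4, 4, 8, 16, 5       # hypothetical sizes
x = rng.normal(size=(height * width, c))
dictionary = rng.normal(size=(m, c))
tdca_attn = rng.random(size=(height * width, m))       # stand-in for a real TDCA map
w_in = rng.normal(size=(c, hidden_dim))
dw_kernel = rng.normal(size=(3, 3, hidden_dim + c))
w_out = rng.normal(size=(hidden_dim + c, c))
y = category_aware_ffn(x, dictionary, tdca_attn, w_in, dw_kernel, w_out, height, width)
print(y.shape)                                         # (16, 8)
```

Placing the concatenation before the depth-wise convolution means the convolution sees both the token's own features and its category prior when it mixes information from neighboring positions.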

Architecture Variants:

ATD: Designed for Image Super-Resolution (SR), utilizing a residual-in-residual architecture.
ATD-U: A U-Net based multi-scale variant designed for Image Denoising and JPEG Compression Artifact Removal (CAR).

3. Key Contributions

Novel Attention Mechanism: Proposes a Transformer framework that leverages a learnable token dictionary to incorporate external image priors, enabling global dependency modeling with linear complexity.
Adaptive Partitioning: Introduces a content-aware, category-based partitioning strategy that groups similar features across the entire image, overcoming the limited receptive field of window-based methods.
Architectural Improvements:
- A reparameterized scaling factor for TDCA to prevent attention dilution in large dictionaries.
- A Category-aware FFN (CFFN) that adaptively fuses structural priors into feature transformation.
State-of-the-Art Performance: The proposed models (ATD and ATD-U) achieve superior results across multiple benchmarks compared to existing CNN and Transformer-based methods.

4. Experimental Results

The authors evaluated ATD on standard benchmarks for Super-Resolution, Denoising, and JPEG CAR.

Image Super-Resolution (SR):
- ATD outperforms SOTA methods (including HAT, SwinIR, and MambaIRv2) on Set5, Set14, BSD100, Urban100, and Manga109.
- On Urban100 (known for complex repetitive structures), ATD improves PSNR by 0.29–0.40 dB over HAT and 0.27–0.35 dB over MambaIRv2.
- ATD-light (lightweight version) achieves the best performance among lightweight models, surpassing MambaIRv2-light by up to 0.28 dB on ×4 SR.
- Efficiency: ATD achieves faster inference speeds (25–50% faster than MambaIRv2) and lower GPU memory usage (~30% less than HAT) while delivering higher accuracy.
Image Denoising & JPEG CAR:
- ATD-U achieves state-of-the-art results on color and grayscale denoising benchmarks (CBSD68, Kodak24, Set12, BSD68) and JPEG artifact removal (Classic5, LIVE1, Urban100).
- Qualitative results show superior recovery of fine textures and sharp edges in high-noise and high-compression scenarios compared to DRUNet, Restormer, and SwinIR.
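For readers unfamiliar with the metric behind the dB figures above, here is a minimal PSNR sketch. It assumes 8-bit images with a peak value of 255 and computes the error over all pixels and channels; benchmark protocols often add steps such as border cropping or evaluating on a luminance channel, which are not modeled here.

```python
import numpy as np


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")              # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)
noise = rng.normal(scale=5.0, size=reference.shape)
restored = np.clip(reference + noise, 0, 255).astype(np.uint8)
print(f"{psnr(reference, restored):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.3 dB corresponds to roughly a 7% reduction in mean squared error.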

5. Significance

This work addresses the fundamental trade-off between receptive field size and computational efficiency in Transformer-based image restoration. By shifting from spatial partitioning to semantic/category-based partitioning guided by a learned dictionary, ATD successfully models global self-similarities without incurring quadratic complexity.

The significance lies in:

Bridging Traditional and Deep Learning: It effectively bridges the gap between traditional dictionary learning (external priors) and modern Transformer architectures.
Scalability: It provides a scalable solution for high-resolution image restoration, making global attention feasible for practical applications.
Versatility: The framework is adaptable to various low-level vision tasks (SR, denoising, CAR) through its multi-scale U-Net variant (ATD-U).

In conclusion, ATD represents a significant step forward in efficient global dependency modeling, setting a new benchmark for performance and efficiency in image restoration tasks.
