Towards High-resolution and Disentangled Reference-based Sketch Colorization

Imagine you are an artist who loves drawing black-and-white sketches of anime characters. Now, imagine you have a friend who is a master colorist. You want your friend to color your sketch based on a specific photo they have (maybe a photo of a sunset or a specific outfit).

The Problem: The "Confused" Colorist
In the past, computer programs trying to do this had a major brain glitch. They were trained by showing them a sketch and its perfect matching colored photo. But in the real world, you might give the computer a sketch of a cat and a photo of a car.

Because the computer was only trained on perfect matches, it got confused. It started thinking, "Oh, the photo tells me where to put the lines!" So, if you showed it a photo of a car, it might try to draw car wheels onto your cat sketch. It mixed up where things are (the sketch) with what things look like (the photo). We call this "Spatial Entanglement." It's like a chef who, when asked to cook a steak, accidentally starts building a house because the picture of the house was on the table.

The Solution: A "Dual-Brain" Training System
The researchers in this paper built a smarter system to fix this confusion. Here is how they did it, using some simple analogies:

1. The "Dual-Brain" Architecture (The Training Gym)

Instead of just one brain, they gave the AI two "brains" (or branches) that work together during training:

Brain A (The Idealist): This brain sees a sketch and its perfect matching photo. It learns what the final result should look like.
Brain B (The Realist): This brain sees a sketch and a random, mismatched photo (like the cat sketch and the car photo). It learns to handle the messiness of the real world.

By training both brains at the same time, the AI learns a crucial lesson: "The photo tells me the style and colors, but the sketch tells me the shape and location." It learns to ignore the photo when it comes to drawing lines.

2. The "Gram Regularization" (The Strict Coach)

How do you make sure the "Realist" brain doesn't get confused? The researchers added a special rule called Gram Regularization.

Think of this as a strict coach standing between the two brains. The coach looks at the "fingerprint" of what each brain is thinking. If the "Realist" brain starts copying the shapes from the random photo (like trying to draw car wheels on the cat), the coach slaps its hand and says, "No! Look at the sketch! Only the sketch decides the shape!"

This forces the AI to keep the "where" (sketch) and the "what" (color) completely separate, preventing the weird artifacts where objects bleed into each other.

3. The "Anime Dictionary" (The Tagger)

To make the colors look perfect, especially for anime, they replaced the AI's standard "language translator" with a specialized Anime Dictionary (WD-Tagger).

Old Way: The AI might just see "red" and "girl."
New Way: This special dictionary knows specific details like "school uniform," "long hair," "sunset background," or "blue eyes." It acts like a super-precise instruction manual, ensuring the AI picks the exact right shade of red for the specific type of outfit in the reference photo.

4. The "Texture Injector" (The Detail Brush)

Sometimes, the AI gets the colors right but the textures look blurry or flat, especially in the background. The researchers added a Plugin Module.
Think of this as a special brush that only paints the background and fine details. It takes the "vibe" and texture from the reference photo and gently glues it onto the background of the sketch, making the whole image look crisp and high-resolution (up to 1280px!), rather than blurry.

The Result

When you put all these tools together, the result is magic:

High Resolution: You can zoom in, and the lines and textures are sharp.
No Confusion: The AI never mixes up the sketch's shape with the photo's content.
Precise Control: If the reference photo has a specific hat color, the AI puts that exact hat color on your character, without messing up the rest of the drawing.

In a nutshell: This paper teaches an AI to be a professional colorist who knows exactly when to listen to the sketch (for shapes) and when to listen to the reference photo (for colors), without ever getting the two mixed up. The result is beautiful, high-quality, and perfectly controlled digital art.

1. Problem Statement

The paper addresses the critical challenge of Reference-based Sketch Colorization, where a model must colorize a line drawing (sketch) using a reference image for style and color guidance. The core difficulty identified is Spatial Entanglement caused by Distribution Shift:

Training vs. Inference Mismatch: During training, models use semantically aligned triplets (sketch, reference, and ground truth derived from the same image). However, during inference, the sketch and reference image are often arbitrary and semantically unrelated.
Spatial Entanglement: Due to this distribution shift, models erroneously learn that the reference image dictates the spatial structure of the output. This leads to severe artifacts such as:
- Structural contradictions (e.g., objects appearing where they shouldn't).
- Color bleeding across boundaries.
- Blurring or distortion of the original sketch geometry.
Limitations of Prior Work: Existing methods (GANs or Diffusion-based) primarily focus on mitigating visual artifacts post-hoc or using split attention mechanisms, but they fail to fundamentally resolve the distribution shift or achieve high-resolution (1024px+) synthesis with precise controllability.

2. Methodology

The authors propose a framework built on Stable Diffusion XL (SDXL) that explicitly models and corrects the distribution shift through three main components:

A. Dual-Branch Feature Alignment (DBFA)

To address the distribution shift, the authors introduce a dual-branch architecture that operates within a single training step:

Semantic-Aligned Branch: Takes a sketch and a reference derived from the same ground truth (simulating the training distribution).
Semantic-Misaligned Branch: Takes a sketch and a randomly sampled reference from the dataset (simulating the inference distribution).

Mechanism: Both branches share weights. The goal is to force the Misaligned Branch to produce spatial features identical to the Aligned Branch, despite the reference image being irrelevant to the sketch's structure.

B. Gram Regularization Loss

To enforce the disentanglement of spatial semantics (sketch) from style semantics (reference), a novel Gram Regularization Loss is applied:

Concept: The Gram matrix of feature maps captures spatial correlations between patches. The loss minimizes the difference between the Gram matrices of the two branches.
Effect: This forces the Misaligned Branch to ignore the spatial cues of the random reference image and rely exclusively on the sketch for structural segmentation.
Self-Anchoring: The Aligned Branch acts as a "fixed anchor" (gradients are stopped) to stabilize the optimization, preventing the network from collapsing.

C. Enhanced Control and Texture Transfer

WD-Tagger Network: Replaces the standard CLIP-L text encoder with a domain-specific WD-Tagger (trained on anime datasets). This provides fine-grained attribute control (e.g., specific hair colors, clothing types) and reduces stylistic bias, ensuring the model understands specific anime semantics better than generic CLIP embeddings.
Plugin Module: A dedicated module for the refinement stage that injects low-level visual features from the reference image. This improves texture synthesis, background consistency, and global style transfer, addressing the issue where Gram regularization might otherwise make backgrounds too random.

3. Key Contributions

Explicit Distribution Shift Modeling: The first framework to explicitly model the training/inference distribution gap using a dual-branch architecture to fundamentally solve spatial entanglement rather than just mitigating artifacts.
Gram Regularization Loss: A novel loss function that enforces cross-domain distribution coherence, ensuring spatial features remain invariant to the reference image.
High-Resolution & Controllability: Integration of SDXL, a specialized Tagger Network, and a Plugin module to achieve high-resolution (1024px–1280px) outputs with precise attribute control and vivid textures.
Comprehensive Evaluation: Extensive ablation studies, quantitative metrics, and user studies validating the superiority of the approach.

4. Experimental Results

The method was evaluated against State-of-the-Art (SOTA) baselines including Yan et al., ColorizeDiffusion, IP-Adapter, MagicColor, MangaNinja, and Cobra.

Quantitative Performance:
- Achieved the lowest FID (8.28), indicating superior perceptual quality and distribution alignment compared to the next best (12.09).
- Outperformed all baselines in MS-SSIM (0.70) and CLIP Score (0.912), demonstrating better structural similarity and semantic alignment.
- While MangaNinja had a slightly higher PSNR, the authors attribute this to its inability to generate complex backgrounds (leading to "average" results that score well on pixel-wise metrics but lack visual richness).
Qualitative Performance:
- Generated results at 1024px–1280px resolutions with no spatial artifacts.
- Successfully handled cross-content scenarios (e.g., applying a scenery reference to a character sketch) without structural collapse.
- Preserved fine details (textures, eye colors) and maintained geometric consistency with the input sketch.
User Study:
- In a study with 30 participants, the proposed method was preferred in 68.8% to 79.6% of comparisons against all other baselines, with statistical significance ( $p < 0.01$ ).

5. Significance

This work represents a significant leap in image-guided generation for creative industries (animation and digital illustration).

Fundamental Breakthrough: It moves beyond "patching" artifacts to solving the root cause of distribution shift in reference-based generation.
Industrial Applicability: By enabling high-resolution, artifact-free colorization with precise control over specific attributes, the method is directly applicable to streamlining animation production workflows, reducing manual labor, and enhancing digital illustration tools.
Open Source: The code is released, fostering further research in disentangled diffusion models.

Code Availability: https://github.com/tellurion-kanata/ColorizeDiffusionXL