Towards High-resolution and Disentangled Reference-based Sketch Colorization

This paper presents a dual-branch framework with Gram Regularization Loss and an anime-specific Tagger Network to directly minimize the distribution shift between training and inference data, achieving state-of-the-art high-resolution, disentangled, and controllable reference-based sketch colorization.

Dingkun Yan, Xinrui Wang, Ru Wang, Zhuoru Li, Jinze Yu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo

Published 2026-03-09

Imagine you are an artist who loves drawing black-and-white sketches of anime characters. Now, imagine you have a friend who is a master colorist. You want your friend to color your sketch based on a specific photo they have (maybe a photo of a sunset or a specific outfit).

The Problem: The "Confused" Colorist
In the past, computer programs trying to do this had a major brain glitch. They were trained by showing them a sketch and its perfectly matching colored photo. But in the real world, you might give the computer a sketch of a cat and a photo of a car.

Because the computer was only trained on perfect matches, it got confused. It started thinking, "Oh, the photo tells me where to put the lines!" So, if you showed it a photo of a car, it might try to draw car wheels onto your cat sketch. It mixed up where things are (the sketch) with what things look like (the photo). We call this "Spatial Entanglement." It's like a chef who, when asked to cook a steak, accidentally starts building a house because the picture of the house was on the table.

The Solution: A "Dual-Brain" Training System
The researchers in this paper built a smarter system to fix this confusion. Here is how they did it, using some simple analogies:

1. The "Dual-Brain" Architecture (The Training Gym)

Instead of just one brain, they gave the AI two "brains" (or branches) that work together during training:

  • Brain A (The Idealist): This brain sees a sketch and its perfectly matching photo. It learns what the final result should look like.
  • Brain B (The Realist): This brain sees a sketch and a random, mismatched photo (like the cat sketch and the car photo). It learns to handle the messiness of the real world.

By training both brains at the same time, the AI learns a crucial lesson: "The photo tells me the style and colors, but the sketch tells me the shape and location." It learns to ignore the photo when it comes to drawing lines.
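The two-branch training step can be sketched in a few lines of Python. Note that `denoise` and `loss` here are toy stand-ins (the real system is a diffusion model working on image tensors, and the exact losses in the paper may differ); the point is the one detail that matters: the mismatched branch is still graded against the *matched* photo, so layout can only come from the sketch.

```python
import random

# Toy stand-ins (hypothetical): a real system would run a diffusion UNet
# on image tensors. Here a "photo" is just a number standing in for color.
def denoise(sketch, reference):
    # Pretend the model blends the sketch's shape with the reference's color
    return (sketch, reference)

def loss(prediction, target_photo):
    _shape, color = prediction
    # Penalize any color deviation from the ground-truth colored photo
    return abs(color - target_photo)

def dual_branch_step(sketch, matched_photo, photo_pool):
    # Branch A (the Idealist): sketch + its own matching photo
    loss_a = loss(denoise(sketch, matched_photo), matched_photo)
    # Branch B (the Realist): sketch + a random, mismatched reference.
    # Crucially, the target is STILL the matched photo, so the model
    # learns that the reference may only contribute style, never layout.
    mismatched = random.choice(photo_pool)
    loss_b = loss(denoise(sketch, mismatched), matched_photo)
    return loss_a + loss_b
```

The design choice to keep the matched photo as the target in both branches is what forces the separation: copying shapes from the reference can only ever increase Branch B's loss.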

2. The "Gram Regularization" (The Strict Coach)

How do you make sure the "Realist" brain doesn't get confused? The researchers added a special rule called Gram Regularization.

Think of this as a strict coach standing between the two brains. The coach looks at the "fingerprint" of what each brain is thinking. If the "Realist" brain starts copying the shapes from the random photo (like trying to draw car wheels on the cat), the coach slaps its hand and says, "No! Look at the sketch! Only the sketch decides the shape!"

This forces the AI to keep the "where" (sketch) and the "what" (color) completely separate, preventing the weird artifacts where objects bleed into each other.
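The "fingerprint" the coach compares is a Gram matrix: channel-by-channel correlations of a feature map, which capture texture and color statistics while throwing away spatial layout. A minimal NumPy sketch of this idea (the paper's exact loss and the features it is applied to may differ):

```python
import numpy as np

def gram_matrix(feat):
    # feat: (C, H, W) feature map from one branch of the network.
    # Correlating channels against each other keeps "what it looks like"
    # (style statistics) and discards "where things are" (layout).
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def gram_regularization(feat_matched, feat_mismatched):
    # The "strict coach": penalize the Realist branch whenever its style
    # statistics drift away from the Idealist branch's.
    diff = gram_matrix(feat_matched) - gram_matrix(feat_mismatched)
    return float(np.mean(diff ** 2))
```

A nice sanity check: shuffling a feature map's pixels leaves its Gram matrix unchanged, which is exactly why this loss can police style without ever constraining shape.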

3. The "Anime Dictionary" (The Tagger)

To make the colors look perfect, especially for anime, they replaced the AI's standard "language translator" with a specialized Anime Dictionary (WD-Tagger).

  • Old Way: The AI might just see "red" and "girl."
  • New Way: This special dictionary knows specific details like "school uniform," "long hair," "sunset background," or "blue eyes." It acts like a super-precise instruction manual, ensuring the AI picks the exact right shade of red for the specific type of outfit in the reference photo.
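In practice, a tagger like WD-Tagger outputs a confidence score per tag, and the confident tags become the conditioning "prompt." The thresholding step can be sketched like this (the threshold value and the exact way the paper feeds tags into the model are assumptions here):

```python
def tags_to_prompt(tag_scores, threshold=0.35):
    # tag_scores: {tag: confidence} as produced by an anime tagger.
    # Keep only tags the tagger is confident about, most confident first,
    # turning a generic "red, girl" into a precise instruction manual.
    kept = sorted(
        (tag for tag, score in tag_scores.items() if score >= threshold),
        key=lambda tag: -tag_scores[tag],
    )
    return ", ".join(kept)

scores = {"school uniform": 0.92, "long hair": 0.88,
          "blue eyes": 0.81, "car": 0.05}
print(tags_to_prompt(scores))  # → "school uniform, long hair, blue eyes"
```

Low-confidence noise like "car" is filtered out, so only details actually present in the reference photo steer the colors.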

4. The "Texture Injector" (The Detail Brush)

Sometimes, the AI gets the colors right but the textures look blurry or flat, especially in the background. The researchers added a Plugin Module.
Think of this as a special brush that only paints the background and fine details. It takes the "vibe" and texture from the reference photo and gently glues it onto the background of the sketch, making the whole image look crisp and high-resolution (up to 1280px!), rather than blurry.
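The paper's plugin module presumably operates on features inside the diffusion network, but the "brush that only paints the background" can be illustrated with a minimal masked blend (array shapes, mask convention, and the `strength` knob are all made up for this sketch):

```python
import numpy as np

def inject_background_texture(gen_feat, ref_feat, fg_mask, strength=0.5):
    # gen_feat, ref_feat: (C, H, W) generated / reference feature maps.
    # fg_mask: 1 where the sketch's character (foreground) is, 0 elsewhere.
    # Blend reference texture into background regions only, leaving the
    # character completely untouched.
    bg = 1.0 - fg_mask
    return gen_feat + strength * bg * (ref_feat - gen_feat)
```

Because the blend is gated by the mask, the character's colors stay exactly as generated while the background picks up the reference photo's texture.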

The Result

When you put all these tools together, the result is magic:

  • High Resolution: You can zoom in, and the lines and textures are sharp.
  • No Confusion: The AI never mixes up the sketch's shape with the photo's content.
  • Precise Control: If the reference photo has a specific hat color, the AI puts that exact hat color on your character, without messing up the rest of the drawing.

In a nutshell: This paper teaches an AI to be a professional colorist who knows exactly when to listen to the sketch (for shapes) and when to listen to the reference photo (for colors), without ever getting the two mixed up. The result is beautiful, high-quality, and perfectly controlled digital art.