Style-Aware Gloss Control for Generative Non-Photorealistic Rendering

This paper presents a style-aware gloss control framework for generative non-photorealistic rendering. An unsupervised model, trained on a curated painterly dataset, discovers a disentangled latent space that enables fine-grained manipulation of gloss and artistic style through a lightweight adapter connected to a latent-diffusion model.

Santiago Jimenez-Navarro, Belen Masia, Ana Serrano

Published 2026-02-20

Imagine you are looking at a painting of a shiny red apple. Even though it's just paint on canvas, your brain instantly knows: "That apple is glossy." It knows this even if the artist used thick, chunky brushstrokes (like Van Gogh) or thin, precise ink lines.

This paper is about teaching computers to understand that same magic trick: how to separate the "shininess" of an object from the "artistic style" used to draw it.

Here is the story of how they did it, broken down into simple concepts and analogies.

1. The Problem: The "Style vs. Shine" Mix-Up

Imagine you have a robot artist. You tell it, "Draw a shiny apple in a charcoal style."

  • Old robots often get confused. If you ask for "shiny," they might just draw a different type of charcoal stroke, thinking the stroke pattern is the shininess.
  • The Goal: The researchers wanted a robot that understands that "shininess" (gloss) and "charcoal style" are two different ingredients. They wanted the robot to be able to change the apple from matte to super shiny without changing the fact that it looks like a charcoal drawing.

2. The Solution: A "Layered Cake" of Understanding

To teach the robot, the researchers didn't just show it random pictures. They baked a very specific, controlled "cake" (a dataset) for the robot to study.

  • The Ingredients: They took 3D models of objects (like spheres and bats) and painted them in three different styles: Charcoal, Ink Pen, and Oil Painting.
  • The Variable: For every single object, they painted it at seven different levels of shininess, from dull and dusty to mirror-bright.
  • The Trick: They made sure the "brushstrokes" (the texture of the paint) stayed exactly the same for every level of shininess. This forced the robot to realize: "Wait, the brushstrokes didn't change, but the shine did. Therefore, shine must be a separate thing!"
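The controlled-variation idea above can be sketched as a simple cross-product of the three factors. The object names, style labels, and counts below are illustrative placeholders, not the paper's exact asset list:

```python
from itertools import product

# Hypothetical factor lists: the dataset varies object, style, and gloss level
# while keeping brushstroke texture fixed within each (object, style) pair.
objects = ["sphere", "bat"]
styles = ["charcoal", "ink_pen", "oil_painting"]
gloss_levels = list(range(7))  # seven steps from matte (0) to mirror-bright (6)

def build_dataset():
    """Enumerate every (object, style, gloss) combination.

    Because only the gloss label changes within an (object, style) pair,
    a model trained on this grid is pushed to treat gloss as its own factor.
    """
    return [
        {"object": obj, "style": sty, "gloss": g}
        for obj, sty, g in product(objects, styles, gloss_levels)
    ]

dataset = build_dataset()
print(len(dataset))  # 2 objects x 3 styles x 7 gloss levels = 42 samples
```

Holding everything constant except one factor is the classic recipe for disentanglement: the only signal that explains the label is the factor itself.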

3. The Discovery: The "Magic Control Panel"

They trained a special AI (a Generative Adversarial Network, or GAN) on this dataset. When the AI finished learning, they peeked inside its "brain" (its internal data layers).

They found something amazing: The AI had organized its knowledge like a multi-layered control panel.

  • Layers 1–5 (The Foundation): These layers decided the shape of the object and where the light was coming from.
  • Layer 6 (The Shine Knob): This specific layer was dedicated entirely to gloss. If you tweaked this layer, the object got shinier or duller, but the style stayed the same.
  • Layer 8 (The Style Knob): This layer decided if the object looked like charcoal, ink, or oil paint.
  • Layers 9–15 (The Color Coat): These layers filled in the colors.

The Analogy: Think of the AI's brain like a mixing board at a recording studio. Before this paper, the "Volume" and "Bass" knobs were stuck together. If you turned up the volume, the bass changed too. This paper found a way to separate them, so you can turn up the "Shine" knob without accidentally changing the "Style" knob.
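The per-layer "knob" idea can be sketched as editing one row of a StyleGAN-style extended latent code. The dimensions and the gloss direction below are placeholders, not values from the paper:

```python
import numpy as np

NUM_LAYERS, LATENT_DIM = 16, 512  # typical StyleGAN-like sizes (assumed)
GLOSS_LAYER = 6                   # the "shine knob" layer described above

rng = np.random.default_rng(0)
w_plus = rng.standard_normal((NUM_LAYERS, LATENT_DIM))  # one code per layer
gloss_direction = rng.standard_normal(LATENT_DIM)       # hypothetical direction
gloss_direction /= np.linalg.norm(gloss_direction)

def set_gloss(w, strength):
    """Return a copy of the latent with only the gloss layer shifted.

    All other layers (shape, lighting, style, color) are untouched,
    so the edit changes shininess without changing style.
    """
    edited = w.copy()
    edited[GLOSS_LAYER] += strength * gloss_direction
    return edited

shinier = set_gloss(w_plus, 3.0)
# Only row 6 differs from the original latent:
changed = np.where((shinier != w_plus).any(axis=1))[0]
print(changed)  # [6]
```

This is the mixing-board picture in code: each row of `w_plus` is one knob, and turning the gloss knob leaves every other row bit-for-bit identical.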

4. The Application: The "Smart Adapter"

Knowing where the "Shine" and "Style" knobs were located in the AI's brain was great, but they wanted to use it for something practical.

They built a lightweight adapter (a small, smart bridge) that connects this specialized "Shine/Style" brain to a modern, powerful image generator (called a Diffusion Model, similar to DALL-E or Midjourney).

How it works in real life:

  1. You give the computer a text prompt: "A blue clay bat."
  2. You give it a reference image: "Make it look like a charcoal drawing."
  3. You use a slider to control the Gloss: "Make it matte," or "Make it super glossy."

The computer uses the "Shine Knob" from the researchers' specialized brain to adjust the image perfectly, keeping the charcoal style intact while changing the shine.
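A toy sketch of how a lightweight adapter might turn a style reference and a gloss slider into extra conditioning for the diffusion model. Everything here is assumed for illustration: a real adapter has trained weights and plugs into the diffusion model's cross-attention, whereas this only shows the data flow.

```python
import numpy as np

EMBED_DIM = 64  # toy conditioning width (assumed)

rng = np.random.default_rng(1)
# Frozen random projections standing in for the adapter's small learned layers.
W_style = rng.standard_normal((EMBED_DIM, EMBED_DIM)) * 0.1
w_gloss = rng.standard_normal(EMBED_DIM) * 0.1

def adapter(style_embedding, gloss):
    """Map a style-reference embedding plus a gloss slider in [0, 1]
    to a conditioning vector for the image generator.

    Style and gloss enter as separate signals and combine additively,
    mirroring the idea that they are independent knobs.
    """
    return W_style @ style_embedding + gloss * w_gloss

style_ref = rng.standard_normal(EMBED_DIM)  # e.g. an encoded charcoal reference
matte = adapter(style_ref, gloss=0.0)
glossy = adapter(style_ref, gloss=1.0)
# Moving the slider changes only the gloss term of the conditioning:
print(np.allclose(glossy - matte, w_gloss))  # True
```

The additive split is the design point: because the gloss term never touches the style projection, sliding gloss from 0 to 1 cannot drag the style along with it.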

5. Why This Matters

  • For Artists: It gives them a new digital tool to paint with. They can create a scene and then decide, "I want this character to look wet and shiny, but keep the rest of the painting dry and matte," without having to repaint the whole thing.
  • For Science: It proves that computers can learn to see the world the way humans do. Just like humans can tell a shiny apple from a matte one even in a sketch, this AI learned to separate those concepts on its own, without being explicitly told "this is gloss."

The Bottom Line

The researchers built a "translator" that understands the difference between how something looks (the artistic style) and what it feels like (the material properties like shininess). They found the exact "switches" in the AI's brain that control these features, allowing us to create beautiful, controllable, non-photorealistic art with a level of precision that wasn't possible before.
