Imagine you want to teach a robot how to paint like a human master. You don't want the robot to just copy a photo pixel-by-pixel; you want it to understand the soul of painting: the messy, textured, imperfect swipe of a brush that leaves a unique mark on the canvas.
The problem? Most AI models are like students who only learn from massive libraries of textbooks (millions of images). But real brushstrokes are rare. There aren't millions of "brushstroke" photos floating around the internet. You have to go to an artist's studio and scan a few hundred physical strokes.
This paper introduces StrokeDiff, a new way to teach an AI to paint with just a tiny handful of examples (about 470 strokes), without the AI getting confused or "hallucinating" weird blobs.
Here is the breakdown using some everyday analogies:
1. The Problem: The "Empty Classroom"
Usually, AI learns by looking at millions of pictures. If you only give it 470 examples of brushstrokes, it gets lost. It's like handing a student 470 pages of a single textbook and asking them to write a novel. They might start making things up that look nothing like the real thing, or they might just copy the same few pages over and over (this is called "mode collapse").
2. The Solution: The "Ghost Teacher" (Smooth Regularization)
The authors came up with a clever trick called Smooth Regularization (SmR).
Imagine you are trying to teach a student to draw a specific type of leaf. You show them one leaf. But to help them understand the shape without forcing them to memorize every single vein, you occasionally whisper, "Hey, remember that other leaf we saw yesterday? It had a similar curve."
In the AI's training, the system randomly grabs a different brushstroke from the dataset and mixes it into the learning process as a "hint."
- The Magic: This hint is like a ghost teacher. It nudges the AI in the right direction so it doesn't get lost in the noise.
- The Payoff: Once the AI is trained, the ghost teacher disappears. When you actually ask the AI to paint a picture, it doesn't need those extra hints; it just uses what it learned. This makes the system very efficient and easy to use.
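To make the "ghost teacher" idea concrete, here is a toy sketch in plain numpy. It is an illustrative assumption, not the paper's actual code: the dataset, the 30% blending weight, and the function names (`training_input`, `inference_input`) are all made up for the example. The point it shows is the asymmetry: a randomly drawn stroke is blended into the noisy training input as a hint, while at painting time the model starts from pure noise with no hint at all.

```python
import numpy as np

rng = np.random.default_rng(0)
dataset = rng.random((470, 64, 64))  # stand-in for ~470 scanned stroke images

def training_input(stroke, noise_level, mix=0.3):
    """Noisy version of a stroke, blended with a faint 'ghost' of another
    randomly chosen stroke from the same small dataset (the hint)."""
    ghost = dataset[rng.integers(len(dataset))]
    noisy = stroke + noise_level * rng.standard_normal(stroke.shape)
    return (1 - mix) * noisy + mix * ghost

def inference_input(noise_level, shape=(64, 64)):
    """At painting time there is no ghost teacher: start from pure noise."""
    return noise_level * rng.standard_normal(shape)

x_train = training_input(dataset[0], noise_level=0.5)   # hint mixed in
x_sample = inference_input(noise_level=0.5)             # no hint needed
```

Because the hint only ever touches the training inputs, the trained model carries no extra machinery around at inference, which is why the authors can describe it as essentially free to use.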
3. The Control: The "Bezier Blueprint"
Once the AI knows how to make a beautiful, messy brushstroke, how do you tell it where to put it?
The authors added a Bézier-based conditioning module. Think of this as giving the AI a stencil or a blueprint.
- Instead of just saying "paint a red blob," you can say, "Draw a curved line here, with these specific control points, and make it thick here and thin there."
- This turns the AI from a random artist into a controlled craftsman. You can dictate the shape and placement, and the AI fills it in with the realistic, textured paint it learned to create.
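A "Bézier blueprint" is easier to picture with a few lines of code. This sketch only shows what such a blueprint specifies, not how the paper's conditioning module consumes it: the curve gives the stroke's spine, and a separate thickness profile says where it is fat or thin. The four control points and the width values are arbitrary example numbers.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=101):
    """Sample n points along a cubic Bézier curve (the stroke's spine)."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Four control points: where the stroke starts, how it bends, where it ends.
spine = cubic_bezier(np.array([0.0, 0.0]), np.array([1.0, 2.0]),
                     np.array([3.0, 2.0]), np.array([4.0, 0.0]))

# A thickness profile along the curve: thin at the tips, thick in the middle.
width = np.interp(np.linspace(0, 1, len(spine)), [0, 0.5, 1], [0.2, 1.0, 0.3])
```

Feeding the model a spine plus a width profile like this is the "stencil": the shape and placement are dictated by you, and the model's job is only to fill that shape with realistic paint texture.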
4. The Painting Process: The "Layer Cake"
Real oil painting isn't just one flat layer; it's layers of paint on top of layers. If you paint the background after the foreground, it looks wrong.
The paper also builds a ranking system that acts like a conductor for an orchestra, deciding who plays when.
- Before the AI paints the whole picture, it figures out the order of operations.
- It decides: "First, paint the big background strokes. Then, paint the middle ground. Finally, add the tiny details on top."
- This prevents the AI from painting a tree behind a house that should be in front of it, ensuring the final image looks like a coherent, layered painting rather than a messy pile of pixels.
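The layering idea can be sketched in a few lines. Sorting strokes by their covered area, as done here, is an illustrative stand-in for the paper's ranking system, and the stroke list is invented for the example; the principle is the same: paint big background strokes first, so that later, smaller strokes land on top.

```python
# Each stroke records roughly how much canvas it covers (in pixels).
strokes = [
    {"name": "leaf detail", "area": 12},
    {"name": "sky wash",    "area": 5000},
    {"name": "tree trunk",  "area": 300},
    {"name": "house wall",  "area": 1200},
]

# Largest strokes go down first; later strokes overwrite earlier ones,
# so fine details end up on top of the background, not buried under it.
paint_order = sorted(strokes, key=lambda s: s["area"], reverse=True)
print([s["name"] for s in paint_order])
```

With this ordering, the "tree behind the house" problem disappears automatically: anything painted later in the sequence sits in front.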
The Result: From "Digital Photo" to "Oil Painting"
When they tested this, the results were impressive:
- Texture: Unlike other methods that look like smooth, plastic digital art, StrokeDiff produces strokes that look like real oil paint—thick, textured, and irregular.
- Variety: Even with only 470 training examples, the AI could generate thousands of unique, non-repeating strokes.
- Human Approval: When real humans looked at the paintings, they rated them higher for "style" and "texture" than other AI methods, saying they felt more like a human artist made them.
In a Nutshell
This paper is about teaching a computer to be a master painter's apprentice using a tiny library of samples. They did it by:
- Giving the AI a "ghost teacher" to keep it on track during learning.
- Giving it a "blueprint" to control exactly where the brush goes.
- Teaching it the "rules of layering" so the final painting makes sense.
The end result is a system that can turn a photograph into a textured, expressive oil painting that feels alive, all while needing very little data to get there.