Structure-Level Disentangled Diffusion for Few-Shot Chinese Font Generation

This paper proposes SLD-Font, a structure-level disentangled diffusion model that utilizes SimSun content templates and CLIP-based style integration to achieve high-fidelity few-shot Chinese font generation while preventing content-style re-entanglement through a parameter-efficient fine-tuning strategy.

Jie Li, Suorong Yang, Jian Zhao, Furao Shen

Published 2026-02-24

Imagine you want to create a new font for a brand, but you only have a few photos of a handwritten letter "A" as a reference. You want the computer to take the shapes of every other letter in the alphabet and render them in that specific handwritten style.

This is the challenge of Few-Shot Chinese Font Generation. It's incredibly hard because Chinese characters are complex (like intricate puzzles with thousands of pieces), and if the computer gets even a single stroke slightly wrong, the character might look like a different character entirely.

Here is how the authors of this paper, SLD-Font, solved this problem, explained through simple analogies.

1. The Problem: The "Confused Chef"

Imagine a chef trying to cook a dish in a specific style (say, "Spicy Sichuan") using a recipe book.

  • Old Methods: The chef would look at the recipe (content) and the spicy sauce (style) and mix them together in one big bowl before cooking. The problem? The flavors get muddled. Sometimes the chef forgets the recipe and just makes a spicy mess (content distortion). Other times, the dish tastes bland because the spice got lost (poor style transfer).
  • The Result: The old AI models were like this confused chef. They could separate "content" and "style" only at a high level, but during the actual creation, the two got tangled up again, ruining the final product.

2. The Solution: The "Specialized Assembly Line" (Structure-Level Disentanglement)

The authors propose a new factory called SLD-Font. Instead of mixing everything in one bowl, they set up a strict assembly line with two separate tracks that never cross until the very end.

  • Track A (The Blueprint): This track holds the Content. They use a standard, clear font (like "SimSun") as the blueprint. This ensures the skeleton of the character is perfect. The AI treats this as the "truth" of what the character is.
  • Track B (The Paintbrush): This track holds the Style. They use a smart camera (CLIP) to look at your few reference photos and extract the "vibe" (thickness of lines, curves, texture).
  • The Magic: The AI builds the character using the Blueprint from Track A, but it paints it using the instructions from Track B. Because the two tracks are separate, the AI never gets confused about what it is drawing versus how it should look. It's like having a master architect (Content) and a master decorator (Style) working together without stepping on each other's toes.
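The two-track idea above can be sketched in a few lines of toy code. This is not the paper's implementation: the encoders, projection matrices, and the denoising step are all illustrative assumptions. The point it demonstrates is that the content signal (Track A) and the style signal (Track B) flow through separate projections and only meet inside the noise prediction, so neither can overwrite the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Track A: a "skeleton" feature from a clean template glyph (e.g. SimSun).
content = rng.normal(size=(16,))

# Track B: style embeddings from a few reference samples (CLIP-like),
# averaged into one style vector.
style_refs = rng.normal(size=(5, 8))
style = style_refs.mean(axis=0)

# Hypothetical, fixed projections; in a real model these are learned.
w_content = rng.normal(size=(16, 16)) * 0.01  # content -> feature space
w_style = rng.normal(size=(8, 16)) * 0.01     # style  -> feature space

def denoise_step(x, content, style, t):
    """One toy diffusion step: the predicted noise is conditioned on
    content and style through *separate* projections, so the two
    signals never mix before the prediction head."""
    eps_hat = 0.1 * x + content @ w_content + style @ w_style
    return x - t * eps_hat

# Run a few steps of the toy reverse process.
x = rng.normal(size=(16,))
for t in (0.3, 0.2, 0.1):
    x = denoise_step(x, content, style, t)
```

Swapping `style` for a different reference set changes how the output is "painted" without touching the `content` path, which is the separation the assembly-line analogy describes.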

3. The Cleanup Crew: Removing "Static" (BNR Module)

When computers generate images using a "latent space" (a compressed, abstract version of an image), it's like looking at a photo through a slightly foggy window. You might see the character, but there's also some weird gray fuzz or "static" around the edges, especially in complex areas.

  • The Analogy: Imagine you drew a perfect picture, but the printer left a little bit of dust on the white paper.
  • The Fix: The authors added a Background Noise Removal (BNR) module. Think of this as a digital eraser or a high-tech vacuum cleaner. It looks at the final image, sees the "dust" (noise) in the white spaces, and wipes it away, leaving you with a crisp, clean character.
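A minimal version of that "digital eraser" can be written as a threshold pass over the decoded image. The cutoff value below is an assumed hyperparameter, not the paper's exact BNR design, but it captures the idea: pixels that are almost white are snapped to pure white, while dark stroke pixels are left alone.

```python
import numpy as np

def remove_background_noise(img, threshold=0.9):
    """Toy background cleanup: any pixel brighter than `threshold`
    (i.e. almost white) is snapped to pure white, clearing the grey
    'dust' that latent-space decoding can leave in empty regions.
    `threshold` is an illustrative assumption."""
    cleaned = img.copy()
    cleaned[cleaned > threshold] = 1.0
    return cleaned

# 2x2 toy glyph: two near-white "dusty" pixels, one dark stroke pixel.
glyph = np.array([[0.95, 0.10],
                  [0.98, 1.00]])
cleaned = remove_background_noise(glyph)
# the 0.95 and 0.98 pixels become 1.0; the 0.10 stroke is untouched
```

A real module would be learned rather than a fixed threshold, but the input-output contract is the same: same image in, same strokes out, cleaner background.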

4. The Smart Tuner: Learning Without Over-Training (PEFT)

Usually, to teach a computer a new style, you have to retrain the whole brain. But if you only have 5 photos, the computer might memorize those 5 photos too well and forget how to draw new things. It's like a student who memorizes the answers to a practice test but fails the real exam because they didn't learn the concepts.

  • The Strategy: The authors used Parameter-Efficient Fine-Tuning (PEFT).
  • The Analogy: Instead of rewriting the entire textbook (retraining the whole model), they just put a few sticky notes on the "Style" chapters. They tell the computer: "You already know how to draw letters perfectly. Just change the way you color them."
  • The Result: The AI learns the new style quickly without forgetting how to draw the letters correctly. It adapts to the new "vibe" without overfitting (memorizing) the specific examples.
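The "sticky notes" strategy can be made concrete with a LoRA-style low-rank adapter, which is one common form of PEFT; the paper says only that it uses parameter-efficient fine-tuning, so treat this as an assumed instance. The base weight stays frozen, and only a tiny low-rank correction is trained.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen base weight: already knows how to draw character structure.
W_base = rng.normal(size=(64, 64))

# LoRA-style adapter: the ONLY trainable parameters (the sticky notes).
rank = 4
A = rng.normal(size=(64, rank)) * 0.01
B = np.zeros((rank, 64))  # zero init, so the adapter starts as a no-op

def forward(x):
    # Base behaviour plus a small learned correction for the new style.
    return x @ (W_base + A @ B)

trainable = A.size + B.size
total = W_base.size + trainable
print(f"trainable fraction: {trainable / total:.1%}")
# → trainable fraction: 11.1%
```

Because so few parameters move, five reference images are enough to shift the style without letting the model "memorize the practice test" and forget how characters are built.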

5. The New Scorecard: "Grey" and "OCR"

How do you know if the font is good?

  • Old way: Did the colors look nice? (Visual similarity).
  • New way (SLD-Font):
    1. The "Grey" Test: They check the white space. Is it pure white, or is it gray with noise? A high score means a clean, professional look.
    2. The "OCR" Test: They feed the generated character into a robot reader (OCR). If the robot can read it correctly, the character is structurally sound. If the robot gets confused, the character is broken.

Summary

SLD-Font is like a master craftsman who:

  1. Keeps the structure (the skeleton of the letter) separate from the style (the paint).
  2. Uses a vacuum cleaner to remove any digital dust.
  3. Uses sticky notes to learn new styles quickly without forgetting the basics.

The result? A computer that can take a few scribbles and turn them into a full, professional, and perfectly readable Chinese font, ready for use in books, apps, or branding.
