Imagine you have a favorite family photo, but someone has taken a giant, jagged chunk out of the middle of it—maybe covering the person's eyes and nose. Your goal is to fill that missing hole with new pixels so the face looks whole and natural again. This is called Image Inpainting.
For a long time, computers were terrible at this. If you asked them to fill in the hole, they often gave you a blurry smear, or they drew an eye where a mouth should be, or the colors didn't match the rest of the photo. It was like trying to fix a broken vase with melted wax.
This paper introduces a new, smarter way to fix these photos using a system they call a "Semantic-Guided Two-Stage GAN." That's a mouthful, so let's break it down into a simple story using a Master Architect and a Master Painter analogy.
The Problem: Why Old Methods Failed
Previous computer programs tried to guess the missing pixels directly, like a student guessing answers on a test without studying. They looked at the pixels around the hole and tried to guess what color should go next.
- The Result: They often got the "big picture" wrong (like putting an eye on a cheek) or the details were fuzzy (like a watercolor painting left in the rain).
The Solution: The Two-Stage Team
The authors built a two-step system that separates planning from painting.
Stage 1: The Master Architect (Semantic Layout Generation)
Before painting a single stroke, you need a blueprint.
- What it does: This stage looks at the broken photo and asks, "What should be in this hole?" It doesn't worry about skin texture or hair strands yet. It just figures out the structure: "Okay, the left side of the nose is here, so the right side must be there. The left eye is visible, so the right eye goes here."
- The Secret Sauce (Hybrid Encoding): To do this, the Architect uses two different "brains" working together:
- The CNN Brain: Good at seeing small, local details (like the curve of a lip).
- The Transformer Brain: Good at seeing the big picture and long-distance connections (like how the left eye relates to the right ear).
- Analogy: Imagine the CNN is a bricklayer who knows how to lay a single brick perfectly, and the Transformer is an architect who knows how the whole building stands up. By combining them, the Architect creates a probabilistic map of the face's structure: a set of likely layouts rather than a single rigid guess.
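The bricklayer-and-architect split can be sketched in a few lines of toy numpy. This is not the paper's actual network, just an illustration of the idea: a convolution mixes only a pixel's 3x3 neighborhood (local), while self-attention lets every pixel look at every other pixel (global), and the "hybrid" feature simply stacks the two views.

```python
import numpy as np

def local_conv(feat, kernel):
    """CNN-style local pass: each output pixel sees only its 3x3 neighborhood."""
    h, w = feat.shape
    padded = np.pad(feat, 1)
    out = np.zeros_like(feat)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def global_attention(feat):
    """Transformer-style global pass: every pixel attends to every other pixel."""
    tokens = feat.reshape(-1, 1)                   # flatten pixels into "tokens"
    scores = tokens @ tokens.T                     # pairwise similarity (Q = K here)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
    return (weights @ tokens).reshape(feat.shape)  # mix information image-wide

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))

# Hybrid encoding: stack the local (CNN) and global (Transformer) views
hybrid = np.stack([local_conv(feat, kernel), global_attention(feat)])
print(hybrid.shape)  # (2, 8, 8)
```

The real model fuses the two branches with learned layers, of course; the point here is only that the two "brains" see the image at very different ranges.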
Stage 2: The Master Painter (Texture Synthesis)
Now that we have the blueprint, we need to paint it.
- What it does: This stage takes the "blueprint" from Stage 1 and fills it with realistic skin, hair, and shadows.
- The Secret Sauce (Multi-Modal Texture): Instead of just copying pixels from the known parts of the photo, this painter looks at the blueprint and pulls in information from different scales. It ensures the skin texture matches the surrounding area perfectly and that the lighting is consistent.
- The "Magic" Touch: To make the results look natural and not robotic, the painter adds a tiny bit of "creative chaos" (random noise). This means if you run the same photo through the system twice, you might get two slightly different, but equally realistic, versions of the missing face. It mimics how a human artist might make slightly different choices each time they sketch.
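The "creative chaos" idea is easy to demonstrate with a toy painter. This is a hypothetical stand-in for the paper's stochastic texture stage: the blueprint fixes the structure, and a small noise term perturbs only the fine detail, so two runs on the same input give two slightly different fills.

```python
import numpy as np

def paint_textures(blueprint, noise):
    """Toy 'painter': structure comes from the Stage-1 blueprint,
    while a small noise term varies the fine texture only."""
    base = blueprint * 0.9        # layout dictated by the semantic map
    return base + 0.05 * noise    # 'creative chaos' on top

rng = np.random.default_rng(42)
blueprint = np.ones((4, 4))       # pretend Stage 1 said "this region is skin"

# Same blueprint, two noise draws -> two different but equally valid fills
fill_a = paint_textures(blueprint, rng.standard_normal((4, 4)))
fill_b = paint_textures(blueprint, rng.standard_normal((4, 4)))

print(np.allclose(fill_a, fill_b))  # False: the two versions differ slightly
```

Because the noise is scaled down before it is added, both outputs stay close to the blueprint; only the sketch-to-sketch variation a human artist would show survives.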
How They Trained It (The "School" System)
Training a computer to do this is hard because it can easily get confused. The authors used a special training schedule:
- Phase 1 (The Sketch): They taught the system to just get the colors and shapes roughly right.
- Phase 2 (The Details): They slowly introduced stricter rules, forcing the system to pay attention to the "blueprint" and the texture details.
- Phase 3 (The Polish): They let the system refine everything, ensuring the edges blend smoothly so you can't tell where the original photo ends and the new part begins.
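One common way to implement a schedule like this is to ramp loss weights phase by phase. The weights and loss names below are hypothetical (the paper's actual schedule will differ); the sketch only shows the mechanism of gradually introducing stricter rules.

```python
def loss_weights(phase):
    """Hypothetical curriculum: which loss terms count in each phase."""
    schedule = {
        "sketch":  {"reconstruction": 1.0, "semantic": 0.0, "texture": 0.0},
        "details": {"reconstruction": 1.0, "semantic": 0.5, "texture": 0.5},
        "polish":  {"reconstruction": 1.0, "semantic": 1.0, "texture": 1.0},
    }
    return schedule[phase]

def total_loss(losses, phase):
    """Weighted sum of the raw loss terms for the current phase."""
    return sum(w * losses[name] for name, w in loss_weights(phase).items())

losses = {"reconstruction": 2.0, "semantic": 4.0, "texture": 1.0}
print(total_loss(losses, "sketch"))  # 2.0 -- only rough colors/shapes matter
print(total_loss(losses, "polish"))  # 7.0 -- every rule fully enforced
```

Early on, the system is only graded on the rough reconstruction; by the final phase, the blueprint and texture terms carry full weight.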
They also used a "Judge" (called a Discriminator) that constantly critiqued the work, asking, "Does this look like a real human face, or does it look like a fake drawing?" The system kept improving until the Judge could no longer reliably tell the generated faces from real ones.
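The Judge-versus-Painter game has a standard mathematical form. As a stand-in for the paper's exact objective (which may use a different GAN loss), here is the classic non-saturating formulation: the Judge is rewarded for scoring real faces high and fakes low, while the Painter is rewarded for fakes the Judge scores high.

```python
import numpy as np

def gan_losses(real_scores, fake_scores):
    """Classic non-saturating GAN losses over the Judge's raw logit scores."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Judge (discriminator): push real scores high, fake scores low
    d_loss = -np.mean(np.log(sigmoid(real_scores))
                      + np.log(1.0 - sigmoid(fake_scores)))
    # Painter (generator): fool the Judge into scoring fakes high
    g_loss = -np.mean(np.log(sigmoid(fake_scores)))
    return d_loss, g_loss

# A confident, correct Judge: its own loss is low, the Painter's is high
d, g = gan_losses(np.array([3.0]), np.array([-3.0]))
print(d < g)  # True
```

Training alternates between the two: the Painter's loss drops as its fakes improve, which drives the Judge to sharpen its critique, and so on until neither can easily win.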
The Results
When they tested this on famous face datasets (CelebA-HQ and FFHQ), the results were impressive:
- Sharper: The images weren't blurry.
- Smarter: The eyes and mouths were in the right places.
- Smoother: The edges where the new pixels met the old pixels were invisible.
The Bottom Line
Think of this paper as teaching a computer to think before it acts. Instead of blindly guessing pixels, the computer first draws a mental map of the face (Stage 1) and then paints the details based on that map (Stage 2). By using a team of specialized "brains" (CNNs and Transformers) and a strict training routine, they managed to fix broken faces with a level of realism that previous methods couldn't achieve.
In short: It's like giving the computer a blueprint and a set of high-quality paints, rather than just telling it to "guess what goes in the hole."