Few-Shot Generative Model Adaption via Identity Injection and Preservation

The Big Problem: The "Amnesia" Artist

Imagine you have a world-famous painter (let's call him Master G). Master G is incredible at painting realistic portraits of people. He has spent years studying thousands of faces, so he knows exactly how to draw a nose, an eye, or a smile.

Now, you want Master G to learn a new style: Van Gogh's swirling, colorful style. But there's a catch: you only have 10 paintings by Van Gogh to show him.

If you just tell Master G, "Paint like this," two bad things happen:

- He forgets who he is: He tries so hard to copy the 10 Van Gogh paintings that he stops drawing realistic faces. The eyes look weird, the noses are wrong, and the "Master G" identity is lost.
- He gets stuck: Because he only has 10 examples, he might just copy those 10 paintings over and over again, producing boring, repetitive art (this is called "mode collapse").

The Goal: We want Master G to paint in the style of Van Gogh while keeping the face of the person he is painting recognizable as the same person.

The Solution: The "I2P" System

The authors of this paper created a method called I2P (Identity Injection and Preservation). Think of it as a special training camp for Master G that uses two main tricks to solve the problem.

Trick 1: Identity Injection (The "Memory Implant")

The Analogy: Imagine Master G is about to start a new painting, but he's nervous he'll forget what a human face looks like. Before he picks up the brush, you give him a "memory chip" containing the essence of a human face.

How it works in the paper:

1. The computer takes the "blueprint" (latent features) of a face from the original Master G.
2. It mixes this blueprint with the new "Van Gogh style" instructions.
3. It injects this mix back into the painter's brain.
Result: Even while learning the new style, the painter never forgets the core structure of a human face. He knows what to draw, even if he's learning how to draw it differently.

Trick 2: Identity Substitution & Preservation (The "Style vs. Content" Detangler)

The Analogy: Imagine you have a smoothie. It's a mix of Strawberry (Style) and Banana (Identity/Content).

Old methods tried to make a new smoothie by just blending everything together. Sometimes the banana flavor got lost, or the strawberry taste became too strong.
I2P uses a special machine that separates the smoothie back into pure Strawberry juice and pure Banana chunks.
- Step A: It takes the Banana chunks (the person's face) from the original Master G.
- Step B: It takes the Strawberry juice (the Van Gogh style) from the few examples you have.
- Step C: It mixes them back together perfectly.

The "Safety Net" (Consistency Constraints):
To make sure the machine doesn't mess up the mix, I2P uses three "safety checks" (Loss Functions):

- Content Check: "Is the face still a face?" (Ensures the banana chunks are still there).
- Style Check: "Does it look like Van Gogh?" (Ensures the strawberry flavor is strong).
- Reconstruction Check: "If we take the style and content apart and put them back together, do we get the same picture?" This ensures the two parts fit together perfectly without creating a monster.

Why is this a Big Deal?

In the past, if you tried to teach an AI a new style with only 10 pictures, the result was usually a disaster. The AI would either:

- Overfit: Copy the 10 pictures exactly, losing all creativity.
- Forget: Lose the original subject's identity (e.g., the person's face would turn into a blob).

I2P fixes this by:

- Injecting the original identity so it can't be forgotten.
- Separating style from content so they don't get confused.
- Checking the work constantly to ensure the face looks right and the style looks new.

The Results

The paper tested this on several scenarios, for example:

- Turning photos of people into sketches.
- Turning photos of people into babies.
- Turning photos of cats into Impressionist paintings.

In the reported tests, I2P produced images that took on the new style while keeping the original subject recognizable. It outperformed the other methods it was compared against, even when the AI only had 5 or 10 examples to learn from.

Summary

Think of I2P as a super-tutor for an AI artist. Instead of just saying "Copy this," the tutor says:

"Here is the face you need to draw (Identity Injection). Now, let's separate the face from the paintbrush strokes (Decoupling). Paint the face using the new brushstrokes, but make sure the face doesn't change (Consistency). And if you mess up, we'll check the math to fix it."

This allows AI to learn new styles quickly without losing its memory of what it was originally good at.

1. Problem Statement

Context: Generative models (like GANs) typically require large-scale, high-quality datasets to train effectively. However, in many real-world scenarios, data is scarce.
Challenge: Adapting a pre-trained source generative model to a target domain with very few samples (e.g., 10 or fewer images), a task known as Few-Shot Generative Model Adaptation, presents severe challenges:

- Overfitting: The model memorizes the few training samples, leading to "training set artifact replication."
- Mode Collapse: The model loses diversity, generating repetitive images.
- Identity Degradation: Existing methods often fail to preserve the identity knowledge (structural and semantic features) of the source domain while attempting to adopt the target domain's style. This results in distorted or unrecognizable generated images.
The Core Conflict: There is a fundamental tension between style adaptation (learning the new domain) and identity preservation (keeping the source structure). Current methods, which rely on kernel modulation or regularization, struggle to balance the two and often sacrifice one for the other.

2. Methodology: Identity Injection and Preservation (I2P)

The authors propose I2P, a framework designed to inject source identity knowledge into the target domain's latent space and enforce consistency through decoupling and reconstruction. The framework consists of three core components:

A. Identity Injection Module

Goal: To embed source domain identity knowledge directly into the target domain's latent space ( $W^+$ ) before generation begins.
Mechanism: Inspired by Adaptive Instance Normalization (AdaIN), this module aligns the content features of the source latent vector ( $w^S$ ) with the style features of the target latent vector ( $w^T$ ).
Process:
1. Extract latent features from both source and target generators.
2. Inject source identity features into the target latent vector using a weighted blending formula:
  $w'^T = (1-\alpha) \cdot w^T + \alpha \cdot \left[ \frac{\sigma(w^S)(w^T - \mu(w^T))}{\sigma(w^T)} + \mu(w^S) \right]$
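  where $\alpha$ is a hyperparameter controlling injection depth ($\alpha = 0$ leaves $w^T$ unchanged; $\alpha = 1$ keeps only the AdaIN term), and $\mu(\cdot)$ and $\sigma(\cdot)$ denote the mean and standard deviation of a latent vector.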
3. The resulting enriched latent vector guides the target generator to retain source identity while learning target styles.
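
The blending formula in step 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it treats each latent as a flat list of floats, computes one mean and standard deviation over the whole vector, and adds a small `eps` to avoid division by zero. The function name `inject_identity`, the default `alpha`, the `eps` term, and the choice of axis for the statistics are all assumptions made for this sketch.

```python
from statistics import fmean, pstdev


def inject_identity(w_t, w_s, alpha=0.5, eps=1e-8):
    """Blend the target latent w_t with an AdaIN-transformed copy of itself.

    Mirrors the formula as written in the text: w_t is normalised with its
    own mean/std, then rescaled and shifted with the statistics of w_s.
    Statistics over the whole vector and the eps term are assumptions.
    """
    mu_t, sigma_t = fmean(w_t), pstdev(w_t)
    mu_s, sigma_s = fmean(w_s), pstdev(w_s)
    # AdaIN term: sigma(w_s) * (w_t - mu(w_t)) / sigma(w_t) + mu(w_s)
    adain_term = [sigma_s * (x - mu_t) / (sigma_t + eps) + mu_s for x in w_t]
    # Convex blend between the original target latent and the AdaIN term
    return [(1 - alpha) * x + alpha * y for x, y in zip(w_t, adain_term)]


# Toy latents (hypothetical values, far smaller than a real W+ code)
w_target = [0.0, 1.0, 2.0, 3.0]
w_source = [10.0, 12.0, 14.0, 16.0]
print(inject_identity(w_target, w_source, alpha=0.3))
```

With `alpha=0` the function returns `w_t` unchanged, and with `alpha=1` the output carries the mean and standard deviation of `w_s`, which is a quick way to sanity-check the blend.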

B. Identity Substitution Module

This module decouples image features to allow for precise control over style and content.

Style-Content Decoupler:
- Uses a CLIP image encoder to extract deep features from source images, target images, and raw training samples.
- A lightweight network (Conv + Linear layers) disentangles these features into Style features ( $S$ ) and Content features ( $C$ ).
- Ensures $S$ and $C$ are linearly independent.
Reconstruction Modulator:
- Reconstructs new deep features by fusing style and content features from different domains using AdaIN.
- This creates "synthesis features" ( $M$ ) that represent a controlled combination of source identity and target style.
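
As a rough sketch of the Reconstruction Modulator, the snippet below fuses a content vector with a style vector using standard AdaIN and builds three cross-domain combinations. It assumes the decoupler has already produced flat content and style vectors (the CLIP encoder and the Conv + Linear network are not modelled here). The dictionary keys read "CS" as source content, "CT" as target content, "ST" as target style, and "SR" as raw-sample style; that reading of the notation, along with the function names, is an assumption.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-8):
    """Standard AdaIN: normalise `content`, then apply the mean/std of `style`."""
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    return [sigma_s * (c - mu_c) / (sigma_c + eps) + mu_s for c in content]


def synthesis_features(c_src, c_tgt, s_tgt, s_raw):
    """Build synthesis features M for three content/style pairings.

    The pairing names follow an assumed reading of the superscripts
    (content domain first, style domain second).
    """
    return {
        "CSSR": adain(c_src, s_raw),  # source content + raw-sample style
        "CTSR": adain(c_tgt, s_raw),  # target content + raw-sample style
        "CSST": adain(c_src, s_tgt),  # source content + target style
    }


# Hypothetical toy features standing in for decoupler outputs
m = synthesis_features(
    c_src=[0.0, 1.0, 2.0],
    c_tgt=[0.5, 1.5, 1.0],
    s_tgt=[4.0, 6.0, 5.0],
    s_raw=[5.0, 5.0, 8.0],
)
print(m["CSSR"])
```

Each fused vector keeps the relative ordering of its content input while taking on the mean and spread of its style input, which is the sense in which AdaIN separates "what" from "how".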

C. Identity Consistency Constraints

To prevent identity drift during training, the authors enforce three specific loss functions based on the distributions of the features extracted above:

Content Constraint ( $L_c$ ): Aligns the content distribution of the source domain ( $P_{CS}$ ) with the target domain ( $P_{CT}$ ) using Smooth-L1 loss. This ensures structural identity is preserved.
Style Constraint ( $L_s$ ): Aligns the style distribution of the target domain ( $P_{ST}$ ) with the raw training set ( $P_{SR}$ ) using Smooth-L1 loss. This ensures the target style is learned.
Synthesis Constraint ( $L_r$ ): A novel constraint that ensures the reconstructed synthesis features ( $M$ ) maintain the integrity of the deep feature distributions. Instead of direct numerical alignment, it uses Cosine Similarity to align the spatial directionality of the synthesized distributions ( $P_{M}^{CSSR}, P_{M}^{CTSR}, P_{M}^{CSST}$ ). This prevents the entanglement of style and content.

Total Loss:
$L_{total} = L_{adv} + \lambda \cdot (L_c + L_s + L_r)$
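
The building blocks of these constraints can be sketched without any framework. The snippet below gives a Smooth-L1 loss (as used for $L_c$ and $L_s$), a cosine-based loss (the ingredient of $L_r$), and the weighted total. It is a sketch under assumptions: features are flat lists, the `beta` threshold and mean reduction follow the common Smooth-L1 definition, and writing the cosine term as one minus the similarity is an illustrative choice. Which pairs of synthesis features enter $L_r$ is not spelled out in this summary, so `cosine_loss` simply takes two vectors.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth-L1: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


def cosine_loss(a, b, eps=1e-8):
    """1 - cosine similarity: 0 when directions match, 1 when orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + eps)


def total_loss(l_adv, l_c, l_s, l_r, lam=1.0):
    """L_total = L_adv + lambda * (L_c + L_s + L_r)."""
    return l_adv + lam * (l_c + l_s + l_r)


# Hypothetical feature vectors, for illustration only
l_c = smooth_l1([0.2, 0.4], [0.1, 0.5])
l_s = smooth_l1([1.0, 3.0], [1.5, 0.5])
l_r = cosine_loss([1.0, 2.0], [2.0, 4.1])
print(total_loss(l_adv=0.7, l_c=l_c, l_s=l_s, l_r=l_r, lam=0.5))
```

Because the cosine term depends only on direction, rescaling either feature vector leaves it unchanged, whereas Smooth-L1 penalises differences in magnitude directly.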

3. Key Contributions

I2P Framework: A novel method specifically for few-shot adaptation that successfully balances style transfer and identity preservation.
Identity Injection: A mechanism to explicitly inject source identity knowledge into the target latent space, mitigating identity drift caused by random sampling.
Identity Substitution & Consistency: A dual-module approach (Decoupler + Modulator) combined with a closed-loop constraint system ( $L_c, L_s, L_r$ ) that prevents style-content entanglement and ensures robust cross-domain alignment.
State-of-the-Art Performance: Demonstrated superior results across multiple datasets and metrics compared to existing baselines (TGAN, FreezeD, CDC, RSSA, PIR, SGP).

4. Experimental Results

The method was evaluated on multiple source-target pairs (e.g., FFHQ $\to$ Sketches, FFHQ $\to$ Babies, LSUN-Churches $\to$ Van Gogh) with extremely limited data (5-shot and 10-shot).

Qualitative Results:
- I2P generates high-fidelity images that retain source identity (e.g., facial structure, object shape) while accurately adopting target styles (e.g., sketch lines, artistic strokes).
- Competing methods showed overfitting (copying training artifacts), mode collapse (lack of diversity), or identity distortion.
Quantitative Results:
- FID (Fréchet Inception Distance): I2P achieved the lowest FID scores across all datasets (e.g., 38.16 on Sketches vs. 45.01 for PIR), indicating better distribution matching.
- Intra-LPIPS: I2P achieved the highest Intra-LPIPS scores, indicating greater diversity among generated images and less mode collapse.
- Identity Metrics (DINO, CLIP-I, CLIP-T): I2P outperformed baselines in preserving structural and semantic identity (DINO/CLIP-I) while effectively transferring style (CLIP-T).
Efficiency: I2P was also competitive in computational cost, using less memory (14.7 GB) and time (88 mins) than more complex baselines such as PIR (19.2 GB, 130 mins).
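
To put the reported numbers in relative terms, the short calculation below computes how much lower each I2P figure is than the corresponding PIR figure. It uses only the values quoted above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# (PIR value, I2P value) pairs as quoted in this summary
reported = {
    "FID on Sketches": (45.01, 38.16),
    "memory (GB)": (19.2, 14.7),
    "time (min)": (130, 88),
}

for name, (pir, i2p) in reported.items():
    print(f"{name}: {relative_reduction(pir, i2p):.1%} lower than PIR")
```

This works out to roughly 15% lower FID on Sketches, 23% less memory, and 32% less time than PIR.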

5. Significance and Impact

Solving the "Identity Drift" Problem: The paper addresses a critical failure mode in few-shot learning where models lose the "essence" of the source object. I2P provides a robust solution by explicitly decoupling and re-injecting identity.
Data Efficiency: The method enables high-quality generative adaptation with as few as 5 to 10 samples, making it highly applicable to domains where data collection is expensive or difficult (e.g., medical imaging, rare art styles).
Theoretical Insight: The introduction of the Synthesis Constraint ( $L_r$ ) using cosine similarity offers a new perspective on how to constrain feature distributions without causing the over-smoothing or distortion seen in direct alignment methods.
Limitations: The authors note that the method relies on the quality of identity-preserving transformations and may struggle in domains with abstract or inconsistent identity concepts (e.g., Human $\to$ Cat).

In summary, I2P represents a significant advancement in few-shot generative modeling by treating identity preservation not just as a regularization term, but as an active injection and decoupling process, resulting in more stable, diverse, and faithful cross-domain generation.