No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

This paper introduces MoFit, a caption-free membership inference attack framework for latent diffusion models that constructs model-fitted synthetic embeddings to effectively identify training data memorization without relying on ground-truth text captions.

Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon

Published 2026-02-27

Imagine you have a magical art machine (a Latent Diffusion Model) that learned to draw pictures by studying a secret photo album. You suspect this machine memorized specific photos from that album and can reproduce them almost perfectly.

You want to catch the machine in the act: Did it memorize this specific photo you're holding, or did it just learn the general style? This is called a Membership Inference Attack (MIA).

The Problem: The Missing Recipe

Usually, to test if the machine memorized a photo, you need two things:

  1. The Photo.
  2. The Exact Caption (the "recipe") that was used to teach the machine about that photo.

Think of the caption as the specific instruction: "A golden retriever wearing a red hat." If you give the machine the photo and the exact same instruction it learned from, it says, "Oh, I know this! I drew this!"

But here's the catch: In the real world, you often only have the photo. You don't have the secret recipe. The artist who trained the machine never told you what words they used.

If you try to guess the recipe using a smart AI (a Vision-Language Model) that looks at the photo and writes a description, it usually gets close, but not an exact match. It might say, "A dog with a hat."

  • The Result: The machine gets confused. It doesn't react strongly to your "guess" recipe, whether the photo is from its secret album or not. The test fails because the signal is too weak.

The Solution: MoFit (The "Overfitting" Trick)

The authors' method, MoFit, takes a clever workaround. Instead of trying to guess the real recipe, it creates a fake, super-specific recipe that is perfectly tuned to the machine's brain, even if it doesn't match the photo perfectly.

Here is how they do it, step-by-step:

Step 1: The "Chameleon" Photo (Surrogate Optimization)

Imagine you have a photo of a cat. You want to know if the machine memorized it.
Instead of asking the machine to describe the cat, you take the photo and start tweaking it slightly (adding tiny, invisible noise). You keep tweaking it until the machine looks at it and thinks, "Wow, this looks exactly like something I've seen before!"

You aren't trying to make the photo look better; you are trying to make the photo fit perfectly into the machine's memory. You create a "Chameleon Photo" that the machine loves.
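The tweaking loop in Step 1 can be sketched as a toy gradient descent. Everything below (the quadratic loss, the vectors, the learning rate) is invented purely for illustration; the real attack perturbs the image so as to minimize the latent diffusion model's own denoising loss:

```python
import numpy as np

# Toy stand-in for the diffusion model's loss: the model is most "comfortable"
# (lowest loss) near points it memorized during training. The quadratic loss,
# the vectors, and the step size are all invented; the real attack minimizes
# the LDM's noise-prediction (denoising) loss instead.
memorized = np.array([1.0, 2.0, 3.0])          # a point the toy model "remembers"

def model_loss(x):
    return 0.5 * float(np.sum((x - memorized) ** 2))

def surrogate_optimize(photo, steps=200, lr=0.1):
    """Nudge the photo in tiny steps until it fits the model's memory."""
    x = photo.copy()
    for _ in range(steps):
        grad = x - memorized                   # analytic gradient of the toy loss
        x = x - lr * grad                      # a tiny, "invisible" tweak
    return x

photo = np.array([1.3, 1.6, 3.4])              # the photo we want to test
chameleon = surrogate_optimize(photo)
print(model_loss(chameleon) < model_loss(photo))  # → True: the chameleon fits better
```

The original photo is barely changed at each step, but after many steps the result sits squarely inside the (toy) model's comfort zone.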

Step 2: The "Perfect" Recipe (Embedding Extraction)

Now that you have this "Chameleon Photo" that the machine loves, you ask the machine: "What words would you use to describe this specific Chameleon Photo?"

The machine spits out a super-specific text embedding (a digital recipe) that is perfectly matched to that Chameleon Photo. Let's call this the "MoFit Recipe."
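A minimal sketch of "extracting the recipe," assuming a toy linear conditioning map (the matrix `W`, the dimensions, and the least-squares solve are invented for illustration; the real MoFit recipe is a text embedding fitted against the diffusion model itself):

```python
import numpy as np

# Toy conditioning map: the model turns a text embedding e into an expected
# image W @ e. Everything here is a caricature for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                    # toy "text embedding -> image" map

def extract_recipe(chameleon):
    """Return the embedding whose conditioned output best matches the photo."""
    e, *_ = np.linalg.lstsq(W, chameleon, rcond=None)
    return e

chameleon = np.array([1.0, 2.0, 3.0])          # the tuned photo from Step 1
mofit_recipe = extract_recipe(chameleon)
fit_error = np.linalg.norm(W @ mofit_recipe - chameleon)
```

By construction, no other embedding explains the Chameleon Photo better under this toy map, which is the "super-specific, perfectly matched" property the attack needs.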

Step 3: The Trap (The Mismatch)

Here is the magic trick. You take the MoFit Recipe (which was made for the Chameleon Photo) and feed it to the machine along with your Original Photo.

  • If the Original Photo is a "Member" (from the secret album): The machine's brain is wired to be very sensitive to its own training data. When you give it a recipe that is almost right but slightly off (because it was made for the Chameleon, not the Original), the machine gets very stressed. It screams, "This doesn't match my memory!" Its internal error score goes way up.
  • If the Original Photo is a "Non-Member" (new to the machine): The machine doesn't care as much. It's not used to these specific photos, so a slightly mismatched recipe doesn't bother it much. Its error score stays low.
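The trap above boils down to a thresholding rule, caricatured in a few lines: overfitting makes the model sharply sensitive to a slightly wrong recipe on its own training data, so the same mismatch produces a much larger error for a member than a non-member. The "sharpness" numbers are invented purely to illustrate the decision rule, not taken from the paper:

```python
import numpy as np

# Caricature of the trap: an overfit model reacts strongly to a slightly wrong
# recipe on data it memorized, and shrugs on data it never saw. The sharpness
# values below are invented for illustration only.
def conditioned_error(is_member, recipe, true_recipe):
    sharpness = 10.0 if is_member else 0.5     # overfitting = high sensitivity
    return sharpness * float(np.sum((recipe - true_recipe) ** 2))

true_recipe = np.array([1.0, 0.0])             # recipe that fits the Original Photo
mofit_recipe = true_recipe + 0.3               # made for the Chameleon: slightly off

def infer_membership(is_member, threshold=0.5):
    """Flag the photo as a member if the mismatch error exceeds the threshold."""
    return conditioned_error(is_member, mofit_recipe, true_recipe) > threshold

print(infer_membership(True), infer_membership(False))  # → True False
```

The real attack scores with the model's conditional loss rather than an explicit "sharpness"; the point is only that overfitting turns a small recipe mismatch into a large, thresholdable error gap.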

The Analogy: The Strict Chef vs. The Casual Cook

Think of the machine as a Strict Chef who memorized a specific cookbook.

  1. The Old Way (Guessing the Recipe): You show the Chef a dish and ask, "Did you make this?" You guess the recipe is "Spicy Chicken." The Chef says, "Maybe, maybe not." (Low accuracy).
  2. The MoFit Way:
    • You take the dish and tweak it until it looks exactly like a dish the Chef memorized.
    • You ask the Chef, "What is the name of this tweaked dish?" He writes down a very specific, complex name: "Spicy Chicken with a pinch of saffron and a hint of lemon."
    • Now, you show him the Original Dish (which is just "Spicy Chicken") but tell him the Complex Name.
    • If he memorized the dish: He panics! "Wait, my memory says it needs saffron! This is wrong!" (High Stress = MEMBER).
    • If he didn't memorize it: He shrugs. "I don't know what that is, but I'll eat it." (Low Stress = NON-MEMBER).

Why This Matters

  • Privacy: This proves that even without the secret training data (the captions), hackers can still figure out if a specific person's photo was used to train an AI.
  • Better than Guessing: The paper shows that this "Chameleon" trick works much better than just using a smart AI to guess the caption. In fact, on some tests, it worked even better than methods that did have the secret captions!
  • The Warning: It tells AI developers that they need to be more careful. Just because you hide the text descriptions doesn't mean the images are safe from being "sniffed out."

In short: MoFit tricks the AI into revealing its secrets by creating a perfect "fake match" and seeing how the AI reacts when that match is applied to the real photo. If the AI freaks out, the photo was probably in its training set all along.
