Imagine you are a teacher trying to figure out if a student cheated on a test. You have two different ways to catch them:
- The "Copycat" Test (CDD): You ask the student to solve the same problem 50 times. If they are cheating, they will have memorized the exact answer and write it down identically every single time, even if you tell them to try to be creative (in AI terms, even when the model samples with a higher temperature).
- The "Familiarity" Test (Perplexity/Min-k%): You just look at how the student thinks about the problem (in AI terms, how confidently the model predicts each word of the text). Even if they don't write the exact same answer every time, their brain reacts to the question with a strange sense of "Oh, I've seen this before!" They might stumble less or use specific words they've memorized.
This paper is about testing these two methods on smaller, smarter AI models (like a student with a smaller brain) to see which one actually works.
The Big Discovery: The "Silent Cheater"
The researchers found a major problem with the "Copycat" test (called CDD in the paper).
They discovered that CDD only works if the student has memorized the answer by rote, like a parrot. If the student has actually learned the concept but hasn't memorized the exact words, CDD fails completely.
Here is the analogy:
- The Scenario: Imagine a student is given a math problem 10 times during study.
- The "Full Memorization" (Large Models/Heavy Training): The student writes the exact same solution 10 times. If you ask them to solve it again, they write it exactly the same way every time. CDD catches this.
- The "Smart Learning" (Small Models/Light Training): The student understands the math. When you ask them to solve it 10 times, they get the right answer, but they phrase it slightly differently each time. They might say "18 minus 9" one time and "half of 18" the next.
- The Problem: Because the answers are different, the "Copycat" test (CDD) thinks, "Oh, they are being creative! They didn't cheat!"
- The Reality: They did cheat (they saw the problem before), but they are smart enough to vary their answer. CDD misses this entirely.
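The "Copycat" logic above can be sketched in a few lines. This is a simplified illustration of the idea, not the paper's exact implementation: it takes several sampled answers (here, hard-coded strings standing in for model outputs) and flags contamination only when the answers are nearly identical to one another.

```python
from difflib import SequenceMatcher
from itertools import combinations

def self_similarity(answers):
    """Mean pairwise string similarity (0 = all different, 1 = all identical)."""
    pairs = list(combinations(answers, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def cdd_flags_contamination(answers, threshold=0.9):
    """CDD-style check: contamination is flagged only when the model
    repeats itself almost verbatim across repeated samples."""
    return self_similarity(answers) >= threshold

# A "parrot" model: identical output every time -> caught.
parrot = ["The answer is 9 because 18 - 9 = 9."] * 5

# A "smart cheater": same fact, varied wording -> slips through.
flexible = [
    "The answer is 9 because 18 minus 9 equals 9.",
    "Half of 18 is 9, so the answer is 9.",
    "Subtracting 9 from 18 leaves 9.",
    "18 - 9 = 9, so the answer must be 9.",
    "Nine, since half of eighteen is nine.",
]

print(cdd_flags_contamination(parrot))    # rote memorization is caught
print(cdd_flags_contamination(flexible))  # the false negative described above
```

The second call is the blind spot in action: every answer encodes the same memorized fact, but because the surface wording varies, the similarity score stays below the threshold and the test reports "clean."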
Why Does This Happen?
The researchers tested this on small AI models (ranging from 70 million to 410 million "brain cells," or parameters). They found that:
- Small Brains + Light Training = No Parrot Effect: When you use a small model and don't train it too hard (using a method called LoRA, short for Low-Rank Adaptation, which is like giving the student a tiny cheat sheet instead of rewriting their whole brain), the model learns the pattern but doesn't freeze the exact words. It stays flexible.
- The "Threshold": There is a tipping point. If you train the model hard enough or make it big enough, it stops being flexible and starts acting like a parrot (memorizing). Only then does the "Copycat" test work.
- The Blind Spot: In the real world, we often use small models with light training to save money and time. In this "sweet spot," the "Copycat" test is useless. It gives you a false sense of security, telling you the data is clean when it's actually contaminated.
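To make "light training" concrete: with LoRA, only a small set of added low-rank matrices is trained while the original weights stay frozen. A typical configuration using the Hugging Face `peft` library looks roughly like the sketch below. The hyperparameter values and target modules here are illustrative assumptions, not the paper's setup.

```python
from peft import LoraConfig  # pip install peft

# Illustrative LoRA config: only a tiny fraction of weights become trainable,
# which is the "cheat sheet instead of rewriting the brain" regime where
# CDD's copycat signal tends not to appear.
lora_config = LoraConfig(
    r=8,                                  # low rank of the adapter matrices
    lora_alpha=16,                        # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],  # assumption: adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# Applied via peft's get_peft_model(base_model, lora_config);
# the base model's weights remain frozen.
```

The design point is that with a small rank `r`, the model can absorb the pattern of the contaminated data without overwriting enough of its weights to "freeze" exact output strings.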
The Better Solution: The "Familiarity" Test
The paper shows that the other methods (Perplexity and Min-k% Prob) are much better detectives.
- How they work: Instead of waiting for the student to write the exact same sentence 50 times, these methods look at the internal confidence of the model.
- The Analogy: Even if the student writes a different sentence, their brain still feels a "spark of recognition" when they see the question. They don't have to memorize the answer to feel familiar with the question.
- The Result: These methods caught the cheating in every single case, even when the model was being flexible and creative. They work whether the model is a parrot or a genius.
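The "familiarity" signals have simple definitions in terms of per-token log-probabilities. In this minimal sketch (the log-prob numbers are made up, standing in for a real model's output), perplexity is the exponential of the average negative log-probability, and Min-k% Prob averages only the k% least-confident tokens. Text the model has seen tends to be predicted more confidently on both measures, whether or not the model can reproduce it verbatim.

```python
import math

def perplexity(logprobs):
    """exp of the mean negative log-probability; lower = more familiar."""
    return math.exp(-sum(logprobs) / len(logprobs))

def min_k_prob(logprobs, k=0.2):
    """Mean log-prob of the k% least-confident tokens; higher = more familiar."""
    n = max(1, int(len(logprobs) * k))
    lowest = sorted(logprobs)[:n]
    return sum(lowest) / n

# Made-up token log-probs: a "seen" sentence is predicted confidently
# throughout; an "unseen" one has several surprising (very negative) tokens.
seen   = [-0.1, -0.2, -0.1, -0.3, -0.2, -0.1, -0.2, -0.1, -0.3, -0.2]
unseen = [-0.5, -2.1, -0.8, -3.4, -1.0, -0.6, -2.8, -0.9, -1.5, -0.7]

print(perplexity(seen), perplexity(unseen))  # seen text: lower perplexity
print(min_k_prob(seen), min_k_prob(unseen))  # seen text: higher min-k% score
```

Note the key contrast with the CDD sketch earlier: these scores depend only on how confidently the model reads the question, not on whether it writes the same answer twice, which is why they still fire on the "smart cheater."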
The Takeaway for Everyone
If you are trying to check if an AI has been trained on data it shouldn't have seen (like a test question):
- Don't rely on the "Copycat" test (CDD) if you are using small models or light training. It will likely tell you everything is fine when it's not. It's like checking for cheating by only looking for students who write their answers in the exact same handwriting.
- Use the "Familiarity" test instead. It's more subtle, but it catches the "smart cheaters" who learn the material without memorizing the script.
In short: The "Copycat" test only catches the students who are too lazy to think. The "Familiarity" test catches the students who are smart enough to cheat without getting caught. For small AI models, you need the second one.