Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection

Imagine you are a detective trying to solve a mystery: Did someone steal a photo and edit it to look like their own?

In the world of digital images, "Copy Detection" is the job of spotting these stolen, edited photos. For a long time, computers were good at finding exact duplicates (like a photocopy), but they struggled when the thief made clever changes, such as cropping a picture, changing the colors, or even cutting out a specific object and pasting it somewhere else.

This paper introduces a new detective team called PixTrace and CopyNCE. Here is how they work, explained simply:

1. The Problem: The "Blind" Detective

Previous AI detectives tried to solve this by looking at the whole picture at once. They would say, "These two photos look 80% similar, so they must be related."

The Flaw: If a thief takes a photo of a cat, crops out the tail, and adds a hat, the "whole picture" approach gets confused. It might miss the connection because the overall look changed too much.
The Old Way (Heuristic Matching): Some older methods tried to look at small patches (like puzzle pieces) and guess which ones matched. But they were like a detective guessing, "This patch looks kinda like that one," which led to a lot of false alarms (accusing innocent photos) or missed clues.

2. The Solution: The "Digital Paper Trail" (PixTrace)

The authors realized something brilliant: Every edit leaves a trace.

Imagine you have a transparent sheet of paper with a drawing on it (the original image).

If you stretch the paper, the drawing stretches.
If you cut a piece out, you know exactly where that piece came from.
If you rotate it, you know the new angle.

PixTrace is a system that keeps a digital logbook of every single pixel.

When the AI "edits" an image to create a training example, it doesn't just save the new picture. It saves a map that says: "The pixel at this spot in the new image came from that spot in the original image."
Even if the image is rotated, zoomed, or color-shifted, this logbook knows exactly where every pixel traveled. It's like having a GPS tracker for every single dot of color in the image.

3. The New Training Method: The "Strict Teacher" (CopyNCE)

Now that they have this perfect map (PixTrace), they needed a way to teach the AI to use it. Enter CopyNCE.

Think of the AI as a student taking a test.

Old Teachers: Would show the student two photos and say, "These are similar." But they didn't explain why or which parts matched. The student would guess, often getting it wrong.
The CopyNCE Teacher: Uses the PixTrace logbook to be incredibly specific. It points to a patch on the "stolen" image and says, "Look! This specific patch here corresponds to this specific patch there. They overlap by 40%."

It forces the AI to learn the geometry of the theft. It teaches the AI: "Don't just guess that the whole image is similar. Prove that the specific pieces fit together like a puzzle."

4. The Result: A Super-Detective

Because the AI learned to track the "footprints" of pixels, it became incredibly good at spotting even the most sophisticated edits.

Performance: On a major benchmark (the DISC21 dataset from the NeurIPS 2021 Image Similarity Challenge), this new method achieved state-of-the-art results. It found stolen images that other methods missed.
Interpretability: Unlike other "black box" AI models that just give a "Yes/No" answer, this system can actually show you where the copy happened. It can highlight the specific squirrel in a photo that was copied from another image, proving it knows exactly what it's looking at.

Summary Analogy

Imagine trying to find a specific page torn out of a book and pasted into another book.

Old AI: Looks at the two books and says, "They look similar enough." (Often wrong).
PixTrace + CopyNCE: Keeps a record of every page number. It sees the torn page, tracks its page number, and says, "Aha! Page 42 in Book B is definitely Page 42 from Book A, even though the ink color changed."

By following the "footprints" of the pixels, this new method makes it much harder for image thieves to hide their edits.

1. Problem Statement

Image Copy Detection (ICD) aims to determine whether a query image is a copy of an existing original, whether an exact duplicate, a near-duplicate, or a heavily edited version. While Self-Supervised Learning (SSL) has become the dominant paradigm for training ICD models, current state-of-the-art (SOTA) methods suffer from significant limitations:

View-Level vs. Fine-Grained: Existing SSL approaches primarily rely on view-level contrastive learning, neglecting correspondences at the region or patch level.
Noisy Supervision: Methods attempting to learn fine-grained correspondences often use heuristic strategies (e.g., Nearest Neighbor matching based on features or patch centroids). These approaches are prone to false matches (treating negative samples as positive) and partial matches (missing positive samples), introducing conflicting gradient signals that hinder model convergence.
Lack of Geometric Traceability: Current methods fail to leverage the inherent geometric "traceability" of pixels in edited content, where pixel coordinates can be deterministically mapped back to the original image through transformation functions.

2. Methodology

The authors propose a framework that bridges pixel-level traceability with patch-level similarity learning through two core innovations: PixTrace and CopyNCE.

A. PixTrace: Pixel Coordinate Tracking

PixTrace is a pipeline designed to maintain explicit spatial mappings across editing transformations.

Mechanism: It utilizes a Coordinate Table ($T$) initialized such that every pixel maps to itself ($T[m,n] = [m,n]$).
Transformation: As a sequence of edits (e.g., affine transforms, perspective changes, image matting, color jitter) is applied to an original image ($I_o$) to generate a copy ($I_a$), the corresponding transformation functions are sequentially applied to the coordinate table.
Result: The table $T_{ao}$ maps every pixel in the copy image back to its origin in the original image. Conversely, $T_{oa}$ (the reverse table) maps original pixels to the copy.
Cross-Image Tracking: By using the original image as a bridge, PixTrace can track correspondences between two different edited copies ( $I_a$ and $I_b$ ) derived from the same source, calculating the overlap ratio between any two patches with mathematical precision.
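
The coordinate-table idea above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes only integer-preserving edits (crop and horizontal flip), image sizes divisible by the patch size, and an overlap ratio defined as shared source pixels divided by patch area. The function names (`identity_table`, `crop`, `hflip`, `overlap_ratio`) are hypothetical. Rotations or perspective warps would need fractional coordinates and resampling, which this sketch leaves out.

```python
import numpy as np

def identity_table(h, w):
    """Coordinate table with T[m, n] = [m, n] for an h x w source image."""
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([rows, cols], axis=-1)  # shape (h, w, 2)

# Each edit is applied to the table exactly as it would be applied to the
# image, so every output cell still stores its source coordinate.
def crop(table, top, left, height, width):
    return table[top:top + height, left:left + width]

def hflip(table):
    return table[:, ::-1]

def patch_ids(h, w, patch):
    """Row-major patch index of every pixel in an h x w image."""
    rows, cols = np.meshgrid(np.arange(h) // patch, np.arange(w) // patch,
                             indexing="ij")
    return rows * (w // patch) + cols

def overlap_ratio(table_a, table_b, src_shape, patch):
    """ratio[i, j] = shared source pixels of patch i (copy a) and
    patch j (copy b), divided by the patch area (assumed definition)."""
    src_h, src_w = src_shape
    ha, wa = table_a.shape[:2]
    hb, wb = table_b.shape[:2]
    n_a = (ha // patch) * (wa // patch)
    n_b = (hb // patch) * (wb // patch)
    # Source pixel -> patch id in copy b (-1 if the pixel is not in b).
    src_to_b = np.full(src_h * src_w, -1)
    flat_b = table_b[..., 0] * src_w + table_b[..., 1]
    src_to_b[flat_b.ravel()] = patch_ids(hb, wb, patch).ravel()
    # For every pixel of copy a, look up where its source pixel landed in b.
    flat_a = table_a[..., 0] * src_w + table_a[..., 1]
    pb = src_to_b[flat_a.ravel()]
    pa = patch_ids(ha, wa, patch).ravel()
    counts = np.zeros((n_a, n_b))
    visible = pb >= 0
    np.add.at(counts, (pa[visible], pb[visible]), 1)
    return counts / (patch * patch)

source = identity_table(8, 8)
copy_a = crop(source, 0, 0, 4, 4)          # plays the role of T_ao
copy_b = hflip(crop(source, 0, 1, 4, 4))   # plays the role of T_bo
ratios = overlap_ratio(copy_a, copy_b, (8, 8), patch=2)
```

Here the original image acts as the bridge: both tables point into the same source grid, so patch overlap between two copies reduces to counting shared source pixels.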

B. CopyNCE: Geometrically-Guided Contrastive Loss

CopyNCE is a novel contrastive loss function that regularizes patch affinity using the precise overlap ratios derived from PixTrace.

Decomposition: Instead of treating the whole image as a single unit, the method decomposes the query and reference regions into minimal unit patches (e.g., the $16 \times 16$-pixel patch tokens of a ViT).
Prior Distribution: Unlike standard InfoNCE, which assumes a single positive sample, CopyNCE acknowledges that a query patch may correspond to multiple reference patches with varying degrees of overlap. It defines a prior target distribution $q(R^r_j | R^q_i)$ over reference patches $R^r_j$ for each query patch $R^q_i$, based on the pixel overlap ratio calculated by PixTrace.
Loss Formulation: The loss minimizes the Kullback-Leibler (KL) divergence between the model's predicted affinity distribution and the geometrically-grounded prior distribution.
- It incorporates a confidence sharpening parameter ($\gamma$) to modulate the certainty of the prior.
- It constructs a noise set containing all patches from the reference image, including "hard negatives" (spatially adjacent patches with high visual similarity), to force the model to learn discriminative features.
Symmetry: The final loss is symmetric, applying the logic in both directions (Query $\to$ Reference and Reference $\to$ Query).
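
A rough NumPy sketch of a loss with this shape is given below. It is an illustration under stated assumptions, not the paper's formulation: the prior is assumed to be the overlap ratio raised to the power $\gamma$ and normalized per query patch, the predicted affinity is assumed to be a temperature-scaled softmax over dot products of unit-norm patch features, and query patches with no overlap are simply skipped. The names `directional_loss`, `copynce_sketch`, and the temperature `tau` are hypothetical.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def directional_loss(feat_q, feat_r, overlap, gamma=2.0, tau=0.1):
    """Mean KL(prior || predicted) over query patches that have overlap.

    feat_q: (Nq, D) and feat_r: (Nr, D) patch features, assumed unit-norm.
    overlap: (Nq, Nr) overlap ratios, e.g. from a coordinate table.
    """
    # Predicted affinity over ALL reference patches (the noise set).
    log_p = log_softmax(feat_q @ feat_r.T / tau, axis=1)
    # Assumed sharpening: larger gamma concentrates the prior on the
    # reference patches with the highest overlap.
    sharpened = overlap ** gamma
    mass = sharpened.sum(axis=1, keepdims=True)
    valid = mass[:, 0] > 0
    if not valid.any():
        return 0.0
    prior = sharpened[valid] / mass[valid]
    # Convention 0 * log(0) = 0 for patches outside the prior's support.
    log_prior = np.log(np.where(prior > 0, prior, 1.0))
    kl = (prior * (log_prior - log_p[valid])).sum(axis=1)
    return float(kl.mean())

def copynce_sketch(feat_q, feat_r, overlap, gamma=2.0, tau=0.1):
    """Symmetric version: query -> reference plus reference -> query."""
    forward = directional_loss(feat_q, feat_r, overlap, gamma, tau)
    backward = directional_loss(feat_r, feat_q, overlap.T, gamma, tau)
    return 0.5 * (forward + backward)
```

With features that agree with the overlap matrix the loss is close to zero, and it grows when the highest-affinity reference patch is not the one the geometry points to, which is the regularizing effect described above.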

C. Model Architecture

The framework supports two model types, both based on Vision Transformers (ViT):

Descriptor: Extracts features for global image retrieval. CopyNCE is applied as an auxiliary loss to regularize patch tokens.
Matcher: Takes image pairs as input for binary classification. It uses an encoder-fusion architecture where CopyNCE supervises the fused tokens to enhance local matching capabilities.

3. Key Contributions

PixTrace Pipeline: A comprehensive coordinate mapping system that eliminates ambiguous supervision by providing exact pixel-wise correspondences between edited images, overcoming the noise inherent in heuristic nearest-neighbor methods.
CopyNCE Loss: A geometrically-guided contrastive loss that translates pixel-level traceability into patch-level supervision. It regularizes patch affinity using overlap ratios, effectively suppressing noise from non-corresponding areas.
State-of-the-Art Performance: The method achieves SOTA results on the DISC21 dataset, outperforming existing competition solutions and baselines in both matcher and descriptor tasks.
Enhanced Interpretability: The method provides clear visualizations of copy regions through affinity heatmaps and entropy analysis, demonstrating that the model correctly identifies edited areas rather than relying on semantic hallucinations.

4. Experimental Results

Experiments were conducted on the DISC21 dataset (NeurIPS 2021 Image Similarity Challenge) and the more challenging NDEC dataset.

Matcher Performance:
- Achieved 88.7% $\mu$AP and 83.9% RP90 on the DISC21 test set (using ViT-S at $336 \times 336$ resolution).
- This surpasses the previous SOTA (D2LV) by 0.1% $\mu$AP and 3.8% RP90, despite D2LV using an ensemble of 33 models.
- Without ensembling, CopyNCE still significantly outperforms separate ViT-B baselines.
Descriptor Performance:
- Achieved 72.6% $\mu$AP and 68.4% RP90 on DISC21.
- Outperforms other SOTA descriptors (e.g., SSCD, Lyakaap) by significant margins (e.g., +5.3% RP90 over SSCD under similar settings).
Generalization:
- The method showed strong generalization on the AnyPattern dataset (unseen copy edits) and VSC2022 (video copy detection), indicating robustness against aggressive and complex transformations.
Ablation Studies:
- Replacing PixTrace with heuristic methods (FeatNN, LocNN) resulted in significant performance drops, confirming the necessity of precise geometric supervision.
- The use of Global Hard Negative Mining (GHNM) was shown to be critical for descriptor performance.
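
For readers unfamiliar with the two metrics quoted above, the sketch below shows one common way to compute them, assuming $\mu$AP is micro-average precision over all predicted query-reference pairs ranked jointly by score, and RP90 is the highest recall reached at a precision of at least 90%. It is a simplified illustration rather than the official DISC21 evaluation code (ties, for instance, are not treated specially), and the function name is hypothetical.

```python
import numpy as np

def micro_ap_and_rp90(scores, is_correct, num_positives):
    """scores: confidence of each predicted (query, reference) pair.
    is_correct: 1 if the pair is a true copy, else 0.
    num_positives: total number of true copy pairs in the ground truth,
    including pairs that were never predicted."""
    scores = np.asarray(scores, dtype=float)
    hits = np.asarray(is_correct, dtype=float)
    order = np.argsort(-scores, kind="stable")  # rank all pairs jointly
    hits = hits[order]
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, len(hits) + 1)
    recall = true_pos / num_positives
    # Average precision: precision summed at each correct pair,
    # normalized by the number of ground-truth positives.
    micro_ap = float((precision * hits).sum() / num_positives)
    meets_p90 = precision >= 0.9
    rp90 = float(recall[meets_p90].max()) if meets_p90.any() else 0.0
    return micro_ap, rp90

micro_ap, rp90 = micro_ap_and_rp90(
    scores=[0.9, 0.8, 0.7, 0.6], is_correct=[1, 0, 1, 0], num_positives=2)
```

Because all pairs are ranked together, a single global threshold has to work across every query, which is why calibrated scores matter for both metrics.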

5. Significance and Impact

Bridging the Gap: This work connects pixel-level geometric traceability with deep learning-based feature representation, a link that previous SSL methods failed to make because they relied on noisy heuristics.
Interpretability: By leveraging explicit coordinate mappings, the model's decision-making process becomes more interpretable. The affinity heatmaps clearly highlight the specific edited regions, addressing the "black box" nature of many deep learning copy detectors.
Efficiency vs. Accuracy: While the method utilizes complex training pipelines (coordinate tracking), it achieves superior results with relatively efficient architectures (ViT-S) compared to massive ensembles used by competitors.
Future Direction: The paper highlights that while Local Crop Ensembling (LCE) boosts performance, it is computationally expensive. The core contribution (CopyNCE) remains effective even without LCE, suggesting a path toward efficient, high-accuracy copy detection in real-world applications.

In conclusion, CopyNCE moves Image Copy Detection from heuristic, noise-prone correspondence learning to a deterministic, geometry-aware framework, reporting state-of-the-art results on DISC21 together with more interpretable predictions.