FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

Imagine you are a detective trying to spot a fake painting in a museum. For years, your tools have been good at looking at the colors and shapes (the "spatial" view) to see if something looks weird. But as forgers get smarter, using powerful AI to create near-perfect fakes, your old tools are starting to fail. The fake paintings look so real that the colors match perfectly, yet the texture of the paint or the fine pattern of the brushstrokes might be slightly off.

This paper introduces a new detective tool called FOCA (Frequency-Oriented Cross-Domain Forgery Detection). Here is how it works, explained simply:

1. The Problem: The "Too Perfect" Fake

Traditional image detectors are like art critics who only look at the picture from a distance. They check if the story makes sense (e.g., "Does that cat have six legs?"). But modern AI can fix those obvious mistakes.

The Flaw: Old detectors ignore the "invisible" clues. They don't look at the frequency content of the image: the tiny, fine-grained details like noise, compression artifacts, or odd texture patterns that human eyes miss but an algorithm can measure.
The Result: Old detectors get tricked easily, and even when they spot a fake, they can't explain why in a way humans understand.

2. The Solution: FOCA's "Super-Vision"

FOCA is like giving your detective a pair of magic glasses that let them see two worlds at once:

The RGB World: The normal picture you see (colors and shapes).
The Frequency World: A hidden layer showing the "vibrations" and fine textures of the image.

How it works (The "Cross-Domain Fusion"):
Imagine you are listening to a song.

RGB is the melody (the main tune).
Frequency is the background static or the specific quality of the instruments.
FOCA's Secret Sauce (FAF Module): It uses a special "mixing board" (Cross-Attention) to listen to the melody and the static at the same time. If the melody sounds perfect but the static sounds like it was recorded in a different room, FOCA knows it's a fake. It fuses these two views so the AI can say, "This looks real, but the texture here is suspicious."

3. The "Brain" Upgrade: Talking to the Image

Most detectors just give you a red box around the fake part and say "Fake." FOCA is different because it's built on a Multimodal Large Language Model (MLLM).

Think of it as a Detective with a Voice: Instead of just pointing, FOCA can talk to you.
The Output: It doesn't just say "Tampered." It says: "This image is tampered. The grass in the bottom left corner looks unnatural because the frequency patterns are too smooth, which suggests it was AI-generated."
It uses special tokens (like [SEG] and [CLS]) to act as a highlighter pen, drawing the exact fake area while writing a report on why.

4. The Training Ground: FSE-Set

To teach this detective, the researchers built a massive new training school called FSE-Set.

The Curriculum: It contains 100,000 images (50,000 real, 50,000 fake).
The Twist: Unlike old schools that just showed pictures, this school shows the pictures and their "frequency fingerprints." It also teaches the AI to write explanations in English, not just math.
The Teachers: They used advanced AI tools (like Stable Diffusion and Language-SAM) to create realistic fakes and then used another AI (Claude) to write the "answer keys" explaining exactly what was wrong in both the visual and frequency domains.

5. The Results: Why It Wins

When they put FOCA to the test against other top detectives:

Accuracy: It caught more fakes than anyone else (96.2% accuracy).
Precision: It drew the "fake" lines more accurately, pinpointing the exact pixels.
Explainability: This is the big win. While other models just gave a score, FOCA gave a human-readable explanation. It could tell you where the forgery was and why the frequency domain gave it away.

The Big Picture

FOCA is a game-changer because it stops relying only on what the image looks like and starts analyzing what the image feels like (in terms of data patterns). By combining a powerful AI brain with a "frequency microscope," it can spot the most sophisticated digital forgeries and explain them to us in plain English, helping us trust what we see in the digital world again.

1. Problem Statement

The rapid advancement of generative AI models has led to highly realistic image tampering, posing severe challenges to media verification and digital forensics. Existing Image Forgery Detection and Localization (IFDL) methods face three critical limitations:

Over-reliance on Semantic Content: Most methods focus on RGB spatial features or high-level semantics, often neglecting subtle low-level textural cues and high-frequency artifacts left by tampering.
Limited Interpretability: Traditional models typically output only detection scores or binary masks without providing human-interpretable explanations regarding why an image is fake or where the manipulation occurred in the frequency domain.
Gap in MLLMs: While Multimodal Large Language Models (MLLMs) offer strong semantic reasoning, current implementations operate exclusively in the RGB domain, missing forensic traces invisible to the naked eye.

2. Methodology: The FOCA Framework

The authors propose FOCA, a framework that integrates semantic reasoning with frequency-domain forensic cues to achieve detection, localization, and explanation simultaneously.

A. Architecture Overview

FOCA takes an input image ( $x_{img}$ ) and a text instruction ( $x_{txt}$ ) to generate three outputs:

Detection Result ( $\hat{D}$ ): A classification of Real vs. Tampered.
Localization Mask ( $\hat{M}$ ): A pixel-level segmentation of the tampered area.
Textual Explanation ( $\hat{T}$ ): A natural language description of the artifacts.

The architecture consists of three main components:

Frequency Attention Fusion (FAF) Module:
- Feature Extraction: Uses Discrete Wavelet Transform (DWT) to decompose the input image into four sub-bands ( $x_{LL}, x_{LH}, x_{HL}, x_{HH}$ ). The high-frequency $x_{HH}$ sub-band is selected as it effectively reveals subtle tampering artifacts.
- Cross-Attention Fusion: The $x_{HH}$ sub-band acts as the Query, while the original RGB image acts as the Key and Value. This allows the model to dynamically retrieve structurally relevant regions from the spatial domain guided by high-frequency cues.
- Residual Connection: The attended features are fused with the original image via a residual connection to preserve low/mid-frequency information while amplifying tampering-sensitive details.
- Contrastive Learning: An auxiliary contrastive loss (InfoNCE) is applied to the fused features to enforce discriminative, tampering-aware representations.
MLLM Backbone:
- Based on LISA-7B, the model is extended with two special tokens: [CLS] for detection classification and [SEG] for segmentation.
- The model is fine-tuned using LoRA (Low-Rank Adaptation) to update only a small number of parameters, ensuring efficiency while retaining pre-trained knowledge.
Segmentation Module:
- Uses a frozen image encoder (SAM) and a decoder to generate the pixel-level mask ( $\hat{M}$ ) based on the [SEG] token embeddings.
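The FAF steps described above (wavelet decomposition, HH-as-Query cross-attention, residual fusion) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single-channel image, a one-level Haar wavelet, non-overlapping patch tokens, and random projection matrices. The helper names (haar_dwt2, to_tokens, faf_fuse), the patch sizes, and the dimensions are all hypothetical, and the LH/HL naming convention varies between libraries.

```python
import numpy as np


def haar_dwt2(x):
    """One-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns (LL, LH, HL, HH), each of shape (H/2, W/2).
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a + b - c - d) / 2.0  # detail between the two rows of a block
    hl = (a - b + c - d) / 2.0  # detail between the two columns of a block
    hh = (a - b - c + d) / 2.0  # diagonal (high/high) detail
    return ll, lh, hl, hh


def to_tokens(x, patch):
    """Split an (H, W) array into flattened non-overlapping patch tokens."""
    h, w = x.shape
    x = x.reshape(h // patch, patch, w // patch, patch)
    return x.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def faf_fuse(image, wq, wk, wv, patch=2):
    """HH tokens act as Query, image tokens as Key/Value, then residual add."""
    _, _, _, hh = haar_dwt2(image)
    q_tok = to_tokens(hh, patch)          # (N, patch^2), half-resolution band
    kv_tok = to_tokens(image, 2 * patch)  # (N, (2*patch)^2), same N as q_tok
    q, k, v = q_tok @ wq, kv_tok @ wk, kv_tok @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) attention weights
    fused = kv_tok + attn @ v             # residual keeps the original content
    return fused, attn


# Toy usage with random data and random (untrained) projections.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
wq = rng.normal(size=(4, 8))          # HH token dim 4 -> attention dim 8
wk = rng.normal(size=(16, 8))         # image token dim 16 -> attention dim 8
wv = rng.normal(size=(16, 16)) * 0.1  # value projection keeps token dim
fused, attn = faf_fuse(image, wq, wk, wv)
print(fused.shape, attn.shape)
```

Using a patch size twice as large for the image as for the HH sub-band keeps the number of Query tokens equal to the number of Key/Value tokens, which is what makes the residual addition well-defined in this toy setup. How the paper aligns the two resolutions is not stated in this summary.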

B. Training Objectives

The model is optimized using a joint loss function:
$L = L_{pred} + \lambda_c L_{cl}$

Prediction Loss ( $L_{pred}$ ): Combines Cross-Entropy for text generation ( $L_t$ ), classification ( $L_{cls}$ ), and a composite loss (Binary Cross-Entropy + Dice Loss) for mask generation ( $L_{mask}$ ).
Contrastive Loss ( $L_{cl}$ ): Maximizes agreement between positive pairs (same image) and minimizes similarity with negatives to enhance feature discriminability.
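As a rough sketch of how these terms fit together, the following NumPy code implements a BCE + Dice mask loss, an InfoNCE loss over paired feature vectors, and the joint objective $L = L_{pred} + \lambda_c L_{cl}$. Several details are assumptions rather than facts from the paper: $L_{pred}$ is taken as an unweighted sum of its three parts, and the temperature (0.07) and $\lambda_c$ (0.1) values are placeholders.

```python
import numpy as np


def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


def dice_loss(prob, target, eps=1e-7):
    """1 - Dice coefficient; 0 when prediction and mask overlap perfectly."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))


def mask_loss(prob, target):
    """Composite mask loss L_mask = BCE + Dice (equal weights assumed)."""
    return bce_loss(prob, target) + dice_loss(prob, target)


def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE over N pairs: row i of `anchors` matches row i of `positives`.

    All other rows in the batch serve as negatives. The temperature value
    is a placeholder, not taken from the paper.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))         # positives on the diagonal


def joint_loss(l_t, l_cls, l_mask, l_cl, lambda_c=0.1):
    """L = L_pred + lambda_c * L_cl, with L_pred = L_t + L_cls + L_mask (assumed)."""
    l_pred = l_t + l_cls + l_mask
    return l_pred + lambda_c * l_cl


# Toy usage: a perfect mask and perfectly aligned feature pairs.
mask = np.array([[1.0, 0.0], [1.0, 1.0]])
feats = np.eye(4)
print(mask_loss(mask, mask), info_nce(feats, feats))
```

Both toy losses come out close to zero because the prediction equals the target and every anchor is most similar to its own positive; shuffling the positives raises the InfoNCE value sharply.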

3. Key Contributions

A. The FOCA Framework

FOCA is the first MLLM-based framework to integrate semantic reasoning with frequency-domain forensic cues. By fusing RGB and wavelet-frequency features via cross-attention, it detects subtle inconsistencies that pure spatial models miss, while providing explicit, human-readable explanations for both domains.

B. The FSE-Set Dataset

The authors constructed FSE-Set, a large-scale dataset containing:

100,000 images: 50K authentic (from ImageNet) and 50K tampered (from COCO).
Diverse Manipulations: Includes 25K traditional edits (splicing, copy-move) and 25K AI-generated edits (using Stable Diffusion inpainting).
Dual-Domain Annotations: Unlike existing datasets, FSE-Set provides pixel-level masks and textual explanations for both the RGB image and its HH frequency sub-band, facilitating explainable cross-domain analysis.

C. Comprehensive Evaluation

The paper demonstrates that FOCA outperforms state-of-the-art (SOTA) methods in detection accuracy, localization precision, and the quality of generated explanations.

4. Experimental Results

Detection Performance

vs. Traditional Methods: On FSE-Set, FOCA achieved an Overall Accuracy of 96.2% and an F1 Score of 96.2%, outperforming methods like CNNSpot, Fusing, and DRCT. Notably, it showed a better balance between Real and Tampered class detection than Fusing (which had higher Real accuracy but lower Tampered detection).
vs. MLLM-based Methods: FOCA outperformed the other MLLM-based methods (LISA, Qwen, InternVL3, SIDA). It achieved 96.2% Accuracy/F1, ahead of the closest competitor, SIDA (95.6%), a gain attributed to its use of frequency-domain information.

Localization Performance

FOCA achieved SOTA results on the FSE-Set and Columbia datasets.
On FSE-Set, it improved on SIDA by 0.7 in IoU and 0.7 in F1.
The FAF module was identified as the key driver for this improvement, enabling precise region-level localization by associating semantic inconsistencies with high-frequency traces.
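For readers unfamiliar with the two localization metrics, here is a minimal sketch of pixel-level IoU and F1 for a single predicted mask. The function name mask_iou_f1 is hypothetical, and how the paper aggregates scores across images (per-image averaging versus pooled pixel counts) is not stated in this summary.

```python
import numpy as np


def mask_iou_f1(pred, gt):
    """Pixel-level IoU and F1 between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # tampered pixels correctly flagged
    fp = np.logical_and(pred, ~gt).sum()   # authentic pixels wrongly flagged
    fn = np.logical_and(~pred, gt).sum()   # tampered pixels missed
    union = tp + fp + fn
    iou = tp / union if union else 1.0     # two empty masks count as a match
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return float(iou), float(f1)


# Toy usage: ground truth covers a 2x2 block, prediction covers a 2x3 block.
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:2] = True
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True
iou, f1 = mask_iou_f1(pred, gt)
print(round(iou, 3), round(f1, 3))
```

In this toy case the prediction contains all 4 tampered pixels plus 2 extra ones, so IoU is 4/6 and F1 is 8/10. F1 is always at least as large as IoU for the same mask pair, which is why the two numbers are usually reported side by side.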

Explanation Quality

Explanations were evaluated using ROUGE-L, Cosine Semantic Similarity (CSS), and an LLM-as-a-Judge (GPT-4o) scoring system.
FOCA achieved the highest scores across all metrics, demonstrating its ability to generate high-quality, natural language explanations that accurately describe tampering artifacts in both spatial and frequency contexts.
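The two automatic text metrics can be illustrated with a short, dependency-free sketch. ROUGE-L is computed here as the F-measure of the longest common subsequence over whitespace tokens. For the cosine score, the embedding model used in the paper is not named in this summary, so the example substitutes simple word-count vectors as a stand-in; that substitution and all function names are assumptions.

```python
import math
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure over lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def bow_cosine(text_a, text_b):
    """Cosine similarity of word-count vectors (stand-in for real embeddings)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(n * n for n in va.values())) * math.sqrt(
        sum(n * n for n in vb.values())
    )
    return dot / norm if norm else 0.0


# Toy usage with two short, made-up explanations.
predicted = "the grass looks too smooth"
reference = "the grass texture looks too smooth"
print(rouge_l_f1(predicted, reference), bow_cosine(predicted, reference))
```

Here the whole 5-word prediction appears in order inside the 6-word reference, so precision is 1, recall is 5/6, and the ROUGE-L F-measure is 10/11. Overlap metrics like these reward matching wording, which is presumably why an LLM judge is used alongside them to assess whether the explanation is actually correct.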

5. Significance

Bridging the Gap: FOCA successfully bridges the gap between high-level semantic reasoning (MLLMs) and low-level forensic analysis (Frequency Domain), addressing the "black box" nature of current forgery detectors.
Interpretability: By providing human-interpretable explanations for why an image is fake (e.g., "irregularities in the high-frequency edge details"), it enhances trust and usability in digital forensics.
Resource Creation: The introduction of FSE-Set fills a critical void in the community by providing a dataset with dual-domain annotations, enabling future research into explainable AI for image forensics.
Robustness: The framework's ability to handle both traditional manipulations and advanced AI-generated edits makes it a robust solution for the evolving landscape of media misinformation.