FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

This paper proposes FOCA, a multimodal large language model framework that integrates RGB spatial and frequency domain features via cross-attention to achieve accurate image forgery detection, localization, and human-interpretable explanations, supported by a new large-scale dataset called FSE-Set.

Zhou Liu, Tonghua Su, Hongshi Zhang, Fuxiang Yang, Donglin Di, Yang Song, Lei Fan

Published 2026-02-24

Imagine you are a detective trying to spot a fake painting in a museum. For years, your tools have been good at looking at the colors and shapes (the "spatial" view) to see if something looks weird. But as forgers get smarter—using powerful AI to create perfect fakes—your old tools are starting to fail. The fake paintings look so real that the colors match perfectly, but the texture of the paint or the vibrations in the brushstrokes might be slightly off.

This paper introduces a new detective tool called FOCA (Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation). Here is how it works, explained simply:

1. The Problem: The "Too Perfect" Fake

Traditional image detectors are like art critics who only look at the picture from a distance. They check if the story makes sense (e.g., "Does that cat have six legs?"). But modern AI can fix those obvious mistakes.

  • The Flaw: Old detectors ignore the "invisible" clues. They don't look at the frequency content of the image: the fine, high-frequency details like noise, compression artifacts, or odd texture patterns that human eyes can't see but a frequency analysis can reveal.
  • The Result: Old detectors get tricked easily, and even when they spot a fake, they can't explain why in a way humans understand.

2. The Solution: FOCA's "Super-Vision"

FOCA is like giving your detective a pair of magic glasses that let them see two worlds at once:

  1. The RGB World: The normal picture you see (colors and shapes).
  2. The Frequency World: A hidden layer showing the "vibrations" and fine textures of the image.
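
To make the "Frequency World" concrete, here is a minimal sketch of turning a tiny grayscale patch into a frequency view and measuring how much of its energy sits in fine detail. This summary does not say which transform FOCA uses, so the plain 2D discrete Fourier transform, the function names `dft2` and `high_freq_ratio`, and the default cutoff are all assumptions made for illustration.

```python
import cmath


def dft2(image):
    """Naive 2D discrete Fourier transform of a small grayscale image (list of rows)."""
    h, w = len(image), len(image[0])
    spectrum = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * cmath.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def high_freq_ratio(image, cutoff=None):
    """Fraction of spectral energy above `cutoff` cycles along either axis."""
    h, w = len(image), len(image[0])
    if cutoff is None:
        cutoff = min(h, w) // 4  # hypothetical default, not taken from the paper
    spectrum = dft2(image)
    total = 0.0
    high = 0.0
    for u in range(h):
        for v in range(w):
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            # Indices u and h - u describe the same frequency magnitude.
            if min(u, h - u) > cutoff or min(v, w - v) > cutoff:
                high += energy
    return high / total if total else 0.0


# A perfectly flat patch versus a pixel-level checkerboard.
flat = [[0.5] * 8 for _ in range(8)]
checker = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
print(high_freq_ratio(flat))
print(high_freq_ratio(checker))
```

The flat patch comes out at essentially zero, while the checkerboard puts about half of its energy in the highest frequency. Both patches have similar average brightness, which is the point: the difference only shows up in the frequency view.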

How it works (The "Cross-Domain Fusion"):
Imagine you are listening to a song.

  • RGB is the melody (the main tune).
  • Frequency is the background static or the specific quality of the instruments.
  • FOCA's Secret Sauce (FAF Module): It uses a special "mixing board" (Cross-Attention) to listen to the melody and the static at the same time. If the melody sounds perfect but the static sounds like it was recorded in a different room, FOCA knows it's a fake. It fuses these two views so the AI can say, "This looks real, but the texture here is suspicious."
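
The "mixing board" idea can be sketched as single-head cross-attention, where each RGB feature looks up the frequency features most related to it and adds them to itself. This is a minimal sketch under stated assumptions, not the FAF module itself: the summary does not say which stream acts as the query, so RGB-as-query is an assumption, and the learned projection matrices a real model would use are left out for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(rgb_tokens, freq_tokens):
    """Fuse RGB tokens (queries) with frequency tokens (keys and values).

    Assumption: identity projections and a residual add, chosen for illustration.
    Returns the fused tokens and the attention weights per RGB token.
    """
    dim = len(rgb_tokens[0])
    fused, all_weights = [], []
    for query in rgb_tokens:
        # Scaled dot product between this RGB token and every frequency token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in freq_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the frequency tokens.
        attended = [
            sum(w * value[i] for w, value in zip(weights, freq_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the original RGB information.
        fused.append([q + a for q, a in zip(query, attended)])
        all_weights.append(weights)
    return fused, all_weights


rgb = [[1.0, 0.0]]
freq = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_attention(rgb, freq)
print(weights[0], fused[0])
```

In this toy input the RGB token attends more strongly to the first frequency token, because the two point in the same direction.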

3. The "Brain" Upgrade: Talking to the Image

Most detectors just give you a red box around the fake part and say "Fake." FOCA is different because it's built on a Multimodal Large Language Model (MLLM).

  • Think of it as a Detective with a Voice: Instead of just pointing, FOCA can talk to you.
  • The Output: It doesn't just say "Tampered." It says: "This image is tampered. The grass in the bottom left corner looks unnatural because the frequency patterns are too smooth, which suggests it was AI-generated."
  • It uses special tokens (like [SEG] and [CLS]) to act as a highlighter pen, drawing the exact fake area while writing a report on why.
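
One way to picture the "highlighter pen" is shown below. The summary does not describe how FOCA consumes these tokens. In LISA-style segmentation models, the hidden state at a [SEG] token is typically handed to a mask decoder, so this sketch assumes something similar here, with the [CLS] state going to a real-versus-fake classifier. The helper names and the toy hidden states are hypothetical.

```python
def pick_token_state(tokens, hidden_states, special):
    """Return the hidden vector at the last occurrence of `special`, or None if absent."""
    for position in range(len(tokens) - 1, -1, -1):
        if tokens[position] == special:
            return hidden_states[position]
    return None


def route_outputs(tokens, hidden_states):
    """Split a generated sequence into its three uses (assumed routing, see lead-in)."""
    return {
        # Would feed a real/fake classification head.
        "cls_state": pick_token_state(tokens, hidden_states, "[CLS]"),
        # Would feed a mask decoder that draws the tampered region.
        "seg_state": pick_token_state(tokens, hidden_states, "[SEG]"),
        # Everything else is the human-readable explanation.
        "explanation": " ".join(t for t in tokens if t not in ("[CLS]", "[SEG]")),
    }


tokens = ["[CLS]", "The", "grass", "is", "tampered", "[SEG]"]
# Toy two-dimensional hidden states, one per token.
hidden_states = [[float(i), float(i) * 2] for i in range(len(tokens))]
routed = route_outputs(tokens, hidden_states)
print(routed["explanation"])
```

The design point is that a single generated sequence carries all three outputs: a verdict, a region, and a written reason.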

4. The Training Ground: FSE-Set

To teach this detective, the researchers built a massive new training school called FSE-Set.

  • The Curriculum: It contains 100,000 images (50,000 real, 50,000 fake).
  • The Twist: Unlike old schools that just showed pictures, this school shows the pictures and their "frequency fingerprints." It also teaches the AI to write explanations in English, not just math.
  • The Teachers: They used advanced AI tools (like Stable Diffusion and Language-SAM) to create realistic fakes and then used another AI (Claude) to write the "answer keys" explaining exactly what was wrong in both the visual and frequency domains.

5. The Results: Why It Wins

When they put FOCA to the test against other top detectives:

  • Accuracy: It caught more fakes than the other models it was compared against (96.2% accuracy).
  • Precision: It drew the "fake" lines more accurately, pinpointing the exact pixels.
  • Explainability: This is the big win. While other models just gave a score, FOCA gave a human-readable explanation. It could tell you where the forgery was and why the frequency domain gave it away.
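
For readers who want to see what the first two scores measure, here is a minimal sketch of a detection metric and a localization metric. Accuracy is the metric named above. The summary does not name the localization metric, so intersection-over-union (IoU) is used here as a common choice, not as the paper's confirmed measure.

```python
def accuracy(predicted_labels, true_labels):
    """Share of images whose real/fake label was predicted correctly."""
    if len(predicted_labels) != len(true_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def mask_iou(predicted_mask, true_mask):
    """Intersection-over-union of two binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(predicted_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


# 1 = fake, 0 = real
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
# The prediction marks two pixels, only one of which is truly tampered.
print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
```

Accuracy answers "was the verdict right?", while IoU answers "how well does the highlighted region overlap the truly edited pixels?".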

The Big Picture

FOCA is a game-changer because it stops relying only on what the image looks like and starts analyzing what the image feels like (in terms of data patterns). By combining a powerful AI brain with a "frequency microscope," it can spot the most sophisticated digital forgeries and explain them to us in plain English, helping us trust what we see in the digital world again.
