Original authors: Viola Negroni, Davide Salvi, Daniele Ugo Leonzio, Paolo Bestagini, Stefano Tubaro

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to figure out if two audio recordings were made by the same person or machine. Usually, when we check for "deepfakes" (fake audio made by AI), we ask a simple question: "Is this real or fake?"

But this paper suggests that question is too limited. Instead, the authors propose a new way to think about the problem: "Do these two sounds share the same fingerprint?"

Here is a breakdown of their idea using simple analogies:

1. The Core Idea: The "Forensic Fingerprint"

Every time an AI model (like a text-to-speech generator) creates a voice, it leaves behind tiny, invisible imperfections. Think of these like dust motes in a sunbeam or scratches on a vinyl record. Even if two AI models sound perfect to our ears, they leave different "dust patterns" behind.

The authors call this "Forensic Similarity." Instead of trying to name the specific AI model (which is hard because there are thousands of them), their system just asks: "Do these two audio clips have the same dust pattern?"

Yes: They likely came from the same source.
No: They came from different sources.

2. How the System Works: The "Twin Detective"

The system they built works like a pair of twin detectives who have memorized the same rulebook. This is called a Siamese Network.

Step 1: The Feature Extractor (The Scanner)
Imagine a high-tech scanner that looks at a 4-second audio clip and turns it into a unique "ID card" (a mathematical code). This ID card doesn't say who spoke; it only captures the specific "flaws" or "artifacts" left by the machine that made the sound.
- The team tested four different types of scanners (LCNN, ResNet, RawNet2, AASIST) and found that the LCNN scanner was the best at spotting these subtle flaws.

Step 2: The Similarity Network (The Comparator)
Once the system has two ID cards (one for Audio A and one for Audio B), it passes them to a second, smaller brain. This brain compares the two cards and gives a score from 0 to 1.
- Score near 1: "These two look identical. They share the same forensic fingerprint."
- Score near 0: "These look totally different. They came from different machines."

3. Why This is Better Than Old Methods

Old methods tried to memorize a list of known "bad guys" (specific AI models). If a new, unknown AI model appeared, the old system would get confused and fail.

This new system is like a universal translator. It doesn't need to know the name of the AI model. It just looks at the "style" of the imperfections. Even if the AI model has never been seen before, if it leaves the same "dust pattern" as another clip, the system knows they are related. This makes it much harder for new, sneaky deepfakes to fool the system.

4. Real-World Application: The "Audio Puzzle"

The paper also tested this idea on a different problem: Audio Splicing.
Imagine someone takes a real sentence, cuts out a word, and pastes in a fake word made by AI. The result is a "patchwork" audio file.

The authors used their system to slide a small window across the audio track, comparing each short segment to the one that follows it.

If the "fingerprint" stays the same, it's a smooth, honest recording.
If the "fingerprint" suddenly changes (the score drops), it's a clue that someone cut and pasted something in.

The Results:

Source Verification: The system was very good at telling if two clips came from the same AI, even if that AI wasn't in their training data.
Splicing Detection: It could spot where a fake word was inserted, though it was slightly less accurate than the source verification task.

Summary

In short, this paper introduces a tool that doesn't ask "Is this fake?" but rather "Do these two sounds come from the same factory?" By focusing on the shared "manufacturing defects" of AI voices, the system can spot fakes and find where they were edited, even if the AI making them is brand new and unknown.

Technical Summary: Forensic Similarity for Speech Deepfakes

Problem Statement

The proliferation of AI-based generative technologies has led to a significant increase in synthetic media, particularly speech deepfakes, which pose threats to security, reputation, and public trust. While traditional speech deepfake detection systems focus on binary classification (distinguishing real from fake), these approaches often struggle with generalization. Detectors trained on a specific set of generators frequently fail when encountering audio produced by unseen models, as different generators leave distinct forensic traces (artifacts).

Furthermore, framing the problem solely as "real vs. fake" limits forensic utility. Knowing a signal is synthetic is less actionable than understanding its origin. Consequently, the research community is shifting toward source tracing and source verification. Source verification aims to determine whether two audio signals originate from the same generative model without requiring explicit training on every possible generator. This paper addresses the challenge of verifying source identity in open-set scenarios where test samples may come from previously unseen synthesis methods.
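To make the open-set verification setup concrete, here is a minimal Python sketch of how labeled clip pairs could be built. The clip identifiers, generator names, and the `make_pairs` helper are hypothetical and are not taken from the paper; the sketch only illustrates the task definition, where a pair is labeled 1 when both clips share a generator and the test generators are disjoint from the training ones.

```python
import random
from itertools import combinations


def make_pairs(clips, max_pairs=None, seed=0):
    """Build (clip_a, clip_b, label) triples from (clip_id, generator) tuples.

    label is 1 when both clips come from the same generator, 0 otherwise.
    """
    pairs = [
        (a_id, b_id, int(a_gen == b_gen))
        for (a_id, a_gen), (b_id, b_gen) in combinations(clips, 2)
    ]
    random.Random(seed).shuffle(pairs)
    return pairs if max_pairs is None else pairs[:max_pairs]


# Hypothetical clip ids and generator names, for illustration only.
train_clips = [("t1", "gen_A"), ("t2", "gen_A"), ("t3", "gen_B"), ("t4", "gen_B")]
test_clips = [("u1", "gen_X"), ("u2", "gen_X"), ("u3", "gen_Y")]

train_gens = {gen for _, gen in train_clips}
test_gens = {gen for _, gen in test_clips}

# Open-set condition: no test generator was seen during training.
open_set = train_gens.isdisjoint(test_gens)

test_pairs = make_pairs(test_clips)
```

Because the labels depend only on whether two generators match, the same pairing routine works for generators that never appeared in training, which is what makes the verification framing open-set.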

Methodology

The authors propose a novel framework based on the concept of forensic similarity, adapted from image forensics to the audio domain. The system employs a Siamese-based architecture consisting of two sequential stages:

1. Feature Extractor (Source Tracing Backbone)

The first component is a feature extractor designed to map input speech signals into dense embeddings that capture generator-specific forensic cues. The framework is agnostic to the backbone architecture, allowing various state-of-the-art deepfake detectors to serve as extractors. The authors experiment with four models:

LCNN and ResNet18: Operating on mel-spectrograms.
RawNet2 and AASIST: Operating on raw waveforms, with AASIST additionally using graph attention mechanisms.

These models are initially trained as closed-set source tracers (multi-class classification) on a dataset of known generators. The embeddings are extracted from the last hidden layer before the final classification layer.
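The idea of reusing a classifier's last hidden layer as an embedding can be sketched with a toy model. The `ToySourceTracer` class below is a hypothetical stand-in with random, untrained weights; it is not LCNN, ResNet18, RawNet2, or AASIST, and the layer sizes are arbitrary assumptions. It only shows that the classification head is used for the first training stage, while the hidden activations are what get passed on as $f(x)$.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class ToySourceTracer:
    """Toy closed-set source tracer with a single hidden layer (assumed sizes)."""

    def __init__(self, in_dim, hidden_dim, num_generators, seed=0):
        rng = random.Random(seed)

        def layer(out_dim, inp_dim):
            weights = [[rng.uniform(-0.5, 0.5) for _ in range(inp_dim)]
                       for _ in range(out_dim)]
            return weights, [0.0] * out_dim

        self.hidden = layer(hidden_dim, in_dim)
        self.head = layer(num_generators, hidden_dim)

    def embed(self, features):
        """Activations of the last hidden layer: the embedding f(x)."""
        return relu(linear(features, *self.hidden))

    def classify(self, features):
        """Logits over the known generators (closed-set source tracing)."""
        return linear(self.embed(features), *self.head)


tracer = ToySourceTracer(in_dim=8, hidden_dim=4, num_generators=3)
features = [0.1 * i for i in range(8)]  # placeholder input features

embedding = tracer.embed(features)   # used by the similarity stage
logits = tracer.classify(features)   # used only to train the extractor
```

The embedding size is independent of how many generators the classifier was trained on, so the same vector can be compared against clips from generators outside that set.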

2. Similarity Network

The second component is a lightweight neural network that takes a pair of embeddings, $f(x_A)$ and $f(x_B)$, and outputs a similarity score $S \in [0, 1]$.

Architecture: The network projects embeddings into a lower-dimensional space, concatenates them with their element-wise product, and processes the result through fully connected layers with BatchNorm, Dropout, and LeakyReLU activations.
Output: A Log-Softmax layer produces log-probabilities for two classes: "same forensic trace" (1) or "different forensic traces" (0).
Training Strategy: The feature extractor is first optimized for source tracing. In the second phase, the similarity model is trained in a Siamese configuration. The authors evaluate two strategies: keeping the feature extractor frozen or fine-tuning it jointly with the similarity model.
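The forward pass described above can be sketched in plain Python. This is a rough illustration under several assumptions, not the authors' implementation: the layer sizes are arbitrary, the weights are random and untrained, the projection is taken to be a single shared linear layer, the concatenation is assumed to be the two projected embeddings followed by their element-wise product, and BatchNorm and Dropout are left out because the sketch only covers inference. The score $S$ is assumed to be the exponentiated Log-Softmax output for the "same" class.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_total for v in logits]


class ToySimilarityNet:
    """Inference-only toy; BatchNorm and Dropout are omitted for brevity."""

    def __init__(self, emb_dim, proj_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def layer(out_dim, in_dim):
            weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                       for _ in range(out_dim)]
            return weights, [0.0] * out_dim

        self.proj = layer(proj_dim, emb_dim)          # shared by both branches
        self.hidden = layer(hidden_dim, 3 * proj_dim)
        self.out = layer(2, hidden_dim)               # class 0: different, class 1: same

    def log_probs(self, emb_a, emb_b):
        p_a = linear(emb_a, *self.proj)
        p_b = linear(emb_b, *self.proj)
        product = [u * v for u, v in zip(p_a, p_b)]
        joint = p_a + p_b + product                   # length 3 * proj_dim
        h = leaky_relu(linear(joint, *self.hidden))
        return log_softmax(linear(h, *self.out))

    def score(self, emb_a, emb_b):
        """Similarity S in [0, 1]: probability assigned to the 'same' class."""
        return math.exp(self.log_probs(emb_a, emb_b)[1])


net = ToySimilarityNet(emb_dim=6, proj_dim=4, hidden_dim=5)
emb_a = [0.2, -0.1, 0.4, 0.0, 0.3, -0.5]   # placeholder embeddings
emb_b = [0.1, 0.0, 0.5, -0.2, 0.3, -0.4]
similarity = net.score(emb_a, emb_b)
```

With untrained weights the score carries no meaning; the point is only the data flow from two embeddings to a single value between 0 and 1. Sharing the projection weights between both inputs is what makes the arrangement Siamese.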

3. Application to Splicing Detection

The framework is also applied to audio splicing detection (detecting partially fake speech). Instead of training on spliced tracks, the system analyzes consecutive, non-overlapping windows of a single audio track. It computes a similarity score sequence between adjacent windows. Splicing boundaries are identified as local minima in this sequence where the forensic traces of adjacent segments differ significantly.
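A minimal sketch of this windowed analysis follows. The function names, the 0.5 threshold used to decide when a local minimum is "significant", and the `toy_similarity` placeholder are all assumptions made for illustration; in the actual framework the score for each pair of windows would come from the trained feature extractor and similarity network.

```python
def split_windows(signal, window_len):
    """Consecutive, non-overlapping windows; a trailing partial window is dropped."""
    count = len(signal) // window_len
    return [signal[i * window_len:(i + 1) * window_len] for i in range(count)]


def adjacent_scores(windows, similarity):
    """Similarity between each window and the one that follows it."""
    return [similarity(a, b) for a, b in zip(windows, windows[1:])]


def splice_candidates(scores, threshold=0.5):
    """Indices i where scores[i] is a local minimum below the threshold.

    Index i marks the boundary between window i and window i + 1.
    """
    hits = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if s < threshold and s <= left and s <= right:
            hits.append(i)
    return hits


def toy_similarity(a, b):
    """Placeholder for the learned similarity: compares window means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return 1.0 - min(1.0, abs(mean_a - mean_b))


# Toy track: samples from "source 0", a pasted-in segment from "source 1",
# then "source 0" again. Values stand in for forensic traces, not real audio.
track = [0.0] * 8 + [1.0] * 4 + [0.0] * 4
windows = split_windows(track, window_len=4)
scores = adjacent_scores(windows, toy_similarity)
boundaries = splice_candidates(scores)
```

In this toy track the inserted segment occupies the third window, so the score drops at both of its edges and both boundaries are reported. No spliced training data is involved; the detector only needs a pairwise similarity function.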

Key Contributions

Introduction of Forensic Similarity to Audio: The paper adapts the forensic similarity concept from image forensics to speech deepfake analysis, enabling the determination of whether two audio segments share the same underlying forensic traces.
Source Verification Framework: The authors deploy a Siamese-based system for source verification, demonstrating improvements over prior reference-based approaches (e.g., [25]) by performing direct one-to-one comparisons.
Splicing Detection Alternative: The paper proposes and evaluates a pairwise analysis approach for splicing detection that does not require training on spliced data, offering a potential alternative to existing segment-level classification methods.

Experimental Results

The framework was evaluated on several datasets, including MLAAD (for in-domain open-set testing), ASVspoof 2019, TIMIT-TTS, and PartialSpoof (for splicing detection).

Feature Extractor Performance: Among the tested backbones, LCNN and ResNet18 performed best. Specifically, LCNN with an unfrozen training strategy (fine-tuned during the second phase) achieved the best results, with an Equal Error Rate (EER) of 10.5% and an Area Under the Curve (AUC) of 95.7% on unseen generators in the MLAAD test set.
Similarity Model Efficacy: The proposed similarity network significantly outperformed baseline comparison methods such as Euclidean distance, cosine similarity, and contrastive learning. On the MLAAD open-set test, the proposed model achieved an average EER of 22.4% across datasets, compared to roughly 26-27% for the baselines.
Generalization: The system demonstrated robustness to unseen generators, maintaining high performance even when tested on models not present in the training set.
Splicing Detection:
- The system showed robustness to short input windows (0.5s), achieving an AUC of 84.1% for source verification on short segments.
- On the PartialSpoof dataset, the framework achieved an AUC of 80% on the development set (seen generators) and 69% on the evaluation set (unseen generators). The authors note that the model exhibits conservative behavior, rarely misclassifying genuine signals as spliced (low false positive rate), which is desirable for forensic applications.
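For readers unfamiliar with the two metrics reported above, the following sketch shows one common way to compute AUC and an approximate EER from similarity scores and same/different labels. It is a generic illustration, not the authors' evaluation code, and the scores and labels are made-up toy values. The EER is approximated here by the ROC point where the false positive rate and the false negative rate are closest.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR), sweeping the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def eer(points):
    """Approximate EER: average of FPR and FNR where the two are closest."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2


# Toy similarity scores; label 1 means the pair shares a source.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]

points = roc_points(toy_scores, toy_labels)
auc_value = auc(points)
eer_value = eer(points)
```

A lower EER and a higher AUC both indicate that same-source pairs receive consistently higher scores than different-source pairs.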

Significance and Claims

The paper claims that treating deepfake forensics as a similarity-driven problem rather than a binary classification task offers a scalable and robust solution for open-set source verification. By focusing on the relative consistency of forensic traces between two samples, the approach avoids the need to explicitly train on every possible generator, thereby improving adaptability to new synthesis technologies.

The authors highlight that while the system is effective for source verification, its application to splicing detection is promising but requires further refinement. They acknowledge limitations, such as performance drops with very short signal durations and challenges with non-English generators or languages with scarce training data. The work suggests that forensic similarity provides a flexible tool for digital audio forensics, capable of generalizing to previously unseen traces without requiring prior exposure to them during training.

Forensic Similarity for Speech Deepfakes