Imagine you are a detective trying to work out whether two photos show the same person, except in one of them the person is wearing a disguise, standing in the dark, or looking away from the camera. You have a super-smart AI assistant (a Multimodal Large Language Model, or MLLM) that can look at these photos and tell you, "Yes, these are the same person" or "No, they are different."
But here's the catch: You don't just want a "Yes" or "No." You want the AI to explain why. You want it to say, "They look the same because they both have a crooked nose and a scar on the chin."
This paper is like a report card for that AI assistant, specifically testing how good it is at giving those explanations when the photos are messy and difficult (like surveillance footage).
Here is the breakdown of what the researchers found, using some everyday analogies:
1. The "Confident but Wrong" Problem
The researchers tested the AI on a very hard surveillance benchmark called IJB-S, which is full of photos where people are turning their heads, squinting, or standing in poor lighting.
- The Scenario: The AI looks at two photos of the same person (one facing forward, one in profile) and correctly says, "These are the same person!"
- The Problem: When asked why, the AI starts making things up. It might say, "They have the same ear shape," even though the ear isn't visible in one of the photos. It's like a student who gets the math answer right but writes down a completely made-up formula to get there. (A code sketch of this question-and-explain setup follows this list.)
- The Metaphor: Imagine a tour guide who knows the city perfectly but, when asked to describe a building they can't see clearly, invents a "famous blue door" that doesn't exist. The guide is right about the location, but wrong about the details. This is called hallucination.
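To make the setup concrete, here is a minimal sketch of the verification-plus-explanation protocol being graded. `query_mllm` is a hypothetical placeholder for whatever MLLM client you use, and the prompt wording is my assumption, not the paper's; the point is the two-part question: verdict first, then visible evidence.

```python
def query_mllm(prompt: str, image_paths: list[str]) -> str:
    """Hypothetical placeholder: send a prompt plus images to an MLLM
    and return its text response. Wire up your own client here."""
    raise NotImplementedError

def verify_with_explanation(img_a: str, img_b: str) -> str:
    # Ask for a verdict AND a justification in one shot.
    prompt = (
        "Are these two photos of the same person? Answer 'Match' or "
        "'No match', then justify your answer using only facial features "
        "that are actually visible in BOTH images."
    )
    return query_mllm(prompt, [img_a, img_b])

# The failure mode described above: the verdict can be correct while the
# justification cites features (e.g. "same ear shape") that are not
# visible in one of the photos -- so the explanation needs its own check.
```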
2. The "Cheat Sheet" Experiment
The researchers wondered: "What if we give the AI a cheat sheet?"
They tried feeding the AI not just the photos, but also the scores and decisions from traditional face recognition systems (the old-school, very accurate math-based systems).
- The Result: The cheat sheet helped the AI get the final verdict right more often. It was better at saying "Match" or "No Match."
- The Twist: Even with the cheat sheet, the AI's explanation didn't get any more honest. It still made up details to justify its answer. It was like giving a student the answer key; they might get the grade right, but their essay explaining the solution is still full of lies. (A sketch of how the score gets folded into the prompt follows this list.)
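Here is a rough sketch of what the "cheat sheet" looks like in practice, assuming the traditional system produces an embedding-similarity score. The prompt wording and the 0.35 threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Similarity score from a conventional face recognition model."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def prompt_with_cheat_sheet(score: float, threshold: float = 0.35) -> str:
    """Fold the traditional system's score and tentative decision into
    the MLLM prompt (threshold is an illustrative assumption)."""
    tentative = "match" if score >= threshold else "non-match"
    return (
        f"A conventional face recognition system gave this pair a similarity "
        f"of {score:.3f} (decision threshold {threshold:.2f}, i.e. a "
        f"tentative {tentative}). Looking at the two photos and this score, "
        "are they the same person? Explain using only features visible in "
        "both images."
    )
```

Restating the paper's finding in these terms: the extra signal improves the final Match/No-match verdict, but the free-text justification stays just as prone to invented details.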
3. The "Lie Detector" for Explanations
Since the AI's explanations are often unreliable, the researchers built a new tool to measure them. They didn't just ask, "Is the explanation true?" (checking every free-text claim against the images by hand doesn't scale). Instead, they asked, "Does this explanation feel like it belongs to a real match or a fake one?"
- The Analogy: Think of it like a polygraph test for text.
- The researchers taught the system what "honest" explanations look like (based on thousands of examples where the AI knew the truth).
- Then, they fed it new explanations and calculated a Likelihood Ratio: roughly, how probable the explanation is under the "honest" pattern divided by how probable it is under the "hallucinated" pattern.
- If the explanation sounds like the "honest" pattern, the score goes up. If it sounds like the "hallucinated" pattern, the score goes down.
- This allows them to judge the trustworthiness of the explanation separately from whether the AI got the final answer right. (A toy version of this scorer is sketched below.)
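Here is that toy version, under two loud assumptions: each explanation has already been turned into a feature vector (the paper's featurization is not reproduced here), and each class of explanations is modeled with a simple diagonal Gaussian. Real calibration would be more careful, but the likelihood-ratio mechanics are the same.

```python
import numpy as np

class ExplanationLR:
    """Score an explanation by log P(features | real-match explanations)
    minus log P(features | fake-pair explanations)."""

    def fit(self, genuine: np.ndarray, impostor: np.ndarray) -> "ExplanationLR":
        # One diagonal Gaussian per class of explanation features.
        self.mu_g, self.sd_g = genuine.mean(0), genuine.std(0) + 1e-9
        self.mu_i, self.sd_i = impostor.mean(0), impostor.std(0) + 1e-9
        return self

    @staticmethod
    def _log_gauss(x: np.ndarray, mu: np.ndarray, sd: np.ndarray) -> np.ndarray:
        # Diagonal-Gaussian log-density, summed over feature dimensions.
        return (-0.5 * ((x - mu) / sd) ** 2 - np.log(sd * np.sqrt(2 * np.pi))).sum(-1)

    def llr(self, x: np.ndarray) -> np.ndarray:
        # Positive: reads like an explanation written for a real match.
        # Negative: reads like one written for a fake (hallucination-prone) pair.
        return self._log_gauss(x, self.mu_g, self.sd_g) - self._log_gauss(x, self.mu_i, self.sd_i)

# Toy demo with 1-D features where "honest" explanations score higher:
rng = np.random.default_rng(0)
scorer = ExplanationLR().fit(rng.normal(1.0, 0.3, (500, 1)),
                             rng.normal(0.0, 0.3, (500, 1)))
print(scorer.llr(np.array([[0.9], [0.1]])))  # first positive, second negative
```

Note the design choice this illustrates: the scorer never looks at whether the verdict was right; it only asks whether the explanation statistically resembles ones produced when the evidence was real.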
4. The Big Trade-off
The paper highlights a frustrating reality in modern AI:
- Old Systems (The Math Experts): They are incredibly accurate at saying "Yes/No" but are silent. They give you a number, not a story.
- New Systems (The Storytellers): They are great at telling a story and explaining things, but they often lie about the details to make the story sound good.
The Takeaway
The main message is: Don't trust the AI's story just because it got the answer right.
If you are using AI for security or legal reasons (like identifying a suspect), you cannot rely on its natural language explanation as proof. The AI might be right about the identity but wrong about why it thinks so. The researchers suggest we need new ways to test if an AI's explanation is actually grounded in reality, not just a clever-sounding guess.
In short: The AI is a great guesser, but a terrible witness. It can point out the suspect, but its testimony in court would likely be full of made-up details.