Explainable embeddings with Distance Explainer

This paper introduces Distance Explainer, a novel post-hoc XAI method that adapts saliency-based techniques to explain the similarity or dissimilarity between data points in embedded vector spaces. It assigns attribution values through selective masking and distance-ranked filtering, producing robust, local explanations.

Christiaan Meijer, E. G. Patrick Bos

Published 2026-03-26

Imagine you have a giant, invisible library where every book, photo, and song is assigned a specific seat based on how similar it is to everything else. This is what AI calls an "embedded space."

In this library:

  • A picture of a bee sits right next to a picture of a fly.
  • A picture of a bee sits far away from a picture of a car.
  • A sentence saying "a bee on a flower" sits right next to the picture of the bee.

The problem? We humans can't see why the AI decided to put the bee next to the fly. The AI just knows they are "close" in its mathematical map, but it can't tell us which parts of the bee (the wings? the stripes?) made it look like a fly.

This paper introduces a new tool called Distance Explainer to solve this mystery. Here is how it works, using simple analogies.

The Problem: The "Black Box" Distance

Think of the AI as a judge in a contest. It looks at two contestants (say, a photo of a bee and a photo of a fly) and says, "These two are 90% similar."
But if you ask, "Why?" the AI usually just shrugs. It's like a judge who gives a score but won't explain which specific move earned the points.

The Solution: The "Masking Game"

The authors took an existing trick called RISE (which was used to explain single images) and adapted it for comparing two things. They call their new method Distance Explainer.

Here is the step-by-step process, imagined as a game of "Hide and Seek":

  1. The Setup: You have your "Target" (the bee photo) and a "Reference" (the fly photo).
  2. The Game: The AI plays a game where it randomly covers up (masks) parts of the Target photo with a black blanket.
    • Example: It covers the bee's wings.
    • Example: It covers the bee's stripes.
    • Example: It covers the bee's eyes.
  3. The Check: After covering a part, the AI asks: "Does this covered-up bee still look like the fly?"
    • If you cover the wings, the bee suddenly looks very different from the fly. The distance between them grows huge.
    • If you cover the stripes, the bee still looks a bit like the fly. The distance doesn't change much.
  4. The Scorecard: The AI repeats this thousands of times, covering different random spots.
    • It keeps a tally: "Every time we covered the wings, the similarity dropped drastically."
    • "Every time we covered the stripes, the similarity stayed the same."
  5. The Result: The AI draws a heat map (a picture with red and blue colors).
    • Red areas are the parts that made the two images dissimilar (if you hide them, they look more alike).
    • Blue areas are the parts that made them similar (if you hide them, they look less alike).
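The masking game above can be sketched in a few lines of NumPy. This is a minimal, illustrative version, not the authors' implementation: `embed` is a stand-in for a real embedding model, the mask grid is upsampled with nearest-neighbour blocks (the original RISE uses smooth bilinear upsampling), and all names are assumptions made for this sketch.

```python
import numpy as np

def embed(image):
    # Stand-in for a real embedding model (e.g. an image encoder);
    # flattening keeps the sketch runnable without any model weights.
    return image.reshape(-1).astype(float)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def distance_saliency(target, reference, n_masks=500, p_keep=0.5, cells=4, seed=0):
    """RISE-style saliency for a distance: randomly mask `target`,
    re-embed it, and attribute the change in distance to the hidden pixels."""
    rng = np.random.default_rng(seed)
    ref_emb = embed(reference)
    base = cosine_distance(embed(target), ref_emb)
    h, w = target.shape
    saliency = np.zeros((h, w))
    counts = np.zeros((h, w))
    for _ in range(n_masks):
        # Coarse random grid, upsampled to image size with block repeats.
        coarse = (rng.random((cells, cells)) < p_keep).astype(float)
        mask = np.kron(coarse, np.ones((h // cells, w // cells)))
        d = cosine_distance(embed(target * mask), ref_emb)
        hidden = 1.0 - mask
        saliency += hidden * (d - base)  # positive: hiding this pushed them apart
        counts += hidden
    # Average the tallied distance changes per pixel: the "scorecard".
    return saliency / np.maximum(counts, 1.0)
```

In this sketch, a large positive score means the distance grew whenever that region was hidden, i.e. the region was contributing to the similarity between the two inputs.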

Why This is Special

Previous tools could only explain why an AI thought, "This is a bee." This new tool explains relationships. It answers: "Why does the AI think this bee is closer to a fly than to a car?"

It works like a detective who doesn't just look at the crime scene, but compares two suspects side-by-side to see exactly what features make them look alike or different.

The "Mirror" Trick

The authors added a clever twist called the "Mirror Mode."
Imagine you are trying to hear a whisper in a noisy room. If you listen to the noise and subtract it, you hear the whisper better.

  • The AI looks at the parts that make the images very different (the "noise").
  • It also looks at the parts that make them very similar.
  • By comparing these two lists, it cancels out the "noise" and highlights the true, important features. This makes the explanation much clearer and less fuzzy.
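A toy illustration of this noise-cancelling intuition (not the paper's exact formulation): if the random masking process adds the same noise to the "makes them similar" map and the "makes them different" map, subtracting one from the other leaves only the true signal. The arrays below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Masking noise that shows up in both views of the explanation.
noise = rng.normal(0.0, 1.0, (8, 8))

# A truly important region (the "whisper" we want to hear).
signal = np.zeros((8, 8))
signal[2:4, 2:4] = 3.0

# Map from the "distance grows when hidden" side: signal plus noise.
sal_similar = signal + noise
# Mirrored map from the "distance shrinks when hidden" side: only noise here.
sal_dissimilar = noise

# Subtracting the mirrored map cancels the shared noise, leaving the signal.
denoised = sal_similar - sal_dissimilar
```

In this toy setup the subtraction recovers the important region exactly; in practice the cancellation is only partial, but the explanation becomes noticeably sharper.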

What They Found

They tested this on:

  • Images vs. Images: Showing that the AI knows a bee and a fly are similar because of their wings, but different because of their stripes.
  • Images vs. Text: Showing that if you show a picture of a bee and type "a fly," the AI knows exactly which parts of the picture contradict the text.
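The image-vs-text case works because the masking game never looks inside the reference; it only needs the reference's embedding vector. A hedged sketch of that idea, with toy seeded vectors standing in for real CLIP-style encoders that map images and text into one shared space:

```python
import numpy as np

def toy_encoder(seed_key, dim=64):
    # Stand-in for modality-specific encoders (image or text) that map
    # into one shared embedding space, as in CLIP-style models; real
    # encoders would be neural networks, not seeded random vectors.
    rng = np.random.default_rng(seed_key)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Target: an image embedding (this is what gets masked and re-embedded).
image_emb = toy_encoder(0)
# Reference: a text embedding, e.g. for the caption "a fly".
text_emb = toy_encoder(1)

# Cosine distance between unit vectors; the masking loop only ever
# compares the masked target's embedding against this fixed reference.
distance = 1.0 - float(np.dot(image_emb, text_emb))
```

Because only the reference's vector enters the distance, the same explanation loop applies unchanged whether the reference is a second image or a sentence.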

The Takeaway

This tool is like handing us a pair of glasses for looking inside the AI. Suddenly, we can see why the AI's internal map is arranged the way it is. It doesn't just tell us the distance between two points; it tells us which features are pulling them together and which are pushing them apart.

This is a huge step forward for Explainable AI (XAI) because it helps us trust complex models (like those used in medical diagnosis or self-driving cars) by showing us the specific reasons behind their decisions, rather than just the final result.
