Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

This paper introduces a post-hoc framework to explain, verify, and align the semantic hierarchies in vision-language model embeddings. It reveals that while image encoders offer superior discriminative power, text encoders align better with human taxonomies, exposing a trade-off between zero-shot accuracy and ontological plausibility.

Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Published 2026-03-31

Imagine you have a super-smart robot librarian named CLIP. This robot has read millions of books and seen millions of photos. If you show it a picture of a cat, it knows it's a cat. If you show it a picture of a truck, it knows it's a truck. It's incredibly good at matching pictures to words.

However, there's a problem: We don't really know how this robot organizes its brain.

Does it think a "cat" is closer to a "dog" because they are both pets? Or does it think they are closer because they both have fur? Does it know that a "truck" is a type of "vehicle," or does it just see them as two unrelated things that happen to have wheels?

This paper is like a detective agency hired to peek inside the robot's brain, figure out how it sorts things, check if its sorting makes sense to humans, and then gently reorganize its brain if it's getting things mixed up.

Here is the story of their investigation, broken down into three simple steps:

1. The Detective Work: "What is your family tree?"

The researchers asked the robot to sort a bunch of items (like cars, animals, and birds) into a family tree.

  • The Method: They took the robot's "mental notes" (embeddings) for each item and grouped the most similar ones together, step by step, until everything hung off one tree (a toy sketch of this step, and of the naming step below, follows this list).
  • The Naming: Once the robot grouped "cats" and "dogs" together, the researchers asked, "What do you call this group?" They used a giant dictionary (like a digital encyclopedia) to find the best human word for that group.
  • The Result: They built a Family Tree of the robot's thoughts.
    • Analogy: Imagine the robot puts all the "furry things" in one box and all the "wheeled things" in another. The researchers label these boxes "Animals" and "Vehicles."
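
Here is a minimal sketch of what those two steps can look like in code. This is my illustration, not the paper's exact pipeline: the embeddings are random stand-ins for real CLIP outputs, scipy's average-linkage clustering stands in for whatever grouping method the authors use, and WordNet plays the role of the "giant dictionary" that names each group.

```python
# A toy sketch (not the paper's exact pipeline): cluster stand-in
# "embeddings" into a tree, then name each top-level group with WordNet.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

labels = ["cat", "dog", "frog", "car", "truck", "bus"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 512))  # stand-in for CLIP class embeddings

# Group the most similar items first: cosine distance + average linkage.
tree = to_tree(linkage(pdist(embeddings, metric="cosine"), method="average"))

def leaves(node):
    """Collect the item names sitting under a tree node."""
    if node.is_leaf():
        return [labels[node.id]]
    return leaves(node.left) + leaves(node.right)

def name_group(words):
    """Name a group by the most specific WordNet ancestor its members share,
    e.g. {"cat", "dog"} -> a carnivore/animal synset."""
    synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in words]
    common = synsets[0]
    for s in synsets[1:]:
        common = common.lowest_common_hypernyms(s)[0]
    return common.name()

for side in (tree.left, tree.right):
    group = leaves(side)
    print(group, "->", name_group(group))   # the two top-level "boxes"
```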

2. The Report Card: "Does your logic make sense?"

Now that they had the robot's family tree, they compared it to a Human Family Tree (a standard, logical way humans categorize things, like a biology textbook). One simple way to score that agreement is sketched after the list below.

They found some surprising things:

  • The "Eye" vs. The "Brain":
    • When the robot looked at pictures (Image Encoder), it was great at telling a specific dog from a specific cat. It was a sharp-eyed detective. But its family tree was a bit messy: it might group a "frog" with a "car" because both happen to look "green" or "boxy" on the surface.
    • When the robot looked at words (Text Encoder), it was a bit worse at spotting the exact picture, but its family tree was much more logical. It knew that a "cat" is an "animal" and a "truck" is a "vehicle" because that's how the words are defined in language.
  • The Trade-off: The robot faces a dilemma. If it tries to be super accurate at identifying specific items (discrimination), it often messes up the big-picture logic (plausibility). If it tries to follow human logic, it sometimes gets confused about the specific details.
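
How do you grade a family tree? One simple approach (my own illustration, not necessarily the paper's metric) is to ask whether items the model's tree puts close together are also close in the human taxonomy. The sketch below rank-correlates the two sets of pairwise distances; the `groups` labels are a toy two-class "textbook".

```python
# A toy sketch (my own metric, not necessarily the paper's): score how well
# a model's tree matches a human taxonomy by rank-correlating distances.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

labels = ["cat", "dog", "frog", "car", "truck", "bus"]
groups = [0, 0, 0, 1, 1, 1]                     # 0 = animal, 1 = vehicle
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 512))          # stand-in for encoder outputs

# Distances implied by the model's family tree (cophenetic distances)...
model_dists = cophenet(linkage(pdist(embeddings, metric="cosine"),
                               method="average"))

# ...versus distances implied by the textbook: siblings are close (1),
# items from different branches are far (2).
human = np.array([[0 if i == j else (1 if groups[i] == groups[j] else 2)
                   for j in range(6)] for i in range(6)])
human_dists = squareform(human)

# Near +1 means the model's tree agrees with human logic; near 0 means
# it sorts the world by some other criterion entirely.
print(spearmanr(model_dists, human_dists).correlation)
```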

3. The Therapy Session: "Let's fix your brain."

The researchers didn't just criticize the robot; they gave it a therapy session (called "Post-hoc Alignment").

  • The Problem: The robot's brain was slightly "off-kilter." It thought "Frogs" and "Cars" were cousins.
  • The Fix: They used a tool called UMAP (think of it as a magical map-maker that flattens a huge, tangled space into a simpler map while keeping neighbors together). They showed the robot a "Target Map" (how humans should think) and asked it to stretch and squeeze its mental map to match the Target Map (a toy sketch follows this list).
  • The Result: They successfully taught the robot to reorganize its thoughts so that "Cats" and "Dogs" were closer together, and "Cars" and "Trucks" were closer together, without making the robot forget how to recognize the pictures.
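
Here is a minimal sketch of the "stretch and squeeze" idea. It uses umap-learn's supervised mode (pip install umap-learn) as a stand-in for the paper's alignment step: passing target labels `y` tells UMAP to pull same-group points together while preserving the original neighborhoods. The data, labels, and dimensions are all toy assumptions, and a flat two-class target stands in for the richer taxonomy the paper aligns to.

```python
# A toy sketch of taxonomy-guided alignment via supervised UMAP.
import numpy as np
import umap

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(600, 512))   # stand-in for CLIP image embeddings
groups = rng.integers(0, 2, size=600)      # toy target: 0 = animal, 1 = vehicle

# Supervised UMAP optimizes the layout to respect both the original
# neighborhoods (so recognition ability survives) and the target grouping.
mapper = umap.UMAP(n_components=32, random_state=0)
aligned = mapper.fit_transform(embeddings, y=groups)
print(aligned.shape)                       # (600, 32)
```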

The Big Takeaway

This paper tells us that AI isn't just a magic black box. We can actually peek inside, see how it's thinking, and compare it to how humans think.

  • The Good News: We can fix AI's logic. We can teach it to organize the world the way we do, making it more trustworthy and easier to understand.
  • The Bad News: There is a tug-of-war. The more an AI tries to be perfect at spotting tiny details, the more it might lose the big picture of how things relate to each other.

In short: The researchers built a tool to translate the robot's "alien" way of thinking into human language, checked if it was sane, and if it wasn't, they gently nudged it back to sanity. This helps us build AI that is not only smart but also makes sense to us.
