Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings
This paper introduces a post-hoc framework for explaining, verifying, and aligning the semantic hierarchies encoded in vision-language model embeddings. The analysis reveals that while image encoders offer superior discriminative power, text encoders align better with human taxonomies, exposing a trade-off between zero-shot accuracy and ontological plausibility.