PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Imagine you are trying to teach a computer to understand the world through pictures and words. You show it a photo of a dog and say, "This is a dog." Then you show it a photo of a dog in a car and say, "This is a dog in a car."

Current AI models (like the famous CLIP) are great at matching pictures to words, but they struggle with two specific types of logic:

The Family Tree (Hierarchy): Knowing that a dog is a type of mammal, which is a type of animal.
The Recipe (Compositionality): Knowing that a dog and a car are two different things that can be mixed together to make a new scene.

Think of it like this:

Hierarchy is like a family tree. It's deep and branching.
Compositionality is like a recipe book. It's about mixing ingredients (concepts) together.

The problem is that most AI models try to squeeze both of these very different structures into a single, flat "box" (a standard mathematical space). It's like trying to fit a complex 3D tree and a flat spreadsheet into the same shoebox. The tree gets squished, or the spreadsheet gets distorted.

Enter PHyCLIP: The "Multi-Drawer" Solution

The authors of this paper propose a new model called PHyCLIP. Instead of one big box, they built a filing cabinet with many separate drawers.

Here is the simple breakdown of how it works:

1. The Drawers are "Hyperbolic" (The Family Tree Drawers)

Imagine one drawer in your cabinet is shaped like a funnel or a tree.

In this specific drawer, the AI organizes concepts like animals.
At the bottom of the funnel is "Animal."
As you go up the sides, it branches out into "Mammal," then "Dog," then "Poodle."
This shape (called Hyperbolic space) is perfect for family trees because it has plenty of room at the top to hold all the specific details without them getting crowded.
The Magic: PHyCLIP has many of these funnel-drawers. One drawer is for animals, one for vehicles, one for food, etc.

2. The Cabinet is "Boolean" (The Mixing Drawer)

Now, how do you combine a dog (from the animal drawer) and a car (from the vehicle drawer)?

In the old models, mixing them was messy.
In PHyCLIP, the "cabinet" itself works like a light switch panel (a Boolean algebra).
If you want "Dog," you flip the switch for the Animal drawer.
If you want "Car," you flip the switch for the Vehicle drawer.
If you want "Dog in a Car," you flip both switches.
The math used to measure distance between these switches is called an $\ell_1$-product metric. Think of it as simply adding up the distances in each drawer. If the "Dog" part is far from "Cat" in the animal drawer, and the "Car" part is far from "Bike" in the vehicle drawer, the total distance is just the sum of those two distances.

Why is this better?

The Old Way (Single Space):
Imagine trying to draw a map of the world on a flat piece of paper. If you try to show the hierarchy of countries (World > Continent > Country > City) and also show how cities combine (City A + City B = A Trip), the map gets distorted. Cities that are far apart might look close, or family trees get squished.

The PHyCLIP Way:
PHyCLIP says, "Let's keep the family trees in their own special 3D funnels, and let's just add up the scores when we mix them."

Hierarchy: The "Dog" stays neatly organized under "Mammal" inside the Animal funnel.
Composition: When you say "Dog in a Car," the model doesn't squish them together. It just activates the Animal funnel and the Vehicle funnel at the same time.

The Real-World Result

The paper tested this on thousands of images and texts.

Classification: It got better at guessing what an image is (e.g., distinguishing a specific breed of dog from a similar one).
Retrieval: It got better at finding the right picture when you type "a dog in a car" or "a cat on a bike."
Understanding: It learned that "Dog" and "Car" are separate ideas that can be combined, rather than getting confused and thinking "Dog-Car" is a new, weird animal.

The Bottom Line

PHyCLIP is like giving the AI a specialized toolkit instead of a Swiss Army Knife. It uses a funnel-shaped drawer to organize family trees (hierarchy) and a switchboard to mix and match different families (compositionality). By separating these tasks into different mathematical "rooms" and just adding them up, the AI understands the world much more clearly and accurately.

1. Problem Statement

Vision-language models (VLMs), such as CLIP, have achieved remarkable success in mapping images and text to a shared embedding space. However, they struggle to simultaneously represent two distinct semantic structures inherent in natural language and visual data:

Hierarchy (Taxonomy): The "is-a" relationships within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal). This structure is tree-like and grows exponentially with depth.
Compositionality: The conjunction of concepts from different families (e.g., "a dog in a car" combines animals and transportation). This structure resembles a Boolean algebra or a lattice.

The Dilemma:

Euclidean Space: Struggles to embed tree-like hierarchies without high distortion but handles vector addition (composition) well.
Hyperbolic Space: Efficiently captures tree-like hierarchies due to its exponential volume growth but lacks a canonical operation for composition. Standard hyperbolic addition (Möbius addition) does not align with Boolean logic or standard vector addition, making it difficult to represent "A and B" where A and B belong to different taxonomic trees.
Existing Solutions: Previous works either focus solely on hierarchy (using hyperbolic space) or use mixed-curvature spaces (combining Euclidean and hyperbolic components), which often lack a theoretically grounded mechanism for cross-family composition.
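To make the "exponential volume growth" argument concrete, here is a small numerical sketch. It relies on a standard fact that is not specific to this paper: in the hyperbolic plane of curvature $-1$, a circle of radius $r$ has circumference $2\pi\sinh(r)$, versus $2\pi r$ in the Euclidean plane. The function names and the branching factor of 3 are illustrative choices only.

```python
import math


def nodes_at_depth(branching: int, depth: int) -> int:
    """Number of nodes at a given depth of a complete tree."""
    return branching ** depth


def euclidean_circumference(r: float) -> float:
    """Circumference of a circle of radius r in the flat plane."""
    return 2.0 * math.pi * r


def hyperbolic_circumference(r: float) -> float:
    """Circumference of a circle of radius r in the hyperbolic plane (curvature -1)."""
    return 2.0 * math.pi * math.sinh(r)


if __name__ == "__main__":
    # Compare how much "room" each geometry offers at radius = depth
    # against the number of tree nodes that must be placed there.
    for depth in range(1, 6):
        print(
            depth,
            nodes_at_depth(3, depth),
            round(euclidean_circumference(depth), 1),
            round(hyperbolic_circumference(depth), 1),
        )
```

The node count and the hyperbolic circumference both grow exponentially with depth, while the Euclidean circumference grows only linearly. This is the intuition behind the claim that Euclidean embeddings of deep trees become crowded.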

2. Methodology: PHyCLIP

The authors propose PHyCLIP, a model that unifies these structures by embedding data into an $\ell_1$-product metric space of hyperbolic factors.

Core Architecture

Factorized Embedding Space: Instead of a single hyperbolic space, the model uses a Cartesian product of $k$ hyperbolic factors: $(\mathbb{H}^d)^k$.
$\ell_1$-Product Metric: The distance between two embeddings $X = (x^{(1)}, \dots, x^{(k)})$ and $Y = (y^{(1)}, \dots, y^{(k)})$ is defined as the sum of distances in each factor:
$d_1(X, Y) = \sum_{i=1}^k d_{\mathbb{H}^d}(x^{(i)}, y^{(i)})$
Semantic Mapping:
- Intra-family Hierarchy: Each individual hyperbolic factor $\mathbb{H}^d_i$ is dedicated to a specific concept family (e.g., animals, vehicles). Within a factor, the hyperbolic geometry naturally encodes the taxonomic tree (e.g., dog is closer to the root than poodle).
- Cross-family Composition: The $\ell_1$-product metric acts analogously to a Boolean algebra. If a concept is present, its corresponding factor is "activated" (moved away from the origin); if absent, it remains near the origin. The composition of "dog and car" activates both the "animal" factor and the "vehicle" factor simultaneously.
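The $\ell_1$-product distance and the "activation" picture can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the Poincaré ball model with curvature $-1$ for each factor (the paper may use a different hyperbolic model or a learned curvature), and the points `dog`, `car`, and `dog_in_car` are hand-picked, hypothetical embeddings with $k = 2$ factors of dimension $d = 2$.

```python
import math
from typing import Sequence

Point = Sequence[float]  # one point inside the open unit ball of a single factor


def poincare_distance(x: Point, y: Point) -> float:
    """Geodesic distance in the Poincare ball model (curvature -1)."""
    sq_norm = lambda v: sum(c * c for c in v)
    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * sq_norm(diff) / denom)


def l1_product_distance(X: Sequence[Point], Y: Sequence[Point]) -> float:
    """Sum of per-factor hyperbolic distances, i.e. d_1(X, Y)."""
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of factors")
    return sum(poincare_distance(x, y) for x, y in zip(X, Y))


# Hypothetical embeddings: factor 0 plays the "animal" role, factor 1 the "vehicle" role.
ORIGIN = (0.0, 0.0)
dog = [(0.5, 0.0), ORIGIN]            # only the animal factor is activated
car = [ORIGIN, (0.5, 0.0)]            # only the vehicle factor is activated
dog_in_car = [(0.5, 0.0), (0.5, 0.0)]  # both factors are activated

if __name__ == "__main__":
    print(l1_product_distance(dog, dog_in_car))  # ln(3), from the vehicle factor only
    print(l1_product_distance(dog, car))         # 2 * ln(3), both factors differ
```

In this toy setup the distance from "dog" to "car" equals the distance from "dog" to "dog in a car" plus the distance from "dog in a car" to "car", because each factor contributes independently to the sum.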

Theoretical Foundation

The paper provides theoretical proofs supporting this design:

Trees to Hyperbolic: Metric trees admit low-distortion embeddings into hyperbolic spaces (Theorem 1).
Boolean Lattices to $\ell_1$: Finite Boolean lattices (representing compositionality) embed isometrically into an $\ell_1$-product space but cannot be isometrically embedded into a single hyperbolic space (Proposition 1).
Product of Trees: The $\ell_1$-product of metric trees can be quasi-isometrically embedded into an $\ell_1$-product of hyperbolic factors (Theorem 2).
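The link between Boolean lattices and $\ell_1$ distances can be checked by brute force on a small case. The sketch below uses the textbook construction (subsets mapped to 0/1 indicator vectors), which may differ from the construction used in the paper's Proposition 1. The three family names are hypothetical, and the lattice distance is taken to be the size of the symmetric difference, i.e. the number of elements one must add or remove to turn one subset into the other.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

# Hypothetical concept families; any finite set would do.
FAMILIES = ("animal", "vehicle", "food")


def indicator(subset: FrozenSet[str]) -> Tuple[int, ...]:
    """Map a subset of FAMILIES to its 0/1 indicator vector."""
    return tuple(1 if family in subset else 0 for family in FAMILIES)


def l1(u: Tuple[int, ...], v: Tuple[int, ...]) -> int:
    """Plain l1 distance between two integer vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def all_subsets(items: Tuple[str, ...]) -> List[FrozenSet[str]]:
    """Every element of the Boolean lattice over `items`."""
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


subsets = all_subsets(FAMILIES)

# Lattice distance = size of the symmetric difference (a ^ b for frozensets).
is_isometric = all(
    len(a ^ b) == l1(indicator(a), indicator(b))
    for a in subsets
    for b in subsets
)

if __name__ == "__main__":
    print(len(subsets), is_isometric)
```

Every pair of subsets is checked, so `is_isometric` being true confirms that the indicator map preserves distances exactly for this three-family lattice.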

Training Objective

PHyCLIP is trained using a combination of two losses on the GRIT dataset (Grounded Image-Text Pairs):

Contrastive Loss ($L_{cont}$): Based on InfoNCE, it pulls matching image-text pairs closer and pushes non-matching pairs apart using the $\ell_1$-product distance.
Entailment Loss ($L_{ent}$): Uses hyperbolic entailment cones. For a pair where $X \preceq Y$ (e.g., an image is an instance of its caption), the embedding of $X$ must lie within the entailment cone of $Y$ in every factor. This enforces the hierarchical structure.
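The contrastive term can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it assumes the similarity score is the negative distance divided by a fixed temperature, and that the loss is averaged over the image-to-text and text-to-image directions as in CLIP. The input `dist` is a precomputed matrix of $\ell_1$-product distances, and the names `contrastive_loss` and `temperature` are hypothetical. The entailment term is omitted because its exact cone formula is not given here.

```python
import math
from typing import List


def log_softmax_at(logits: List[float], target: int) -> float:
    """Log-probability of `target` under a softmax over `logits` (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return logits[target] - log_sum_exp


def contrastive_loss(dist: List[List[float]], temperature: float = 0.1) -> float:
    """Symmetric InfoNCE over a square distance matrix.

    dist[i][j] is the distance between image i and text j;
    matching pairs are assumed to sit on the diagonal.
    """
    n = len(dist)
    # Assumption: smaller distance means higher similarity.
    logits = [[-d / temperature for d in row] for row in dist]
    image_to_text = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    aligned = [[0.1, 2.0], [2.0, 0.1]]     # matching pairs are close
    misaligned = [[2.0, 0.1], [0.1, 2.0]]  # matching pairs are far apart
    print(contrastive_loss(aligned), contrastive_loss(misaligned))
```

The loss is small when the diagonal entries (matching pairs) are the smallest distances in their row and column, and large otherwise.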

3. Key Contributions

Novel Geometry: Introduction of the $\ell_1$-product of hyperbolic factors, which theoretically and empirically unifies hierarchy (within factors) and compositionality (across factors).
Theoretical Unification: Formal linkage of Boolean lattices to $\ell_1$ metrics and metric trees to hyperbolic spaces, explaining why this specific product space is better suited to VLMs than standard Euclidean or single hyperbolic spaces.
Interpretability: The model learns to automatically assign concept families to specific factors without explicit supervision. For example, one factor naturally learns to represent "mammals" while another represents "vehicles."
State-of-the-Art Performance: PHyCLIP outperforms baselines (CLIP, MERU, HyCoCLIP) across multiple tasks.

4. Experimental Results

The model was evaluated on zero-shot classification, retrieval, hierarchical classification, and compositional understanding.

Zero-Shot Classification: PHyCLIP achieved the best performance on general datasets (e.g., ImageNet, Food-101) and fine-grained datasets (e.g., Oxford-IIIT Pets), demonstrating its ability to handle both broad and specific taxonomies.
Retrieval: It achieved superior Recall@k scores on COCO and Flickr30K, particularly in handling hard negatives where objects are present/absent. The $\ell_1$ metric effectively penalizes mismatches in specific factors.
Hierarchical Classification: On ImageNet with WordNet labels, PHyCLIP showed the lowest Tree Induced Error (TIE) and highest Hierarchical Precision/Recall, indicating that misclassifications were semantically closer to the ground truth.
Compositional Understanding: On benchmarks like VL-CheckList and SugarCrepe (testing object/attribute/relation swaps), PHyCLIP outperformed the baselines. It distinguished captions such as "a dog in a car" from "a dog on a bike," indicating that it captures cross-family compositionality.
Ablation Studies:
- Increasing the number of factors ( $k$ ) generally improved performance, confirming the benefit of factorization.
- Replacing the $\ell_1$ product metric with an $\ell_2$ (Riemannian product) or $\ell_\infty$ product metric caused significant performance drops, supporting the choice of the $\ell_1$ product for compositionality.
- Mixed-curvature models (Euclidean + Hyperbolic) performed worse than the pure hyperbolic product.
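One way to build intuition for the metric ablation is to compare how the three aggregation rules combine per-factor distances. The sketch below is illustrative only and is not taken from the paper's experiments: the per-factor distances are made-up numbers, and the function name `aggregate` is hypothetical.

```python
import math
from typing import List


def aggregate(per_factor: List[float], norm: str) -> float:
    """Combine per-factor distances with an l1, l2, or l-infinity rule."""
    if norm == "l1":
        return sum(per_factor)
    if norm == "l2":
        return math.sqrt(sum(d * d for d in per_factor))
    if norm == "linf":
        return max(per_factor)
    raise ValueError(f"unknown norm: {norm}")


# Made-up per-factor distances between a caption and an image.
one_mismatch = [1.0, 0.0, 0.0]   # the pair differs in one concept family
two_mismatches = [1.0, 1.0, 0.0]  # the pair differs in two concept families

if __name__ == "__main__":
    for norm in ("l1", "l2", "linf"):
        print(norm, aggregate(one_mismatch, norm), aggregate(two_mismatches, norm))
```

Under the $\ell_1$ rule a second mismatched family doubles the distance, under $\ell_2$ it grows by a factor of $\sqrt{2}$, and under $\ell_\infty$ it does not change at all. Only the $\ell_1$ rule counts mismatched families additively, which matches the Boolean-lattice picture described earlier.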

5. Significance

PHyCLIP represents a significant step forward in geometric deep learning for vision-language tasks.

Solving the Hierarchy-Composition Trade-off: It addresses the long-standing tension between hyperbolic (hierarchy) and Euclidean (composition) spaces by constructing a product space that mathematically supports both.
Emergent Structure: The model demonstrates that complex semantic structures (taxonomies and Boolean compositions) can emerge automatically from the geometry of the embedding space without explicit structural supervision.
Interpretability: The factor-wise analysis reveals that the model organizes knowledge in a human-interpretable way, separating distinct concept families into different hyperbolic factors.
Future Direction: The work suggests that future VLMs should move beyond single-space embeddings toward structured product spaces to better handle the complexity of real-world semantic relationships.
