Imagine you are teaching a computer to recognize animals.
The Old Way: The "All-or-Nothing" Teacher
Traditionally, AI classifiers treat every mistake as equally bad. If the computer is supposed to identify a Golden Retriever, but it guesses Sushi, the computer thinks, "Oh no, I'm wrong!" If it guesses Labrador, it also thinks, "Oh no, I'm wrong!"
To the old computer, confusing a dog with a fish is the exact same level of disaster as confusing one dog breed with another. It doesn't understand that a Labrador is a "cousin" to a Golden Retriever, while Sushi is a "stranger." This is like a teacher giving a student an "F" for spelling "cat" as "bat" (a small mistake) and also for spelling it as "airplane" (a huge mistake).
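The idea of "cousins" versus "strangers" can be made precise with a tree distance: count the steps from each label up to their lowest common ancestor. Here is a minimal sketch using a toy taxonomy invented for illustration (the paper's actual label tree will differ):

```python
# Toy taxonomy (hypothetical, for illustration only): each label's
# path from the root of the hierarchy down to the leaf.
PATHS = {
    "golden_retriever": ["animal", "dog", "retriever", "golden_retriever"],
    "labrador":         ["animal", "dog", "retriever", "labrador"],
    "sushi":            ["food", "japanese", "sushi"],
}

def tree_distance(a, b):
    """Hierarchy-aware error: total steps from both labels up to
    their lowest common ancestor. Cousins are close; strangers are far."""
    pa, pb = PATHS[a], PATHS[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(tree_distance("golden_retriever", "labrador"))  # 2: a small mistake
print(tree_distance("golden_retriever", "sushi"))     # 7: a huge mistake
```

Under this ruler, guessing "Labrador" costs 2 while guessing "Sushi" costs 7, which is exactly the distinction the all-or-nothing teacher throws away.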
The Problem with Current "Smart" Teachers
Researchers have tried to fix this by teaching the AI the family tree of animals (the hierarchy). They want the AI to know that a mistake closer to the truth (Labrador) is better than a mistake far away (Sushi).
However, the paper argues that the current tools used to grade these "smart" teachers are broken. They use metrics that are like a blurry ruler.
- The Analogy: Imagine you are judging a race. The current ruler only measures the average distance by which runners fell behind. If one runner stumbles slightly and another crashes spectacularly, the averages can look identical. The ruler doesn't tell you who fell where, or whether the runner who fell far behind was actually running in the wrong direction entirely.
- The Result: Some AI models get good grades on these broken rulers but are actually making terrible, confusing mistakes.
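The "blurry ruler" problem is easy to demonstrate with made-up numbers (the error values below are hypothetical, just on a hierarchy-aware scale where 2 means "wrong breed" and 7 means "wrong kingdom"):

```python
# Two hypothetical models, each making four errors on dog images.
# Distances: 0 = correct, 2 = nearby breed, 7 = a "Sushi"-level disaster.
model_a_errors = [2, 2, 2, 2]   # always guesses a close cousin
model_b_errors = [0, 0, 1, 7]   # usually right, but one severe blunder

avg = lambda xs: sum(xs) / len(xs)
print(avg(model_a_errors))  # 2.0
print(avg(model_b_errors))  # 2.0 -- identical average, very different behavior
print(max(model_a_errors))  # 2
print(max(model_b_errors))  # 7
```

The average mistake-distance is the same for both models, yet only one of them ever confuses a dog with a fish. A metric that reports only the average cannot see this difference, which is the paper's complaint.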
The Solution: Hier-COS (The "Organized Library")
The authors introduce a new framework called Hier-COS. To understand how it works, let's use a Library Analogy.
Imagine a massive library where books are organized by genre, then sub-genre, then author, then title.
- Old AI: Tries to shove every book into a single, flat shelf. When it needs to find a book, it just guesses based on how "close" the cover looks.
- Hier-COS: Builds a multi-dimensional, organized library.
- It creates a special "room" (a subspace) for the entire "Fiction" section.
- Inside that room, it creates a smaller "Mystery" corner.
- Inside the Mystery corner, it has a specific shelf for "Detective Novels."
- Finally, it has a specific slot for "Agatha Christie."
When the AI sees a picture of a Golden Retriever, it doesn't just guess a label. It projects the image into this library:
- It lands firmly in the "Dog" room.
- It settles into the "Retriever" corner.
- It finds the "Golden Retriever" slot.
Why is this special?
- Adaptive Capacity: Some parts of the library are huge (like "Animals" which has millions of species), and some are tiny (like "Golden Retrievers"). Hier-COS automatically gives more "shelf space" (learning power) to the complex, crowded areas and less to the simple ones. It knows that distinguishing between 500 types of birds is harder than distinguishing between a dog and a cat, so it adjusts its focus accordingly.
- Consistency: If the AI guesses "Golden Retriever," it must also be in the "Dog" room and the "Animal" room. It can't guess "Golden Retriever" while thinking it's a "Fish." The structure forces the AI to be logically consistent at every level.
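One simple way to see why consistency can hold by construction (a conceptual sketch, not the paper's actual subspace-projection mechanism): if the coarse labels are always derived from the predicted leaf by walking up the tree, they can never contradict it.

```python
# Hypothetical taxonomy: each node maps to its parent.
PARENT = {
    "golden_retriever": "retriever",
    "labrador": "retriever",
    "retriever": "dog",
    "dog": "animal",
    "salmon": "fish",
    "fish": "animal",
}

def consistent_prediction(leaf):
    """Derive every coarser label from the predicted leaf, so the
    answer agrees with the hierarchy at all levels by construction."""
    path = [leaf]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path  # fine-to-coarse

print(consistent_prediction("golden_retriever"))
# ['golden_retriever', 'retriever', 'dog', 'animal']
```

A "Golden Retriever" answer automatically lands in the "Dog" room and the "Animal" room; there is no way to pair it with "Fish."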
The New Grading System: HOPS
The authors also realized the old grading system was broken, so they invented a new one called HOPS (Hierarchically Ordered Preference Score).
- Old Grading: "Did you get the exact right answer? Yes/No. If no, how far off were you on average?"
- HOPS Grading: "Let's look at your top 5 guesses. Did you list them in the right order of similarity? Did you put the 'Labrador' before the 'Cat'?"
HOPS rewards the AI for having a good sense of order. Even if it doesn't get the exact right answer, if it puts the most similar things at the top of its list, it gets a high score. It's like grading a student not just on the final answer, but on their logical reasoning process.
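A simplified score in the spirit of HOPS (an illustrative sketch, not the paper's exact formula) can check whether the top-k guesses are ordered from most to least hierarchically similar to the true label:

```python
def preference_score(top_k, distance_to_truth):
    """Fraction of guess pairs (earlier, later) whose hierarchical
    distances to the true label are in non-decreasing order.
    1.0 = perfectly ordered by similarity; 0.0 = exactly reversed.
    Simplified illustration, not the paper's HOPS definition."""
    d = [distance_to_truth[g] for g in top_k]
    pairs = [(i, j) for i in range(len(d)) for j in range(i + 1, len(d))]
    good = sum(1 for i, j in pairs if d[i] <= d[j])
    return good / len(pairs)

# Hypothetical distances from each guess to "Golden Retriever":
dist = {"labrador": 2, "poodle": 3, "cat": 5, "sushi": 7}

print(preference_score(["labrador", "poodle", "cat", "sushi"], dist))  # 1.0
print(preference_score(["sushi", "cat", "poodle", "labrador"], dist))  # 0.0
```

A model that lists "Labrador" before "Cat" before "Sushi" scores perfectly even if its top guess was wrong; a model with the same guesses in reverse order scores zero, because its sense of similarity is backwards.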
The Results
The authors tested this new "Library System" (Hier-COS) on four difficult datasets (like identifying different types of aircraft, birds, and plants).
- Outcome: It beat all previous methods. It made fewer "severe" mistakes (confusing a dog with a fish) and was more consistent.
- Bonus: It worked great even when using a pre-trained "brain" (a frozen Vision Transformer) that wasn't originally designed for this. It just needed a small "adapter" to learn how to use the library.
In Summary
This paper says: "Stop treating all mistakes as equal. Build AI that understands the family tree of concepts, organizes its knowledge like a structured library, and gets graded on how well it orders its guesses, not just whether it got the single right answer."