Learning Hierarchical Knowledge in Text-Rich Networks with Taxonomy-Informed Representation Learning

The paper proposes TIER, a novel framework for Text-Rich Networks that constructs an implicit hierarchical taxonomy through similarity-guided contrastive learning and LLM refinement, then integrates this structure into node representations via a cophenetic correlation-based regularization loss to achieve superior, interpretable modeling of hierarchical semantics.

Yunhui Liu, Yongchao Liu, Yinfeng Chen, Chuntao Hong, Tao Zheng, Tieke He

Published 2026-03-10
📖 5 min read · 🧠 Deep dive

Imagine you walk into a massive, chaotic library. This library doesn't just have books; every book is a "node" in a giant web, connected to other books by invisible strings (like citations or co-purchases). Some books are about "Science," others about "Cooking," and some are about "How to bake a cake."

The problem? The librarians (current AI models) are great at reading the words on the page, but they treat every book as a flat, isolated item. They don't realize that "Baking a Cake" is a tiny branch of "Cooking," which is a branch of "Food," which is a branch of "Life." They miss the hierarchy.

This paper introduces a new system called TIER (Taxonomy-Informed Representation Learning) that teaches AI to understand this library not just as a pile of books, but as a structured tree with roots, branches, and leaves.

Here is how TIER works, broken down into simple steps:

1. The Problem: The "Flat" Library

Current AI models are like a student who memorizes every book title but doesn't understand the Dewey Decimal System. If you ask them, "What is similar to a book on 'Sourdough Bread'?", they might say "A book on 'Sourdough Starter'" (good!) but also "A book on 'Bread Machines'" (okay) and maybe "A book on 'The History of France'" (because both books happen to mention French baking somewhere in their text). They miss the coarse-to-fine structure. They don't know that "Bread" is a sub-category of "Food."

2. The Solution: TIER's Three-Step Magic

TIER acts like a super-smart librarian who reorganizes the library from scratch, using a mix of math and a "Super-Brain" (a Large Language Model or LLM).

Step A: The "Similarity Magnet" (Clustering)

First, TIER uses a technique called Contrastive Learning. Imagine you have a pile of mixed-up socks. You want to group them.

  • The Old Way: You just look at the color.
  • The TIER Way: You look at the color and you notice that socks that are often found in the same drawer (connected by edges in the network) probably belong together.
    TIER pulls similar documents closer together in a digital space and pushes different ones apart, creating a "clustering-friendly" map.
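The pull-closer/push-apart idea can be sketched as an InfoNCE-style contrastive loss where nodes connected by an edge are treated as positive pairs. This is a simplified illustration under that assumption, not TIER's exact objective:

```python
import numpy as np

def contrastive_loss(embeddings, edges, temperature=0.5):
    """InfoNCE-style sketch: connected nodes (edges) are positives,
    all other nodes act as negatives."""
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature          # pairwise similarity matrix
    np.fill_diagonal(sim, -np.inf)         # a node is never its own negative
    losses = []
    for i, j in edges:                     # each edge supplies a positive pair
        # -log( exp(sim_ij) / sum_k exp(sim_ik) )
        log_denom = np.log(np.exp(sim[i]).sum())
        losses.append(log_denom - sim[i, j])
    return float(np.mean(losses))
```

Minimizing this loss pulls linked documents together in the embedding space while pushing unrelated ones apart, which is what makes the space "clustering-friendly."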

Step B: The "Smart Sorter" (LLM Refinement)

This is the coolest part. Standard clustering math (like K-Means) is good at grouping things whose text looks alike, but it's bad at understanding meaning. It might group "Apple the Fruit" and "Apple the Tech Company" together just because they both have the word "Apple."

TIER brings in an LLM (like a super-smart AI assistant) to act as a quality control inspector.

  • The Split: If a group is too messy (e.g., it has both "Apple Fruit" and "Apple Tech"), the LLM says, "Hey, split this group in two!"
  • The Merge: If two groups are actually about the same thing (e.g., "Baking Cakes" and "Making Pies"), the LLM says, "Combine these!"
  • The Labeling: The LLM reads the books in the group and gives them a human-readable name, like "Desserts" or "Electronics."

This creates a beautiful, hierarchical tree: Food → Desserts → Cakes.
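The split/merge inspection loop above might look like the following sketch. Here `ask_llm` is a hypothetical stand-in for a real LLM call, and the halfway-point split is a placeholder (TIER would re-cluster); the prompts and actions are illustrative, not the paper's:

```python
def refine_clusters(clusters, ask_llm):
    """One refinement pass: ask an LLM judge to SPLIT incoherent
    clusters, then MERGE near-duplicate ones."""
    # Split pass: the LLM inspects each cluster's member titles
    refined = []
    for titles in clusters:
        if ask_llm("split?", titles) == "split":
            mid = len(titles) // 2          # placeholder partition
            refined.extend([titles[:mid], titles[mid:]])
        else:
            refined.append(titles)
    # Merge pass: combine clusters the LLM judges to be the same topic
    merged, used = [], set()
    for i in range(len(refined)):
        if i in used:
            continue
        group = list(refined[i])
        for j in range(i + 1, len(refined)):
            if j not in used and ask_llm("merge?", refined[i] + refined[j]) == "merge":
                group += refined[j]
                used.add(j)
        merged.append(group)
    return merged
```

Running this pass repeatedly, one level at a time, is how a flat set of clusters grows into a labeled tree.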

Step C: The "Ruler" (Regularization)

Now that TIER has built this perfect tree, it needs to make sure the AI's internal brain (the embeddings) actually looks like that tree.

  • Imagine the tree is a blueprint.
  • TIER uses a mathematical ruler called the Cophenetic Correlation Coefficient.
  • It checks: "If two items are close cousins on the tree (like 'Cakes' and 'Pies'), are they also close neighbors in the AI's brain?"
  • If the AI puts them far apart, TIER gently nudges them closer. If the AI puts distant cousins (like 'Cakes' and 'Cars') too close, TIER pushes them apart.
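The "ruler" can be sketched as a plain Pearson correlation between pairwise tree (cophenetic) distances and pairwise embedding distances; the regularizer would push this score toward 1. A minimal illustration, assuming the tree distances are given as a matrix (the paper's exact loss formulation may differ):

```python
import numpy as np

def cophenetic_alignment(tree_dist, embeddings):
    """Pearson correlation between tree distances and embedding
    distances over all node pairs. A score near 1 means the
    embedding space mirrors the taxonomy tree."""
    n = len(embeddings)
    tree_pairs, emb_pairs = [], []
    for i in range(n):
        for j in range(i + 1, n):
            tree_pairs.append(tree_dist[i][j])
            emb_pairs.append(np.linalg.norm(embeddings[i] - embeddings[j]))
    # The training regularizer could be e.g. 1 - correlation
    return float(np.corrcoef(tree_pairs, emb_pairs)[0, 1])
```

Intuitively: 'Cakes' and 'Pies' are one hop apart in the tree, 'Cakes' and 'Cars' many hops apart, so embeddings that respect those relative distances score near 1.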

3. The Result: A Smarter, Faster AI

By teaching the AI to respect this hierarchy, TIER achieves two amazing things:

  1. Better Understanding: The AI now knows that "Natural Language Understanding" is a specific type of "Computer Science," and it treats them with the right level of importance. It's not just memorizing words; it's understanding the structure of knowledge.
  2. Efficiency: Because TIER builds the tree once and uses it to guide the learning, it doesn't need to ask the "Super-Brain" (LLM) to read every single book in the library. It only asks the LLM to help organize the groups. This makes it much faster and cheaper than other methods that try to use a giant LLM for every single task.

The Analogy: Organizing a Wardrobe

  • Old AI: Throws all your clothes into one giant pile. To find a "Red Shirt," it has to dig through everything.
  • TIER:
    1. Sorts clothes into broad piles (Shirts, Pants, Shoes).
    2. Asks a smart friend (LLM) to check the piles: "Wait, this pile has both winter coats and summer t-shirts; let's split them."
    3. Creates a hierarchy: Clothing → Tops → Shirts → Red Shirts.
    4. Trains you to always put your Red Shirt in the "Red Shirts" bin, not just anywhere.

Why This Matters

In the real world, data is rarely flat. Scientific papers, medical records, and product catalogs are all organized in trees. TIER is the first tool that effectively teaches AI to navigate these trees automatically, without needing humans to manually label every single branch. It makes AI smarter, more organized, and much more efficient at finding what you're looking for.