Information theory for hypergraph similarity

Imagine you are trying to compare two complex social groups, like two different families or two different teams of coworkers.

The Old Way (Graphs):
Traditionally, scientists have studied these groups by checking only who interacts with whom, one pair at a time. They draw a line between Person A and Person B if they talk. This is like looking at a group photo and only counting the pairs of people holding hands. It's a simple, two-person (dyadic) view. But in real life, people often interact in bigger groups: three friends grabbing coffee, a whole committee meeting, or a family dinner. The old method misses these "group hugs."

The New Tool (Hypergraphs):
This paper introduces a way to study these "group hugs" properly. Instead of just lines between two people, they use Hypergraphs. Think of a hypergraph as a set of bubbles. Some bubbles hold two people, some hold three, some hold five, and some hold ten. These bubbles represent the actual groups where people interact.

The Problem:
Scientists have had trouble comparing two different hypergraphs (two different groups of bubbles).

Some old methods were too sensitive; if you changed one tiny detail, the whole comparison broke.
Other methods were too slow; they took forever to calculate, like trying to count every grain of sand on a beach one by one.
Many methods couldn't tell the difference between a real connection and a random coincidence. If two groups happened to have a few people in common just by chance, old tools said, "Hey, these groups are similar!" even when they were totally different.

The Solution: The "Compression" Analogy
The authors created a new tool based on Information Theory, specifically a concept called Minimum Description Length (MDL).

Here is the best way to understand it: Imagine you are trying to describe a complex Lego castle to a friend over the phone so they can build an identical one.

The Goal: You want to use the fewest words possible (the shortest "description") to get the job done.
The Trick: If your friend already knows the first half of the castle, you don't need to describe those parts again. You only need to describe the new parts.
The Measure: If you can describe the second castle very quickly because your friend already knows the first one, the two castles are very similar. If you have to write a whole new book to describe the second one, they are very different.

This paper builds a "dictionary" for hypergraphs using this logic. They ask: "How many bits of information do I save if I tell you about Group A before describing Group B?"

The Three Levels of Comparison
The authors built a "hierarchy" of three ways to do this comparison, getting more and more sophisticated:

The "Bulk" Method (The Big Bag):
Imagine dumping all the Lego bricks from both castles into one giant bag and seeing how many are the same. This is simple, but it fails if one castle has mostly tiny bricks and the other has mostly giant bricks. It gets confused by the size differences.
The "Align" Method (Sorting by Size):
This method sorts the bricks by size first. It compares the small bricks to small bricks, and the big bricks to big bricks. This is much better at handling groups of different sizes. It's like comparing the "two-person bubbles" to "two-person bubbles" and "five-person bubbles" to "five-person bubbles."
The "Cross" Method (The Master Key):
This is the most powerful tool. It realizes that sometimes a big group (a 5-person bubble) can explain a smaller group (a 2-person bubble).
- Analogy: If you know a family of five (Mom, Dad, and three kids) is having dinner, you automatically know that the "Mom and Dad" pair is also having dinner. You don't need to list the pair separately; the big group contains the small one.
- The "Cross" method looks for these "nested" relationships. It asks: "Does the big group in Network A explain the small group in Network B?" This allows it to find similarities that the other methods miss completely.

What They Found
The authors tested this on fake data (to make sure it works) and real data (to see if it's useful).

Fake Data: They created random groups and added "noise" (random changes). Their new tool correctly said, "These are different," even when the groups were huge and sparse. Old tools often got fooled by random chance.
Real Data: They looked at three real-world examples:
1. Scientists: Comparing physics fields. They found that "Nuclear Physics" and "Particle Physics" are very similar (they share many group interactions), while "Gas Physics" is quite different.
2. Movies: Comparing movie genres. They found "Thrillers" and "Dramas" are very similar in how actors group together, but "Documentaries" are very different, as you might expect from a fundamentally different format.
3. Software: Comparing coding teams. They found that tools for "Command Lines," "Development," and "Data Structures" are very similar because they share similar collaboration patterns.

The Bottom Line
This paper gives scientists a new, fair, and fast ruler to measure how similar complex groups are. It doesn't just count who knows who; it understands how people work together in teams of all sizes, and it can tell the difference between a real connection and a lucky coincidence. It's like upgrading from a black-and-white photo of a crowd to a high-definition 3D video that shows exactly how the groups move and interact.

Technical Summary: Information Theory for Hypergraph Similarity

Problem Statement
Comparing networked systems is fundamental to tasks such as clustering, classification, and anomaly detection. While traditional network similarity measures are well-developed for graphs consisting of pairwise interactions, they fail to capture the dynamics of complex systems where interactions involve groups of more than two nodes (higher-order interactions). Existing methods for comparing hypergraphs (generalizations of graphs with edges containing any number of nodes) face significant limitations: many rely on tunable parameters to which results are highly sensitive, while others (based on spectral properties, path lengths, or graphlets) impose computational complexities that scale poorly (at least quadratically) with network size. Furthermore, many current approaches incorporate ad hoc structural features without clear fundamental principles, leading to results that are difficult to interpret and may not generalize across domains. There is a need for a principled, non-parametric framework to quantify structural overlap in higher-order networks while correcting for spurious correlations arising from statistical noise and edge density.

Methodology
The authors construct a general information-theoretic framework for hypergraph similarity based on the Minimum Description Length (MDL) principle. The core idea is to quantify the similarity between two hypergraphs, $G_1$ and $G_2$, by measuring the amount of information saved when transmitting one hypergraph given knowledge of the other and their structural overlap.

Information-Theoretic Formulation:
The framework defines entropy ($H_c$) and conditional entropy ($H_c(G_j|G_i)$) based on specific encoding schemes ($c$). The mutual information (MI) is calculated as $MI_c(G_1; G_2) = H_c(G_2) - H_c(G_2|G_1)$. To ensure a uniform scale, this is normalized to a Normalized Mutual Information (NMI) score in the range $[0, 1]$, defined as:
$NMI_c(G_1, G_2) = 1 - \min \left\{ \frac{H_c(G_2|G_1)}{H_c(G_2)}, \frac{H_c(G_1|G_2)}{H_c(G_1)} \right\}$
This formulation allows for asymmetry in the encoding process, which is crucial for handling nested structures where transmitting lower-order edges from higher-order edges is informationally cheaper than the reverse.
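The normalization above can be illustrated with a minimal Python sketch. The encoding used here is an assumption made for illustration, not necessarily the paper's scheme $c$: a single layer is sent by choosing its edges uniformly among all candidates, and the conditional code first picks which reference edges are shared and then picks the remaining edges from the candidates outside the reference. The cost of transmitting edge counts is ignored, and the function names are hypothetical.

```python
from math import comb, log2


def layer_entropy(n_possible: int, n_edges: int) -> float:
    """Bits to name n_edges edges among n_possible candidates (assumed uniform code)."""
    return log2(comb(n_possible, n_edges))


def conditional_entropy(n_possible: int, e_ref: int, e_target: int, overlap: int) -> float:
    """Bits to send the target when the receiver already knows the reference.

    Assumed two-part code: choose the shared edges among the reference edges,
    then choose the remaining target edges among the non-reference candidates.
    """
    shared = log2(comb(e_ref, overlap))
    new = log2(comb(n_possible - e_ref, e_target - overlap))
    return shared + new


def nmi(h1: float, h2: float, h2_given_1: float, h1_given_2: float) -> float:
    """NMI = 1 - min{ H(G2|G1)/H(G2), H(G1|G2)/H(G1) }, as in the formula above."""
    if h1 <= 0 or h2 <= 0:
        raise ValueError("entropies must be positive for the ratio to be defined")
    return 1.0 - min(h2_given_1 / h2, h1_given_2 / h1)


# Toy case: 10 nodes, pairwise layer only, so comb(10, 2) = 45 candidate edges.
N_POSSIBLE = comb(10, 2)
E1 = E2 = 8


def toy_nmi(overlap: int) -> float:
    """NMI of two 8-edge layers that share `overlap` edges."""
    h1 = layer_entropy(N_POSSIBLE, E1)
    h2 = layer_entropy(N_POSSIBLE, E2)
    h2_1 = conditional_entropy(N_POSSIBLE, E1, E2, overlap)
    h1_2 = conditional_entropy(N_POSSIBLE, E2, E1, overlap)
    return nmi(h1, h2, h2_1, h1_2)


for shared_edges in (2, 6, 8):
    print(shared_edges, round(toy_nmi(shared_edges), 3))
```

In this toy setting the two conditional entropies coincide because both layers have the same size; the minimum in the formula only matters when the two directions of transmission differ in cost, as they do for nested structures.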
Hierarchy of Encodings:
The paper proposes a hierarchy of three specific encodings to capture different aspects of similarity:
- $NMI_{bulk}$: Transmits all hyperedges at once. This captures intra-order similarity but is inefficient for real-world sparse hypergraphs, often inflating similarity scores due to the vast space of possible hyperedges.
- $NMI_{align}$: Transmits hyperedges layer by layer (by order $\ell$), comparing only layers of the same order. This corrects for heterogeneous densities across layers and is robust to statistical noise, but it fails to capture cross-order similarities.
- $NMI_{cross}$: The most flexible measure. It allows a layer $G^{(\ell)}_j$ to be transmitted using any layer $G^{(k)}_i$ of equal or higher order ($k \ge \ell$) from the reference hypergraph. This captures both intra-order and cross-order similarity (nestedness). It utilizes a recursive algorithm to efficiently compute overlaps between projected layers without explicitly generating all sub-tuples, enabling scalability to large systems.
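The difference between the align and cross comparisons can be made concrete with a small Python sketch that only counts overlaps; it does not compute the description lengths themselves. It is deliberately naive: `project` enumerates every sub-tuple with `itertools.combinations`, which is exactly the combinatorial cost the recursive algorithm mentioned above is said to avoid. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def layers(hyperedges):
    """Group hyperedges by order (number of distinct nodes)."""
    by_order = defaultdict(set)
    for edge in hyperedges:
        nodes = frozenset(edge)
        by_order[len(nodes)].add(nodes)
    return dict(by_order)


def project(layer, order):
    """All `order`-sized subsets of the hyperedges in `layer` (naive enumeration)."""
    return {frozenset(sub) for edge in layer for sub in combinations(edge, order)}


def align_overlap(g_ref, g_target):
    """For each target order, count hyperedges shared with the same-order reference layer."""
    ref, target = layers(g_ref), layers(g_target)
    return {order: len(ref.get(order, set()) & layer) for order, layer in target.items()}


def cross_overlap(g_ref, g_target):
    """Count overlaps between each target layer (order l) and every reference
    layer of order k >= l projected down to order l. Keys are (k, l)."""
    ref, target = layers(g_ref), layers(g_target)
    return {
        (k, l): len(project(ref_layer, l) & target_layer)
        for l, target_layer in target.items()
        for k, ref_layer in ref.items()
        if k >= l
    }


# One five-person group in the reference; three pairs in the target.
g_ref = [(1, 2, 3, 4, 5)]
g_target = [(1, 2), (3, 4), (6, 7)]

print(align_overlap(g_ref, g_target))  # same-order comparison only
print(cross_overlap(g_ref, g_target))  # pairs nested inside the five-person group
```

Here the same-order comparison finds nothing in common, because the reference has no pairwise layer at all, while the cross-order comparison sees that two of the three target pairs sit inside the reference's five-node hyperedge.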
Multiscale Extension:
The framework is extended to multiscale similarity by coarse-graining nodes into partitions (e.g., communities). This allows for the comparison of hypergraphs at a macro-scale, assessing similarity in modular structure even when individual hyperedges do not overlap.
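As a rough illustration of coarse-graining, the sketch below relabels each node by its group and keeps, for each hyperedge, the set of groups it touches. This is one simple choice assumed for illustration; it discards how many members of each group take part, and the paper may treat such multiplicities differently. The partition and function name are hypothetical.

```python
def coarse_grain(hyperedges, partition):
    """Map each hyperedge to the set of group labels of its nodes."""
    return {frozenset(partition[node] for node in edge) for edge in hyperedges}


# Hypothetical partition of six nodes into two communities.
partition = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}

g1 = [(1, 2, 4), (3, 5)]
g2 = [(2, 3, 5), (1, 6)]

# No hyperedge is shared at the node level...
node_level_shared = {frozenset(e) for e in g1} & {frozenset(e) for e in g2}

# ...but both hypergraphs only contain interactions that bridge A and B.
macro1 = coarse_grain(g1, partition)
macro2 = coarse_grain(g2, partition)

print(len(node_level_shared), macro1 == macro2)
```

The two toy hypergraphs share no hyperedges, yet they coincide after coarse-graining, which is the kind of macro-scale similarity the multiscale extension is meant to pick up.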

Key Contributions

Principled Framework: The introduction of a non-parametric, information-theoretic foundation for hypergraph comparison that avoids arbitrary parameter tuning.
Hierarchy of Measures: The derivation of a hierarchy of NMI measures ($NMI_{bulk}$, $NMI_{align}$, $NMI_{cross}$) that progressively capture more granular structural overlaps, including cross-order interactions and nestedness.
Computational Efficiency: The development of a recursive counting scheme for $NMI_{cross}$ that avoids the combinatorial explosion of direct projection, allowing for the efficient comparison of hypergraphs with millions of nodes and large hyperedge orders.
Correction for Spurious Correlations: The method inherently corrects for spurious overlaps that arise from high edge densities or heterogeneous layer densities, which plague simpler overlap-based metrics.

Results
The authors validate the framework through extensive experiments on synthetic and empirical data:

Synthetic Intra-order Similarity: In experiments with random hypergraphs, $NMI_{align}$ successfully distinguishes meaningful overlap from noise in heterogeneous layer densities, whereas $NMI_{bulk}$ inflates similarity scores in high-noise regimes due to density effects.
Synthetic Cross-order Similarity: Using "block-nested" hypergraphs where layers are nested across different orders, $NMI_{cross}$ successfully detects structural similarity even when intra-order similarity is destroyed. In contrast, $NMI_{align}$ fails to detect these cross-order relationships, dropping to near-zero similarity.
Empirical Applications: The framework is applied to three real-world multiplex hypergraphs:
- Physics Collaboration (APS): Reveals high similarity between structurally related fields (e.g., Nuclear and Elementary Particle physics) and dissimilarity between disparate fields.
- Film Industry (IMDb): Identifies high similarity between genres with blurred boundaries (e.g., Thriller and Drama) and low similarity between fundamentally different formats (e.g., Documentaries).
- Software Development (Rust): Detects functional similarities between repository categories (e.g., command line utilities and development tools) based on collaboration patterns.
Anomaly Detection: Applied to temporal Enron email data, the hypergraph similarity measure detects structural anomalies and organizational shifts that pairwise graph similarity measures miss, demonstrating the importance of higher-order dynamics.
Dynamical Relevance: Experiments with SIS contagion processes show that the $NMI_{cross}$ score correlates with the epidemic threshold; hypergraphs with higher structural similarity to a nested reference exhibit earlier epidemic onset, linking structural similarity to dynamical behavior.

Significance
The paper claims to provide foundational tools for the principled comparison of higher-order networks. By leveraging the MDL principle, the proposed measures offer a way to extract salient structural features without relying on ad hoc heuristics or tunable parameters. The work highlights that structural organization in systems with non-dyadic interactions (such as nestedness and cross-order dependencies) is critical for understanding system dynamics. The framework enables the detection of meaningful patterns in empirical higher-order networks that are invisible to traditional pairwise methods, shedding light on the structural organization of complex systems ranging from scientific collaboration to social contagion. The authors note that while the current hierarchy focuses on node-aligned hypergraphs, the framework is flexible enough to be extended to multiscale comparisons and other encoding schemes in future work.
