Imagine you are hiring a team of super-smart detectives (these are the Graph Neural Networks, or GNNs) to solve mysteries. Their job is to look at a web of connections—like a map of friends, a chemical molecule, or a traffic network—and answer specific questions: "Is this map perfectly organized?" or "Does everyone have exactly one boss?"
For a long time, we've been testing these detectives with simple riddles. But this paper asks a bigger question: Are our detectives actually smart enough to spot the subtle, complex rules that govern the real world, or are they just guessing?
Here is the story of how the authors built a "super-test" to find out, using a creative mix of logic, puzzles, and stress tests.
1. The Problem: The "Guessing Game"
Suppose you want to test whether a detective can spot a "perfectly organized office" (one where everyone has exactly one boss).
- The Old Way: You randomly generate 1,000 office layouts. 999 of them are messy, and only 1 is perfect. You show them to the detective. They guess "Messy" 999 times and get it right. But they didn't actually learn the rule; they just guessed "Messy" because it's common.
- The New Way: You need a test where the "Messy" and "Perfect" offices are equally common, and sometimes they look almost identical, differing by just one tiny detail.
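The "guessing game" problem is easy to see with a little arithmetic. Here is a minimal sketch (the names and numbers are illustrative, not from the paper's code) showing why a skewed test set rewards a classifier that learns nothing:

```python
# Hypothetical sketch: why class balance matters when scoring a
# property classifier. A model that always outputs the majority label
# looks accurate on a skewed test set and clueless on a balanced one.

def always_guess(label, dataset):
    """Accuracy of a classifier that ignores its input and always
    predicts `label`, evaluated against a list of true labels."""
    return sum(1 for y in dataset if y == label) / len(dataset)

# The old way: 999 messy offices, 1 perfect one.
skewed = ["messy"] * 999 + ["perfect"]
print(always_guess("messy", skewed))    # 0.999 -- looks smart, learned nothing

# The new way: equal numbers of each, so blind guessing scores only 50%.
balanced = ["messy"] * 500 + ["perfect"] * 500
print(always_guess("messy", balanced))  # 0.5
```

Balancing the classes is what forces the detective to actually learn the rule rather than play the odds.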
2. The Solution: The "Alloy" Magic Box
To create this perfect test, the authors used a tool called Alloy. Think of Alloy as a magic recipe book for logic.
- Instead of randomly drawing graphs and hoping for the best, the authors wrote a "recipe" (a formal specification) for a specific rule, like "No two people can point at each other" (Antisymmetry).
- The Alloy "chef" then cooked up thousands of graphs that strictly followed this recipe, and thousands that strictly broke it.
- They didn't just stop there. They created two types of test sets:
- GraphRandom: A mix of clear "Yes" and clear "No" examples.
- GraphPerturb: The "Stress Test." Here, they took a "Yes" graph and changed just one or two edges (like moving one person's desk) to turn it into a "No" graph. This is like asking the detective, "Can you tell the difference between these two nearly identical twins?"
They built 352 different test sets covering 16 different rules, ranging from simple properties like "no loops" to complex ones like "total order," which underpins real applications such as scheduling and voting systems.
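To make the two kinds of test sets concrete, here is a toy sketch in Python (not the authors' Alloy specifications) using antisymmetry as the rule and a graph represented as a set of directed edges:

```python
# Toy illustration of the two test-set styles, with antisymmetry
# ("no two distinct people point at each other") as the property.
# Graphs are sets of directed edges (u, v).

def is_antisymmetric(edges):
    """True if no pair of distinct nodes points at each other."""
    return not any((v, u) in edges for (u, v) in edges if u != v)

# GraphRandom-style examples: a clear "Yes" and a clear "No".
yes_graph = {(0, 1), (1, 2), (2, 3)}
no_graph = {(0, 1), (1, 0), (2, 3)}
assert is_antisymmetric(yes_graph)
assert not is_antisymmetric(no_graph)

# GraphPerturb-style example: add a single reversed edge to the "Yes"
# graph, turning it into a nearly identical "No" twin.
perturbed = yes_graph | {(1, 0)}
assert not is_antisymmetric(perturbed)
```

The perturbed graph differs from the original by one edge, which is exactly what makes the "nearly identical twins" test so demanding.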
3. The Three Ways to Test a Detective
The authors didn't just ask, "Did they get the right answer?" They looked at how the detectives thought, using three specific lenses:
Generalizability (The "Big Picture" Test):
- The Analogy: You teach the detective on a small town map. Can they solve the mystery on a massive city map?
- The Result: Most detectives were surprisingly good at this. If they learned the rule on a small graph, they could usually apply it to a bigger one.
Sensitivity (The "Microscope" Test):
- The Analogy: You show the detective two almost identical maps. One has a tiny traffic jam; the other doesn't. Can they spot the difference?
- The Result: This was hard. Many detectives got confused. They could see the big picture but missed the tiny, crucial details that changed the answer.
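One way to see why the microscope test is hard: a graph and its one-edge perturbation can have nearly identical summary statistics, so a model that relies on coarse aggregates may barely register the change. A hypothetical sketch (not from the paper):

```python
# Hypothetical illustration: a single flipped edge barely moves a
# graph-level summary statistic, even though it flips the label
# (antisymmetric -> not antisymmetric).

def mean_out_degree(edges, n_nodes):
    """Average number of outgoing edges per node."""
    return len(edges) / n_nodes

n = 10
chain = {(i, i + 1) for i in range(n - 1)}  # antisymmetric "Yes" graph
broken = chain | {(1, 0)}                   # one added reverse edge: a "No"

print(mean_out_degree(chain, n))   # 0.9
print(mean_out_degree(broken, n))  # 1.0
```

The labels are opposite, but the aggregate barely moves; spotting the flip requires attending to the specific offending pair of edges, not the averages.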
Robustness (The "Curveball" Test):
- The Analogy: You teach the detective on simple maps, then throw a weird, complex, slightly broken map at them. Do they panic, or do they stick to the rules?
- The Result: This was the hardest test. Most detectives crumbled when the graphs got slightly messy or different from what they saw in training.
4. The Star of the Show: The "Global Pooling" Mechanism
The paper focuses on a specific part of the detective's brain called Global Pooling.
- The Analogy: Imagine the detective looks at every person in the room (nodes) and gathers their clues. Global Pooling is the step where they summarize all those clues into one final verdict.
- The Question: Does the way they summarize the clues matter?
- The Findings:
- Simple Summaries (Mean/Sum): Like taking an average or a total of everyone's clues. Good for coarse, global properties, but it washes out the node-level detail.
- Attention (The "Focus" Method): Like a detective who says, "Ignore the noise, look at THIS specific person." These methods were great at handling big, complex maps (Generalizability) and staying calm under pressure (Robustness).
- Second-Order (The "Relationship" Method): Like a detective who looks at how people relate to each other, not just who they are. These were the best at spotting tiny differences (Sensitivity).
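The three summarizing styles can be sketched in a few lines of plain Python. These are toy versions (not the paper's implementations) operating on per-node feature vectors, just to show how each readout aggregates differently:

```python
# Toy sketches of three global pooling readouts. `nodes` is a list of
# equal-length per-node feature vectors (plain Python lists).
import math

def mean_pool(nodes):
    """Simple summary: average each feature over all nodes."""
    d = len(nodes[0])
    return [sum(v[i] for v in nodes) / len(nodes) for i in range(d)]

def attention_pool(nodes, scores):
    """Focus method: weight nodes by softmax-normalized attention
    scores, so high-scoring nodes dominate the verdict."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    d = len(nodes[0])
    return [sum((e / total) * v[i] for e, v in zip(exps, nodes))
            for i in range(d)]

def second_order_pool(nodes):
    """Relationship method: average the outer products of node
    features, capturing how features co-occur (returned flattened)."""
    d = len(nodes[0])
    acc = [[0.0] * d for _ in range(d)]
    for v in nodes:
        for i in range(d):
            for j in range(d):
                acc[i][j] += v[i] * v[j]
    n = len(nodes)
    return [acc[i][j] / n for i in range(d) for j in range(d)]

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(mean_pool(feats))                        # each feature averages to 2/3
print(attention_pool(feats, [0.0, 0.0, 5.0]))  # dominated by the third node
print(second_order_pool(feats))                # off-diagonals capture co-occurrence
```

Note how mean pooling treats every node equally, attention lets one node dominate, and the second-order readout keeps pairwise feature interactions that the other two discard.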
5. The Big Takeaway
The paper concludes that no single detective is perfect at everything.
- If you need to spot tiny errors, you need a "Second-Order" detective.
- If you need to handle huge, messy networks, you need an "Attention" detective.
- Currently, most AI models are using a "one-size-fits-all" strategy, which is why they fail at complex real-world tasks.
The Future: The authors suggest we should build adaptive detectives—AI that can switch its strategy depending on the puzzle. Sometimes it should focus on the big picture; other times, it should zoom in on the tiny details. By using this rigorous, "property-driven" testing method, we can finally build AI that is not just smart, but reliable enough to trust with real-world problems like designing new drugs or managing power grids.