GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback

The paper proposes "GNN-as-Judge," a framework that enhances few-shot semi-supervised learning on text-attributed graphs by leveraging Graph Neural Networks to generate reliable pseudo-labels and mitigate noise during the fine-tuning of Large Language Models.

Ruiyao Xu, Kaize Ding

Published 2026-04-13

The Big Picture: A Detective and a Librarian

Imagine you are trying to solve a mystery in a huge library (the Graph). The books (nodes) are connected by shelves and aisles (edges). Each book has a long, complex story written inside it (the Text).

Your goal is to figure out what genre each book belongs to (e.g., Mystery, Sci-Fi, Biography). However, there's a catch: You only have a few books with their genre labels already written on the spine. The rest are unlabeled. This is the "Low-Resource" problem.

You have two experts to help you:

  1. The Librarian (The LLM): This expert has read millions of books. They are amazing at understanding the stories inside the books. If you show them a page of text, they can guess the genre perfectly. But, they don't know how the library is organized. They don't know that books on the same shelf usually belong to the same genre.
  2. The Detective (The GNN): This expert has never read a single book. They only look at the shelves and aisles. They know that if a book is sitting next to three "Mystery" books, it's probably a "Mystery" too. They are great at spotting patterns in the layout, but they can't read the stories.

The Problem: Why They Struggle Alone

  • The Librarian's Mistake: If you ask the Librarian to guess the genre of the unlabeled books, they may get it right based on the text alone. But because they don't know the library layout, they can misjudge books whose stories read alike but that sit in very different sections. They can also produce confident but wrong guesses (hallucinations).
  • The Detective's Mistake: If you ask the Detective, they will guess based on neighbors. But if the neighbors are also unlabeled or if the layout is tricky, they might spread the wrong genre to the whole shelf.

The Challenge: You need to teach the Librarian to use the Detective's map, but you don't have enough labeled books to train them properly. If you simply let the Librarian guess and then train them on their own guesses, they will reinforce their own mistakes (the classic self-training failure known as confirmation bias).

The Solution: GNN-as-Judge

The authors propose a new system called GNN-as-Judge. Think of this as a Collaborative Training Camp where the Detective acts as a strict judge to help the Librarian learn.

Here is how the camp works in three steps:

Step 1: Picking the Right Students (Influence-Guided Selection)

You can't teach the Librarian about every unlabeled book; there are too many. You need to pick the most important ones.

  • The Metaphor: Imagine the Detective walks through the library and points to the books that are most "influenced" by the few labeled books you already have. These are the books sitting right next to the known genres.
  • The Action: The system picks these "influential" books first. These are the best candidates for learning because the Detective's map gives them a strong hint about what they should be.
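
The selection idea can be sketched in code. The summary doesn't give the paper's exact influence function, so the following is a minimal label-propagation-style approximation: spread an indicator signal from the labeled nodes a few hops outward, and pick the unlabeled nodes that receive the most of it. All names (`influence_scores`, `select_candidates`) and the choice of `hops=2` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def influence_scores(adj, labeled_idx, hops=2):
    """Score nodes by how much signal reaches them from the labeled
    nodes via the graph (label-propagation-style approximation;
    the paper's actual influence function may differ).

    adj: dense (n, n) adjacency matrix; labeled_idx: labeled node ids.
    """
    n = adj.shape[0]
    # Row-normalize so each propagation step averages over neighbors.
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1)
    # Indicator of where the known labels live.
    signal = np.zeros(n)
    signal[labeled_idx] = 1.0
    # Accumulate the signal over a few hops outward.
    total = signal.copy()
    for _ in range(hops):
        signal = P @ signal
        total += signal
    return total

def select_candidates(adj, labeled_idx, k):
    """Return the k unlabeled nodes most influenced by the labeled set."""
    scores = influence_scores(adj, labeled_idx)
    scores[np.asarray(labeled_idx)] = -np.inf  # never re-select labeled nodes
    order = np.argsort(-scores)
    return [int(i) for i in order[:k]]
```

On a path graph 0–1–2–3 with only node 0 labeled, this ranks node 1 (one shelf away) above node 2, matching the intuition that the Detective's map gives the strongest hints near the known books.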

Step 2: The "Agree or Disagree" Game (Collaborative Labeling)

Now, the Librarian and the Detective both look at the selected books and guess the genre.

  • The "Easy" Books (Agreement): Sometimes, the Librarian and the Detective both guess "Sci-Fi."
    • The Metaphor: This is like two experts nodding in agreement. We are very confident this is right. We treat this as a Gold Standard fact.
  • The "Hard" Books (Disagreement): Sometimes, the Librarian says "Sci-Fi" but the Detective says "Mystery."
    • The Metaphor: This is a debate! The Librarian might be wrong because they are ignoring the shelf layout. The Detective might be right because they see the pattern.
    • The Judge's Role: The system uses the Detective's "confidence score" to decide who is likely right. If the Detective is very sure about the "Mystery" label, we trust the Detective over the Librarian for this specific book. This helps us find the "hard" examples that the Librarian usually gets wrong.
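
The agree-or-disagree game reduces to a simple split. The sketch below is an assumption-laden reconstruction: the function name, dictionary format, and the `conf_threshold=0.8` cutoff are illustrative, and the paper may handle low-confidence disagreements differently than simply discarding them.

```python
def collaborative_label(llm_preds, gnn_preds, gnn_conf, conf_threshold=0.8):
    """Split candidate nodes into 'easy' (both experts agree) and
    'hard' (they disagree and the GNN is confident enough to judge).

    llm_preds / gnn_preds: {node_id: label}
    gnn_conf: {node_id: GNN's confidence in its own label}
    """
    agreed, disputed = {}, {}
    for node, llm_label in llm_preds.items():
        gnn_label = gnn_preds[node]
        if llm_label == gnn_label:
            # Both experts nod: treat as a reliable "gold standard" pseudo-label.
            agreed[node] = llm_label
        elif gnn_conf[node] >= conf_threshold:
            # The judge (GNN) is confident, so its answer is "chosen"
            # and the LLM's initial guess is "rejected".
            disputed[node] = {"chosen": gnn_label, "rejected": llm_label}
        # Low-confidence disagreements are dropped here as too noisy
        # (an assumption; the paper may treat them otherwise).
    return agreed, disputed
```

The `chosen`/`rejected` pairs produced here are exactly what preference tuning in Step 3 consumes.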

Step 3: The Special Training (Weakly-Supervised Fine-Tuning)

Now we teach the Librarian using these two groups of books, but we treat them differently.

  • For the "Easy" (Agreement) Books: We use Instruction Tuning.
    • The Metaphor: It's like a teacher saying, "You got this right! Remember this rule." We reinforce the correct answer.
  • For the "Hard" (Disagreement) Books: We use Preference Tuning.
    • The Metaphor: This is the clever part. Instead of just saying "You are wrong, the answer is Mystery," we say, "Look, the Detective thinks it's Mystery, and you think it's Sci-Fi. Based on the evidence (the shelf), the Detective's answer is better."
    • We don't force the Librarian to memorize the answer blindly. We teach them to prefer the Detective's logic over their own initial guess. This helps the Librarian learn why they were wrong without getting confused by noisy data.
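
The two treatments above correspond to two kinds of training records: plain supervised examples for agreements, and DPO-style preference pairs for disagreements. The sketch below only builds the data; the prompt wording is invented for illustration, since the summary doesn't show the paper's actual template.

```python
def build_training_data(agreed, disputed, node_text):
    """Turn the two pseudo-label groups into fine-tuning records.

    agreed:    {node: label}                       -> instruction-tuning examples
    disputed:  {node: {"chosen": .., "rejected": ..}} -> preference pairs
    node_text: {node: raw text attribute of the node}
    """
    prompt = "Classify the genre of this text: {text}"  # illustrative template
    sft, pref = [], []
    for node, label in agreed.items():
        # Agreement: reinforce the shared answer with a supervised example.
        sft.append({"prompt": prompt.format(text=node_text[node]),
                    "response": label})
    for node, pair in disputed.items():
        # Disagreement: teach the LLM to prefer the GNN's answer over
        # its own initial guess (DPO-style chosen/rejected pair).
        pref.append({"prompt": prompt.format(text=node_text[node]),
                     "chosen": pair["chosen"],
                     "rejected": pair["rejected"]})
    return sft, pref
```

A preference pair never asserts "Mystery is the ground truth"; it only says "Mystery is preferred over Sci-Fi here," which is exactly the softer signal the metaphor describes.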

Why This is a Big Deal

  1. It solves the "Low Data" problem: It works even when you only have a tiny number of labeled books (3 to 10 per genre).
  2. It stops the Librarian from lying to themselves: By using the Detective as a judge, the system filters out the Librarian's confident but wrong guesses.
  3. It learns from mistakes: Most systems only learn from what they get right. This system specifically targets the "Hard" disagreements to teach the Librarian how to fix their blind spots.

The Result

In the experiments, this "Collaborative Camp" (GNN-as-Judge) beat all other methods. It turned the Librarian into a super-expert who can read the stories and understand the library layout, even when they started with very little information.

In short: It's a team-up where a text-expert (LLM) and a structure-expert (GNN) play a game of "Guess the Genre," and the structure-expert acts as a referee to correct the text-expert, making them much smarter than they could ever be alone.
