Imagine you have a small, smart assistant (a "Small Language Model" or SLM) who is very good at reading and writing stories but has never actually seen a map or a diagram. You want to ask this assistant to solve puzzles about a social network (who knows whom) or a road system (how cities are connected).
The problem is that the assistant can only read text, not pictures. So, you have to describe the network using words. The big question this paper asks is: "Does how we describe the network to the assistant matter, and does the way we ask it to think matter?"
Here is the breakdown of their findings, using some everyday analogies.
1. The Setup: Two Ways to Describe a Party
Imagine you are describing a party to your assistant. You need to tell it who is friends with whom. You have two ways to write this down:
- The "Edge List" (The Random Chat Log): You write a long, messy list of every single conversation that happened.
- Example: "Alice talked to Bob. Bob talked to Charlie. Charlie talked to Dave. Dave talked to Alice..."
- The Problem: The information is scattered. To figure out who is in the middle of the group, the assistant has to jump back and forth through the whole list, like trying to find a specific person in a crowded room by reading a random list of names.
- The "Adjacency List" (The Seating Chart): You group the information by person.
- Example: "Alice's friends: Bob, Dave. Bob's friends: Alice, Charlie..."
- The Advantage: The information is organized. The assistant can look at "Alice" and immediately see her whole circle. It's like looking at a seating chart where everyone's neighbors are right next to their name.
The Finding: The paper found that the Seating Chart (Adjacency List) is much better. When the assistant gets the organized list, it makes fewer mistakes and understands the "shape" of the group much better than when it gets the messy chat log.
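To make the two encodings concrete, here is a small sketch that builds both descriptions from the same toy friendship graph. The names and helper code are illustrative, not taken from the paper:

```python
# Two textual encodings of the same toy friendship graph.
from collections import defaultdict

edges = [("Alice", "Bob"), ("Bob", "Charlie"),
         ("Charlie", "Dave"), ("Dave", "Alice")]

# Edge list ("chat log"): one scattered fact per line.
edge_list_text = ". ".join(f"{u} talked to {v}" for u, v in edges) + "."

# Adjacency list ("seating chart"): the same facts grouped per person.
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

adjacency_text = " ".join(
    f"{name}'s friends: {', '.join(sorted(friends))}."
    for name, friends in adj.items()
)

print(edge_list_text)   # scattered: you must scan everything to learn about Alice
print(adjacency_text)   # grouped: Alice's whole circle sits on one line
```

Both strings carry exactly the same information; the paper's point is that the second layout is far easier for the model to use.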
2. The Thinking Style: How to Solve the Puzzle
Once the assistant has the data, how should it solve the puzzle? The researchers tested three methods:
- The "Gut Check" (Baseline): The assistant looks at the text and immediately guesses the answer.
- Result: Often wrong, especially for complex questions.
- The "Step-by-Step" (Chain-of-Thought): The assistant is asked to write down its thinking process before giving the answer. "First, I see Alice has 2 friends. Then I see Bob has 3..."
- Result: This helped a little bit, but not always. Sometimes the assistant got confused by its own long explanation.
- The "Committee Meeting" (Graph-of-Thoughts / GoT): This is the winner. The assistant doesn't just think once. It generates 15 different possible answers (like 15 different people on a committee), and then it takes the median (the middle value) of all those answers.
- Result: This was the most powerful method. By having the assistant "debate" with itself and taking the middle estimate, it became much more reliable. It's like asking 15 people to estimate the weight of a cow and taking the middle guess: one wildly wrong estimate can't drag the answer off, so you get a much better number than if you just asked one person.
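The committee idea can be sketched in a few lines. Here `ask_model` is a hypothetical stand-in for a real language-model call, not an API from the paper; the point is the sample-then-take-the-median aggregation:

```python
# Sketch of the "committee" aggregation, assuming a hypothetical
# ask_model() that stands in for a real SLM queried at nonzero
# temperature (so repeated calls give slightly different answers).
import random
import statistics

def ask_model(question: str) -> int:
    # Placeholder: simulates noisy numeric answers, including an
    # occasional wild outlier (the 9).
    return random.choice([4, 5, 5, 6, 9])

def committee_answer(question: str, n_samples: int = 15) -> float:
    # Sample several independent answers, then return the median.
    # Unlike the mean, the median ignores a lone wild guess.
    answers = [ask_model(question) for _ in range(n_samples)]
    return statistics.median(answers)

estimate = committee_answer("How many triangles are in this network?")
```

Taking the median rather than the mean is what makes the committee robust: one confused "committee member" cannot shift the final answer.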
3. The Big Takeaway: It's Not Just About Being "Smarter"
You might think, "If we just make the AI bigger and smarter, it will solve these problems."
The paper says: Not necessarily.
Even with a "small" AI (which is like a smart high school student rather than a PhD professor), you can get great results if you:
- Organize the data well (Use the Seating Chart, not the Chat Log).
- Use the right thinking strategy (Let the AI "debate" itself with a committee approach).
The Verdict
Small language models can understand complex structures like graphs, but they are very sensitive to how you feed them information.
- Bad Input + Bad Thinking: The assistant gets lost.
- Good Input + Good Thinking: The assistant can answer questions like "How many triangles are in this network?" or "Who is the most popular person?" with surprising accuracy.
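For reference, the two example questions have exact answers that can be computed directly from the adjacency list. A minimal sketch on a toy graph (names illustrative):

```python
# Ground-truth answers for the two example questions, computed on a
# small toy graph (a symmetric adjacency list, names illustrative).
from itertools import combinations

adj = {
    "Alice":   {"Bob", "Charlie", "Dave"},
    "Bob":     {"Alice", "Charlie"},
    "Charlie": {"Alice", "Bob", "Dave"},
    "Dave":    {"Alice", "Charlie"},
}

# Triangle count: every trio of people who are all friends with each other.
triangles = sum(
    1 for a, b, c in combinations(adj, 3)
    if b in adj[a] and c in adj[a] and c in adj[b]
)

# "Most popular person": the one with the most friends (highest degree).
most_popular = max(adj, key=lambda person: len(adj[person]))
```

This is what the assistant is being asked to reproduce from text alone, which is why a tidy, grouped description matters so much.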
In short: If you want a small AI to be good at math or logic puzzles involving connections, don't just dump raw data on it. Organize the data neatly and ask it to think in groups, not just alone.