Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

By probing LVLMs with a synthetic directed graph dataset, this study reveals that while node and structural information are linearly encoded early in the vision encoder, edge representations emerge only later in the language model's text tokens, explaining the models' persistent struggles with relational understanding.

Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

Published 2026-03-04

Imagine you have a super-smart robot assistant (a Large Vision-Language Model, or LVLM) that can look at pictures and answer questions about them. You show it a diagram of a flowchart with circles (nodes) and arrows (edges) connecting them.

If you ask, "What color is the circle labeled 'A'?", the robot answers correctly almost every time.
But if you ask, "What color is the arrow pointing from A to B?" or "Which way is the arrow pointing?", the robot often gets it wrong or guesses randomly.

This paper is like a detective story where the authors try to figure out why this robot is so good at seeing the "dots" but so bad at understanding the "lines" connecting them. They did this by building a special "training gym" for the robot and then peeking inside its brain to see how it processes information.

Here is the breakdown of their findings using simple analogies:

1. The Setup: A Synthetic Playground

Real-world diagrams are messy. To get a clear answer, the authors didn't use real business charts or scientific graphs. Instead, they built a digital Lego set.

  • They created thousands of simple diagrams with colored shapes (nodes) and lines (edges).
  • They controlled every detail: the color, the shape, the direction of the arrows, and the layout.
  • This allowed them to ask very specific questions, like "Is the line between A and B red?" without the robot getting confused by messy handwriting or complex backgrounds.

2. The Investigation: Peeking Inside the Brain

The robot has two main parts in its brain:

  • The Vision Encoder: The "eyes" that look at the picture.
  • The Language Model: The "mind" that reads the question and thinks of an answer.

The authors used a technique called "Probing": training a tiny, simple classifier on the activations at each layer to check whether a given piece of information (say, the arrow's color) can be read out at that point. Imagine you have a radio receiver. You can tune into different frequencies (layers of the brain) to hear what kind of information is being broadcast at that moment. The question becomes: "At what point in the robot's brain does it clearly 'know' the color of the arrow?"
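To make "probing" concrete, here is a minimal numpy sketch (not the paper's code) of the idea: freeze some activations, plant a linear "node color" signal in them, and fit a simple linear classifier on top. The shapes, signal strength, and ridge solver are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden states: 200 "images", 16-dim activations.
# We plant a linear signal for a binary "node color" label in one direction,
# mimicking a feature that this layer happens to encode linearly.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
signal_dir = rng.normal(size=d)
activations = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, signal_dir)

# A linear probe is just a linear classifier trained on frozen activations.
# Closed-form ridge-regression probe on a train split:
train, test = slice(0, 150), slice(150, None)
X, y = activations[train], labels[train] * 2 - 1  # targets in {-1, +1}
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ y)

preds = (activations[test] @ w > 0).astype(int)
accuracy = (preds == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")  # high when the feature is linear
```

High probe accuracy means the information is "easy to grab" at that layer; chance-level accuracy means it is still tangled up, even if the network recovers it later.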

3. The Big Discovery: "Nodes Are Early, Edges Are Late"

This is the core finding, and it explains the robot's struggle.

The "Nodes" (The Dots) are Early Birds:

  • When the robot looks at the picture, the information about the dots (their color, shape, and how many there are) is immediately clear and organized in the "eyes" (Vision Encoder).
  • Analogy: It's like looking at a bowl of fruit. You can instantly see the red apple and the green pear. The information is right there, easy to grab, and "linearly separable" (meaning a simple straight line, or flat plane, can split "red" from "green" inside the robot's brain — no complicated processing needed).

The "Edges" (The Arrows) are Late Bloomers:

  • The information about the lines and arrows is not clear in the "eyes." In the vision part of the brain, the arrow's color and direction are all jumbled up and mixed with the background. You can't easily pick them out.
  • Analogy: Imagine trying to find a specific thread in a tangled ball of yarn. In the "eyes," the thread is just part of the mess.
  • The Magic Happens Later: The arrow information only becomes clear and organized after the robot reads the question. It moves into the "mind" (Language Model). Once the robot reads "What color is the arrow between A and B?", it suddenly pulls the arrow information out of the mess and organizes it.
  • Why this matters: Because the robot has to wait until it reads the text to organize the arrow information, it has to do extra mental gymnastics. It has to combine the picture data with the text data to make sense of the relationship. This extra step is where the robot often trips up.
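The "early vs. late" pattern above is typically measured by fitting one probe per layer and watching where accuracy jumps. Below is a hypothetical simulation of that sweep (again not the paper's data): node information is linearly present from the first layer, while edge information only becomes linear in the last two layers.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_layers = 300, 32, 8

node_labels = rng.integers(0, 2, size=n)
edge_labels = rng.integers(0, 2, size=n)
node_dir = rng.normal(size=d)
edge_dir = rng.normal(size=d)

def probe_accuracy(X, labels):
    """Fit a ridge linear probe on half the data, score on the other half."""
    y = labels * 2 - 1
    half = n // 2
    w = np.linalg.solve(X[:half].T @ X[:half] + 1e-3 * np.eye(d),
                        X[:half].T @ y[:half])
    return ((X[half:] @ w > 0).astype(int) == labels[half:]).mean()

node_acc, edge_acc = [], []
for layer in range(n_layers):
    # Simulated activations: node info is linear at every layer,
    # edge info only becomes linear in the last two layers.
    X = rng.normal(size=(n, d))
    X += np.outer(node_labels * 2 - 1, node_dir)
    if layer >= n_layers - 2:
        X += np.outer(edge_labels * 2 - 1, edge_dir)
    node_acc.append(probe_accuracy(X, node_labels))
    edge_acc.append(probe_accuracy(X, edge_labels))
    print(f"layer {layer}: node={node_acc[-1]:.2f} edge={edge_acc[-1]:.2f}")
```

In this toy sweep, node probes succeed everywhere while edge probes sit at chance until the final layers — the same qualitative signature the paper reports for real LVLM layers.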

4. The Proof: The "Brain Surgery" Experiment

To prove that the robot actually uses this organized information, the authors performed a "causal intervention."

  • The Experiment: They found the specific parts of the robot's brain where the "dot" information was clear. Then, they essentially "scrambled" or erased that specific information before the robot gave its answer.
  • The Result: When they scrambled the "dot" info, the robot's answers about the dots became terrible.
  • The Contrast: When they tried to scramble the "arrow" info (which was still messy in the vision part), the robot's performance didn't change much. This confirmed that the robot wasn't relying on the messy vision data for arrows; it was waiting for the later "mind" processing to figure it out.

5. The Takeaway

The paper concludes that Large Vision-Language Models process different parts of a diagram in different ways:

  • Local things (like a single dot's color) are processed quickly and easily by the visual system.
  • Relational things (like the direction of an arrow connecting two dots) are hard. They require the robot to mix visual data with language data, which is a much more complex and abstract process.

In everyday terms:
The robot is great at spotting objects (like a red circle) but struggles with relationships (like an arrow pointing from one circle to another). It's like a person who can easily tell you the color of a car in a photo, but if you ask them to trace the exact path of a road connecting two cities on a map, they might get lost. The "path" requires more abstract thinking than just "seeing" the car.

This explains why AI is getting better at describing pictures but still struggles with complex logic puzzles involving diagrams. It's not that it can't "see" the lines; it's that it hasn't learned to organize the lines until it's already trying to speak.