Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models

By probing LVLMs with a synthetic directed graph dataset, this study reveals that while node and structural information are linearly encoded early in the vision encoder, edge representations emerge only later in the language model's text tokens, explaining the models' persistent struggles with relational understanding.

Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui

Published 2026-03-04

Imagine you have a super-smart robot assistant (a Large Vision-Language Model, or LVLM) that can look at pictures and answer questions about them. You show it a diagram of a flowchart with circles (nodes) and arrows (edges) connecting them.

If you ask, "What color is the circle labeled 'A'?", the robot answers correctly almost every time.
But if you ask, "What color is the arrow pointing from A to B?" or "Which way is the arrow pointing?", the robot often gets it wrong or guesses randomly.

This paper is like a detective story where the authors try to figure out why this robot is so good at seeing the "dots" but so bad at understanding the "lines" connecting them. They did this by building a special "training gym" for the robot and then peeking inside its brain to see how it processes information.

Here is the breakdown of their findings using simple analogies:

1. The Setup: A Synthetic Playground

Real-world diagrams are messy. To get a clear answer, the authors didn't use real business charts or scientific graphs. Instead, they built a digital Lego set.

  • They created thousands of simple diagrams with colored shapes (nodes) and lines (edges).
  • They controlled every detail: the color, the shape, the direction of the arrows, and the layout.
  • This allowed them to ask very specific questions, like "Is the line between A and B red?" without the robot getting confused by messy handwriting or complex backgrounds.

2. The Investigation: Peeking Inside the Brain

The robot has two main parts in its brain:

  • The Vision Encoder: The "eyes" that look at the picture.
  • The Language Model: The "mind" that reads the question and thinks of an answer.

The authors used a technique called "Probing": training a tiny, simple classifier on the activations at each layer to check whether a given piece of information (say, the arrow's color) can be read out at that point. Imagine you have a radio receiver. You can tune into different frequencies (layers of the brain) to hear what kind of information is being broadcast at that moment. The question becomes: "At what point in the robot's brain does it clearly 'know' the color of the arrow?"
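To make "probing" concrete, here is a minimal numpy sketch (not the paper's code) of the idea: freeze some activations, plant a linear "node color" signal in them, and fit a simple linear classifier on top. The shapes, signal strength, and ridge solver are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden states: 200 "images", 16-dim activations.
# We plant a linear signal for a binary "node color" label in one direction,
# mimicking a feature that this layer happens to encode linearly.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)
signal_dir = rng.normal(size=d)
activations = rng.normal(size=(n, d)) + np.outer(labels * 2 - 1, signal_dir)

# A linear probe is just a linear classifier trained on frozen activations.
# Closed-form ridge-regression probe on a train split:
train, test = slice(0, 150), slice(150, None)
X, y = activations[train], labels[train] * 2 - 1  # targets in {-1, +1}
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(d), X.T @ y)

preds = (activations[test] @ w > 0).astype(int)
accuracy = (preds == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")  # high when the feature is linear
```

High probe accuracy means the information is "easy to grab" at that layer; chance-level accuracy means it is still tangled up, even if the network recovers it later.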

3. The Big Discovery: "Nodes Are Early, Edges Are Late"

This is the core finding, and it explains the robot's struggle.

The "Nodes" (The Dots) are Early Birds:

  • When the robot looks at the picture, the information about the dots (their color, shape, and how many there are) is immediately clear and organized in the "eyes" (Vision Encoder).
  • Analogy: It's like looking at a bowl of fruit. You can instantly see the red apple and the green pear. The information is right there, easy to grab, and "linearly separable" (meaning a simple straight line, or flat plane, can split "red" from "green" inside the robot's brain — no complicated processing needed).

The "Edges" (The Arrows) are Late Bloomers:

  • The information about the lines and arrows is not clear in the "eyes." In the vision part of the brain, the arrow's color and direction are all jumbled up and mixed with the background. You can't easily pick them out.
  • Analogy: Imagine trying to find a specific thread in a tangled ball of yarn. In the "eyes," the thread is just part of the mess.
  • The Magic Happens Later: The arrow information only becomes clear and organized after the robot reads the question. It moves into the "mind" (Language Model). Once the robot reads "What color is the arrow between A and B?", it suddenly pulls the arrow information out of the mess and organizes it.
  • Why this matters: Because the robot has to wait until it reads the text to organize the arrow information, it has to do extra mental gymnastics. It has to combine the picture data with the text data to make sense of the relationship. This extra step is where the robot often trips up.
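The "early vs. late" pattern above is typically measured by fitting one probe per layer and watching where accuracy jumps. Below is a hypothetical simulation of that sweep (again not the paper's data): node information is linearly present from the first layer, while edge information only becomes linear in the last two layers.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, n_layers = 300, 32, 8

node_labels = rng.integers(0, 2, size=n)
edge_labels = rng.integers(0, 2, size=n)
node_dir = rng.normal(size=d)
edge_dir = rng.normal(size=d)

def probe_accuracy(X, labels):
    """Fit a ridge linear probe on half the data, score on the other half."""
    y = labels * 2 - 1
    half = n // 2
    w = np.linalg.solve(X[:half].T @ X[:half] + 1e-3 * np.eye(d),
                        X[:half].T @ y[:half])
    return ((X[half:] @ w > 0).astype(int) == labels[half:]).mean()

node_acc, edge_acc = [], []
for layer in range(n_layers):
    # Simulated activations: node info is linear at every layer,
    # edge info only becomes linear in the last two layers.
    X = rng.normal(size=(n, d))
    X += np.outer(node_labels * 2 - 1, node_dir)
    if layer >= n_layers - 2:
        X += np.outer(edge_labels * 2 - 1, edge_dir)
    node_acc.append(probe_accuracy(X, node_labels))
    edge_acc.append(probe_accuracy(X, edge_labels))
    print(f"layer {layer}: node={node_acc[-1]:.2f} edge={edge_acc[-1]:.2f}")
```

In this toy sweep, node probes succeed everywhere while edge probes sit at chance until the final layers — the same qualitative signature the paper reports for real LVLM layers.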

4. The Proof: The "Brain Surgery" Experiment

To prove that the robot actually uses this organized information, the authors performed a "causal intervention."

  • The Experiment: They found the specific parts of the robot's brain where the "dot" information was clear. Then, they essentially "scrambled" or erased that specific information before the robot gave its answer.
  • The Result: When they scrambled the "dot" info, the robot's answers about the dots became terrible.
  • The Contrast: When they tried to scramble the "arrow" info (which was still messy in the vision part), the robot's performance didn't change much. This confirmed that the robot wasn't relying on the messy vision data for arrows; it was waiting for the later "mind" processing to figure it out.

5. The Takeaway

The paper concludes that Large Vision-Language Models process different parts of a diagram in different ways:

  • Local things (like a single dot's color) are processed quickly and easily by the visual system.
  • Relational things (like the direction of an arrow connecting two dots) are hard. They require the robot to mix visual data with language data, which is a much more complex and abstract process.

In everyday terms:
The robot is great at spotting objects (like a red circle) but struggles with relationships (like an arrow pointing from one circle to another). It's like a person who can easily tell you the color of a car in a photo, but if you ask them to trace the exact path of a road connecting two cities on a map, they might get lost. The "path" requires more abstract thinking than just "seeing" the car.

This explains why AI is getting better at describing pictures but still struggles with complex logic puzzles involving diagrams. It's not that it can't "see" the lines; it's that it hasn't learned to organize the lines until it's already trying to speak.