Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

The paper proposes Graph-of-Mark (GoM), a novel pixel-level visual prompting technique that overlays scene graphs onto images to capture object relationships, thereby significantly enhancing the spatial reasoning and zero-shot performance of multimodal language models.

Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro

Published Tue, 10 Ma

Imagine you are trying to explain a complex scene to a friend who has never seen it before. You might say, "There's a cat on the sofa." But if you just hand them a photo of the room without pointing anything out, they might look at the cat, the sofa, and the lamp yet miss how those things relate to each other. They might think the cat is floating, or that the lamp is actually the cat's head.

This is exactly the problem researchers found with Multimodal Language Models (MLMs)—the super-smart AI systems that can "see" pictures and "read" text. While these AIs are great at recognizing objects (like "that's a toaster" or "that's a plant"), they often struggle with spatial reasoning. They tend to see a picture as a "bag of objects" rather than a connected world where things have positions, distances, and relationships (like "the plant is above the toaster").

The Old Way: The "Numbered Sticker" Approach

Previously, researchers tried to fix this using a method called Set-of-Mark (SoM). Imagine taking a photo and sticking numbered stickers on every object: "1" on the toaster, "2" on the plant. You then ask the AI, "Is object 2 below object 1?"

This helped the AI point to things, but it was like giving someone a map with dots but no roads. The AI knew where the dots were, but it didn't inherently understand the connection between them. It still had to guess if the plant was "above" or "below" the toaster just by looking at the pixels.
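In code, Set-of-Mark boils down to stamping a numeric ID onto each detected object's bounding box before handing the image to the model. Here is a minimal sketch using Pillow; the object boxes are hypothetical stand-ins for a detector's output, not the paper's actual pipeline:

```python
from PIL import Image, ImageDraw

# Hypothetical bounding boxes, e.g. from an off-the-shelf object detector.
boxes = {1: (40, 160, 120, 220),   # "toaster"
         2: (50, 40, 110, 120)}    # "plant"

img = Image.new("RGB", (200, 260), "white")  # stand-in for a real photo
draw = ImageDraw.Draw(img)

# Set-of-Mark: draw each box and stamp its ID on it -- dots, but no roads.
for obj_id, (x0, y0, x1, y1) in boxes.items():
    draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
    draw.text((x0 + 3, y0 + 3), str(obj_id), fill="red")

img.save("som_prompt.png")  # this marked-up image is what the MLM receives
```

Notice what is missing: nothing in the overlay says how box 1 relates to box 2. The model still has to infer "above" or "below" from raw pixel positions.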

The New Way: Graph-of-Mark (GoM)

The authors of this paper propose a new, smarter way called Graph-of-Mark (GoM).

Think of GoM not just as putting stickers on a photo, but as drawing a connect-the-dots map with a legend directly onto the image.

  1. The Nodes (The Dots): Just like before, the AI identifies objects (the toaster, the plant).
  2. The Edges (The Lines): This is the magic part. GoM draws arrows between the objects.
    • If the plant is above the toaster, it draws an arrow pointing up with a label saying "Above."
    • If the plant is behind the toaster, it draws a line indicating depth.
    • It even adds little text boxes on the arrows to say exactly what the relationship is.
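Drawing the graph itself is cheap image manipulation. Here is a minimal Pillow sketch of the idea, assuming a hypothetical two-node scene graph; the object names, coordinates, and drawing style are illustrative, not the paper's exact recipe:

```python
from PIL import Image, ImageDraw

# Hypothetical scene graph: nodes are detected objects, edges are spatial
# relations (labels and coordinates here are invented for illustration).
nodes = {1: ("toaster", (80, 190)),   # (label, centre point)
         2: ("plant",   (80, 80))}
edges = [(2, 1, "above")]             # the plant is above the toaster

img = Image.new("RGB", (200, 260), "white")  # stand-in for a real photo
draw = ImageDraw.Draw(img)

# Graph-of-Mark: draw a labelled line between related objects...
for src, dst, relation in edges:
    (x0, y0), (x1, y1) = nodes[src][1], nodes[dst][1]
    draw.line((x0, y0, x1, y1), fill="blue", width=2)
    mid = ((x0 + x1) // 2, (y0 + y1) // 2)
    draw.text(mid, relation, fill="blue")  # edge label: the relationship

# ...then draw the numbered nodes on top, with their object names.
for obj_id, (label, (cx, cy)) in nodes.items():
    draw.ellipse((cx - 12, cy - 12, cx + 12, cy + 12), outline="red", width=2)
    draw.text((cx - 5, cy - 6), str(obj_id), fill="red")
    draw.text((cx + 14, cy - 6), label, fill="black")

img.save("gom_prompt.png")  # the MLM now sees nodes *and* labelled edges
```

The crucial difference from the numbered-sticker approach is the loop over `edges`: the relationship "above" is now painted into the pixels, so the model can read it off instead of computing it.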

Why This is a Game-Changer

The paper argues that by drawing these relationships directly on the image, the AI doesn't have to "guess" the math of space anymore. The visual prompt literally shows the AI the answer.

  • Analogy Time: Imagine you are teaching a child to navigate a maze.
    • Old Method (SoM): You give them a map with dots labeled "Start," "Turn," and "Exit." They have to figure out which way to turn.
    • GoM Method: You give them the same map, but you also draw a bright, glowing arrow path from Start to Exit with signs saying "Turn Left Here." The path is now impossible to miss.

What They Found

The researchers tested this on three different AI models and four different types of visual puzzles (like answering questions about images or finding specific objects).

  • Better Accuracy: The models got significantly better at answering questions like "Is the plant below the oven?" (The answer went from being a guess to being almost always correct).
  • No Heavy Lifting: The best part? They didn't have to retrain the AI or teach it new math. They just changed the input (the picture they showed the AI). It's a "plug-and-play" upgrade.
  • Visual vs. Text: They found that drawing the graph on the image worked better than just describing the graph in text. The AI "sees" the relationship faster when it's drawn on the picture.

The Bottom Line

Graph-of-Mark is like giving an AI a pair of glasses that highlight not just what things are, but how they fit together. It turns a chaotic pile of objects into a structured, easy-to-understand story.

This is a big deal because it helps AI understand the world more like humans do—by seeing connections, not just isolated items. This could lead to better robots that don't knock over cups because they understand "above" and "below," or medical AI that can better understand the layout of organs in an X-ray.

In short: They stopped treating the image like a list of items and started treating it like a connected map, and the AI got much smarter at navigating that map.