Graph-of-Mark: Promote Spatial Reasoning in Multimodal Language Models with Graph-Based Visual Prompting

The paper proposes Graph-of-Mark (GoM), a novel pixel-level visual prompting technique that overlays scene graphs onto images to capture object relationships, thereby significantly enhancing the spatial reasoning and zero-shot performance of multimodal language models.

Giacomo Frisoni, Lorenzo Molfetta, Mattia Buzzoni, Gianluca Moro

Published Tue, 10 Ma

Imagine you are trying to explain a complex scene to a friend who has never seen it before. You might say, "There's a cat on the sofa." But if you just hand them a photo of the room without pointing anything out, they might look at the cat, the sofa, and the lamp yet miss how those things relate to each other. They might think the cat is floating, or that the lamp is actually the cat's head.

This is exactly the problem researchers found with Multimodal Language Models (MLMs)—the super-smart AI systems that can "see" pictures and "read" text. While these AIs are great at recognizing objects (like "that's a toaster" or "that's a plant"), they often struggle with spatial reasoning. They tend to see a picture as a "bag of objects" rather than a connected world where things have positions, distances, and relationships (like "the plant is above the toaster").

The Old Way: The "Numbered Sticker" Approach

Previously, researchers tried to fix this using a method called Set-of-Mark (SoM). Imagine taking a photo and sticking numbered stickers on every object: "1" on the toaster, "2" on the plant. You then ask the AI, "Is object 2 below object 1?"

This helped the AI point to things, but it was like giving someone a map with dots but no roads. The AI knew where the dots were, but it didn't inherently understand the connection between them. It still had to guess if the plant was "above" or "below" the toaster just by looking at the pixels.
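In code, Set-of-Mark boils down to stamping a numeric ID onto each detected object's bounding box before handing the image to the model. Here is a minimal sketch using Pillow; the object boxes are hypothetical stand-ins for a detector's output, not the paper's actual pipeline:

```python
from PIL import Image, ImageDraw

# Hypothetical bounding boxes, e.g. from an off-the-shelf object detector.
boxes = {1: (40, 160, 120, 220),   # "toaster"
         2: (50, 40, 110, 120)}    # "plant"

img = Image.new("RGB", (200, 260), "white")  # stand-in for a real photo
draw = ImageDraw.Draw(img)

# Set-of-Mark: draw each box and stamp its ID on it -- dots, but no roads.
for obj_id, (x0, y0, x1, y1) in boxes.items():
    draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
    draw.text((x0 + 3, y0 + 3), str(obj_id), fill="red")

img.save("som_prompt.png")  # this marked-up image is what the MLM receives
```

Notice what is missing: nothing in the overlay says how box 1 relates to box 2. The model still has to infer "above" or "below" from raw pixel positions.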

The New Way: Graph-of-Mark (GoM)

The authors of this paper propose a new, smarter way called Graph-of-Mark (GoM).

Think of GoM not just as putting stickers on a photo, but as drawing a connect-the-dots map with a legend directly onto the image.

  1. The Nodes (The Dots): Just like before, the AI identifies objects (the toaster, the plant).
  2. The Edges (The Lines): This is the magic part. GoM draws arrows between the objects.
    • If the plant is above the toaster, it draws an arrow pointing up with a label saying "Above."
    • If the plant is behind the toaster, it draws a line indicating depth.
    • It even adds little text boxes on the arrows to say exactly what the relationship is.
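Drawing the graph itself is cheap image manipulation. Here is a minimal Pillow sketch of the idea, assuming a hypothetical two-node scene graph; the object names, coordinates, and drawing style are illustrative, not the paper's exact recipe:

```python
from PIL import Image, ImageDraw

# Hypothetical scene graph: nodes are detected objects, edges are spatial
# relations (labels and coordinates here are invented for illustration).
nodes = {1: ("toaster", (80, 190)),   # (label, centre point)
         2: ("plant",   (80, 80))}
edges = [(2, 1, "above")]             # the plant is above the toaster

img = Image.new("RGB", (200, 260), "white")  # stand-in for a real photo
draw = ImageDraw.Draw(img)

# Graph-of-Mark: draw a labelled line between related objects...
for src, dst, relation in edges:
    (x0, y0), (x1, y1) = nodes[src][1], nodes[dst][1]
    draw.line((x0, y0, x1, y1), fill="blue", width=2)
    mid = ((x0 + x1) // 2, (y0 + y1) // 2)
    draw.text(mid, relation, fill="blue")  # edge label: the relationship

# ...then draw the numbered nodes on top, with their object names.
for obj_id, (label, (cx, cy)) in nodes.items():
    draw.ellipse((cx - 12, cy - 12, cx + 12, cy + 12), outline="red", width=2)
    draw.text((cx - 5, cy - 6), str(obj_id), fill="red")
    draw.text((cx + 14, cy - 6), label, fill="black")

img.save("gom_prompt.png")  # the MLM now sees nodes *and* labelled edges
```

The crucial difference from the numbered-sticker approach is the loop over `edges`: the relationship "above" is now painted into the pixels, so the model can read it off instead of computing it.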

Why This is a Game-Changer

The paper argues that by drawing these relationships directly on the image, the AI doesn't have to "guess" the math of space anymore. The visual prompt literally shows the AI the answer.

  • Analogy Time: Imagine you are teaching a child to navigate a maze.
    • Old Method (SoM): You give them a map with dots labeled "Start," "Turn," and "Exit." They have to figure out which way to turn.
    • GoM Method: You give them the same map, but you also draw a bright, glowing arrow path from Start to Exit with signs saying "Turn Left Here." The path is now impossible to miss.

What They Found

The researchers tested this on three different AI models and four different types of visual puzzles (like answering questions about images or finding specific objects).

  • Better Accuracy: The models got significantly better at answering questions like "Is the plant below the oven?" (The answer went from being a guess to being almost always correct).
  • No Heavy Lifting: The best part? They didn't have to retrain the AI or teach it new math. They just changed the input (the picture they showed the AI). It's a "plug-and-play" upgrade.
  • Visual vs. Text: They found that drawing the graph on the image worked better than just describing the graph in text. The AI "sees" the relationship faster when it's drawn on the picture.

The Bottom Line

Graph-of-Mark is like giving an AI a pair of glasses that highlight not just what things are, but how they fit together. It turns a chaotic pile of objects into a structured, easy-to-understand story.

This is a big deal because it helps AI understand the world more like humans do—by seeing connections, not just isolated items. This could lead to better robots that don't knock over cups because they understand "above" and "below," or medical AI that can better understand the layout of organs in an X-ray.

In short: They stopped treating the image like a list of items and started treating it like a connected map, and the AI got much smarter at navigating that map.