SGG-R3: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

The paper introduces SGG-R3, a structured reasoning framework that combines chain-of-thought-guided supervised fine-tuning with relation augmentation and a novel dual-granularity reward scheme in reinforcement learning to achieve end-to-end unbiased Scene Graph Generation with improved recall and reduced bias on long-tailed distributions.

Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li

Published 2026-03-10

Imagine you are looking at a busy street scene. Your brain instantly understands: "There's a man wearing a hat, standing on a sidewalk, next to a bus."

Scene Graph Generation (SGG) is the task of teaching a computer to do exactly that: turn a picture into a structured list of objects and their relationships.

However, current AI models are like a student who is very smart but gets distracted easily. They often:

  1. Miss things: They see the man but forget the hat.
  2. Guess wrong: They might say the man is riding the bus when he's just standing next to it.
  3. Get stuck on the obvious: They only know common things like "person on chair" but fail to understand rare or complex interactions (like "person holding a specific type of tool").

The paper introduces SGG-R3, a new training method designed to fix these issues. Think of it as a three-step coaching program for an AI, moving it from a chaotic guesser to a structured detective.

Here is how SGG-R3 works, explained with simple analogies:

1. The Problem: The "Blank Page" Panic

Current AI models try to look at a picture and spit out the whole story at once. It's like asking a student to write a 10-page essay in one breath without an outline. They get overwhelmed, hallucinate (make things up), and miss details.

2. The Solution: The "Three-Stage Detective" (Structured Reasoning)

SGG-R3 forces the AI to break the job down into three strict steps, like a detective solving a case:

  • Stage 1: The List Maker (Category Detection)
    • What it does: Before looking for details, the AI just lists what kinds of things are in the picture. "Okay, I see: a person, a car, a tree, and a building."
    • Why it helps: It narrows the search. The AI doesn't waste energy looking for a "dog" if there are no dogs in the picture.
  • Stage 2: The Spotter (Instance Grounding)
    • What it does: Now that it knows what to look for, it finds where they are. "Okay, there are two people: Person #1 and Person #2. Here are their exact locations."
    • Why it helps: It prevents the AI from mixing up objects or missing duplicates.
  • Stage 3: The Connector (Relation Extraction)
    • What it does: Finally, it connects the dots. "Person #1 is wearing a hat. Person #2 is standing on the sidewalk."
    • Why it helps: By doing this last, the AI has a clear map of the scene to build relationships on, rather than guessing blindly.
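The three stages above can be sketched in a few lines of Python. Everything here is illustrative: the `SCENE` dict stands in for a real image (in the paper, the model produces these outputs from pixels), and the helper names are my own, not the paper's API.

```python
# Toy stand-in for an image: pre-annotated instances and relations.
SCENE = {
    "instances": [
        {"id": 0, "category": "person",   "box": (40, 60, 120, 300)},
        {"id": 1, "category": "hat",      "box": (55, 40, 100, 70)},
        {"id": 2, "category": "sidewalk", "box": (0, 280, 640, 360)},
    ],
    "relations": [(0, "wearing", 1), (0, "standing on", 2)],
}

def detect_categories(scene):
    # Stage 1 (List Maker): just the category names, no locations yet.
    return sorted({inst["category"] for inst in scene["instances"]})

def ground_instances(scene, categories):
    # Stage 2 (Spotter): locate every instance of the Stage-1 categories,
    # keeping duplicates separate (person #1 vs. person #2).
    return [i for i in scene["instances"] if i["category"] in categories]

def extract_relations(scene, instances):
    # Stage 3 (Connector): connect grounded instances with predicates.
    by_id = {i["id"]: i for i in instances}
    return [(by_id[s]["category"], pred, by_id[o]["category"])
            for s, pred, o in scene["relations"]
            if s in by_id and o in by_id]

categories = detect_categories(SCENE)
instances = ground_instances(SCENE, categories)
triples = extract_relations(SCENE, instances)
print(triples)  # [('person', 'wearing', 'hat'), ('person', 'standing on', 'sidewalk')]
```

The point of the decomposition is visible in the function signatures: Stage 3 only ever sees instances that Stage 2 grounded, and Stage 2 only looks for categories that Stage 1 listed, so each step works from a narrowed, verified input instead of the raw scene.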

3. The Secret Sauce: Fixing the "Rare Item" Problem

AI models are bad at rare things. If a dataset has 1,000 pictures of "cats on mats" but only 5 pictures of "cats on toasters," the AI will almost always guess "on mats."

SGG-R3 uses two clever tricks to fix this:

  • Trick A: The "Creative Writing" Homework (Relation Augmentation)

    • The researchers used a super-smart AI (Qwen2.5-VL) to look at the pictures and write new stories about them.
    • Analogy: Imagine a teacher giving a student a photo of a kitchen and asking, "What could be happening here?" The AI generates new, plausible relationships (e.g., "The spoon is in the cup") that weren't in the original textbook.
    • They then filter these new stories to make sure they make sense, effectively giving the AI more practice examples for rare situations.
  • Trick B: The "Fair Grading" System (Dual-Granularity Reward)

    • When the AI practices, it gets a score. Usually, the AI gets a high score just for getting the common things right (like "man on chair").
    • SGG-R3 introduces a two-part grading system:
      1. Exact Match: Did you get the specific relationship right? (e.g., "Man wearing red shirt").
      2. Semantic Match: Did you get the vibe right? (e.g., If the AI said "Man dressed in red shirt," it still gets points because it's semantically similar).
    • The Twist: The system gives extra bonus points for getting the rare, difficult relationships right. This forces the AI to stop ignoring the "long-tail" (rare) items and pay attention to them.
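Trick B can be sketched as a small scoring function. The synonym table, the 0.5 partial-credit weight, and the inverse-frequency rarity bonus below are illustrative assumptions of mine; the paper's actual reward formulation will differ in detail.

```python
# Hypothetical lookup tables for the sketch (not from the paper).
PREDICATE_SYNONYMS = {"wearing": {"dressed in"}, "standing on": {"on"}}
PREDICATE_FREQ = {"standing on": 900, "wearing": 80, "dressed in": 80}

def rarity_weight(pred, alpha=1.0):
    # Rare predicates earn a larger weight (inverse-frequency style),
    # so the model can't coast on common relationships alone.
    return 1.0 + alpha / PREDICATE_FREQ.get(pred, 1)

def triple_reward(pred_triple, gold_triple):
    (s1, p1, o1), (s2, p2, o2) = pred_triple, gold_triple
    if (s1, o1) != (s2, o2):
        return 0.0                      # wrong subject/object pair: no credit
    w = rarity_weight(p2)
    if p1 == p2:
        return 1.0 * w                  # exact match: full, rarity-weighted credit
    if p1 in PREDICATE_SYNONYMS.get(p2, set()):
        return 0.5 * w                  # semantic match: partial credit
    return 0.0

# "dressed in" isn't the exact gold predicate "wearing",
# but it still earns partial, rarity-weighted credit.
r = triple_reward(("man", "dressed in", "shirt"), ("man", "wearing", "shirt"))
print(r)
```

Note how the two granularities interact: the semantic tier stops the model from being punished for near-misses in wording, while the rarity weight makes a correct rare triple worth more than a correct common one.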

4. The Result: A Smarter, More Balanced AI

By combining these steps, SGG-R3 teaches the AI to:

  • Think step-by-step instead of guessing.
  • Learn from made-up examples to handle rare situations.
  • Get rewarded for being thorough, not just for being safe.

In a nutshell:
If traditional AI is like a student who memorizes the most common answers and guesses the rest, SGG-R3 is like a student who learns a strict study method, practices with a tutor who invents new scenarios, and gets graded fairly on both common and difficult questions. The result is a system that sees the whole picture, not just the obvious parts.