SGG-R3: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

The paper introduces SGG-R3, a structured reasoning framework that combines chain-of-thought-guided supervised fine-tuning with relation augmentation and a novel dual-granularity reward scheme in reinforcement learning to achieve end-to-end unbiased Scene Graph Generation with improved recall and reduced bias on long-tailed distributions.

Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li

Published 2026-03-10

Imagine you are looking at a busy street scene. Your brain instantly understands: "There's a man wearing a hat, standing on a sidewalk, next to a bus."

Scene Graph Generation (SGG) is the task of teaching a computer to do exactly that: turn a picture into a structured list of objects and their relationships.

However, current AI models are like a student who is very smart but gets distracted easily. They often:

  1. Miss things: They see the man but forget the hat.
  2. Guess wrong: They might say the man is riding the bus when he's just standing next to it.
  3. Get stuck on the obvious: They only know common things like "person on chair" but fail to understand rare or complex interactions (like "person holding a specific type of tool").

The paper introduces SGG-R3, a new training method designed to fix these issues. Think of it as a three-step coaching program for an AI, moving it from a chaotic guesser to a structured detective.

Here is how SGG-R3 works, explained with simple analogies:

1. The Problem: The "Blank Page" Panic

Current AI models try to look at a picture and spit out the whole story at once. It's like asking a student to write a 10-page essay in one breath without an outline. They get overwhelmed, hallucinate (make things up), and miss details.

2. The Solution: The "Three-Stage Detective" (Structured Reasoning)

SGG-R3 forces the AI to break the job down into three strict steps, like a detective solving a case:

  • Stage 1: The List Maker (Category Detection)
    • What it does: Before looking for details, the AI just lists what kinds of things are in the picture. "Okay, I see: a person, a car, a tree, and a building."
    • Why it helps: It narrows the search. The AI doesn't waste energy looking for a "dog" if there are no dogs in the picture.
  • Stage 2: The Spotter (Instance Grounding)
    • What it does: Now that it knows what to look for, it finds where they are. "Okay, there are two people: Person #1 and Person #2. Here are their exact locations."
    • Why it helps: It prevents the AI from mixing up objects or missing duplicates.
  • Stage 3: The Connector (Relation Extraction)
    • What it does: Finally, it connects the dots. "Person #1 is wearing a hat. Person #2 is standing on the sidewalk."
    • Why it helps: By doing this last, the AI has a clear map of the scene to build relationships on, rather than guessing blindly.
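The three stages above can be sketched in a few lines of Python. Everything here is illustrative: the `SCENE` dict stands in for a real image (in the paper, the model produces these outputs from pixels), and the helper names are my own, not the paper's API.

```python
# Toy stand-in for an image: pre-annotated instances and relations.
SCENE = {
    "instances": [
        {"id": 0, "category": "person",   "box": (40, 60, 120, 300)},
        {"id": 1, "category": "hat",      "box": (55, 40, 100, 70)},
        {"id": 2, "category": "sidewalk", "box": (0, 280, 640, 360)},
    ],
    "relations": [(0, "wearing", 1), (0, "standing on", 2)],
}

def detect_categories(scene):
    # Stage 1 (List Maker): just the category names, no locations yet.
    return sorted({inst["category"] for inst in scene["instances"]})

def ground_instances(scene, categories):
    # Stage 2 (Spotter): locate every instance of the Stage-1 categories,
    # keeping duplicates separate (person #1 vs. person #2).
    return [i for i in scene["instances"] if i["category"] in categories]

def extract_relations(scene, instances):
    # Stage 3 (Connector): connect grounded instances with predicates.
    by_id = {i["id"]: i for i in instances}
    return [(by_id[s]["category"], pred, by_id[o]["category"])
            for s, pred, o in scene["relations"]
            if s in by_id and o in by_id]

categories = detect_categories(SCENE)
instances = ground_instances(SCENE, categories)
triples = extract_relations(SCENE, instances)
print(triples)  # [('person', 'wearing', 'hat'), ('person', 'standing on', 'sidewalk')]
```

The point of the decomposition is visible in the function signatures: Stage 3 only ever sees instances that Stage 2 grounded, and Stage 2 only looks for categories that Stage 1 listed, so each step works from a narrowed, verified input instead of the raw scene.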

3. The Secret Sauce: Fixing the "Rare Item" Problem

AI models are bad at rare things. If a dataset has 1,000 pictures of "cats on mats" but only 5 pictures of "cats on toasters," the AI will almost always guess "on mats."

SGG-R3 uses two clever tricks to fix this:

  • Trick A: The "Creative Writing" Homework (Relation Augmentation)

    • The researchers used a super-smart AI (Qwen2.5-VL) to look at the pictures and write new stories about them.
    • Analogy: Imagine a teacher giving a student a photo of a kitchen and asking, "What could be happening here?" The AI generates new, plausible relationships (e.g., "The spoon is in the cup") that weren't in the original textbook.
    • They then filter these new stories to make sure they make sense, effectively giving the AI more practice examples for rare situations.
  • Trick B: The "Fair Grading" System (Dual-Granularity Reward)

    • When the AI practices, it gets a score. Usually, the AI gets a high score just for getting the common things right (like "man on chair").
    • SGG-R3 introduces a two-part grading system:
      1. Exact Match: Did you get the specific relationship right? (e.g., "Man wearing red shirt").
      2. Semantic Match: Did you get the vibe right? (e.g., If the AI said "Man dressed in red shirt," it still gets points because it's semantically similar).
    • The Twist: The system gives extra bonus points for getting the rare, difficult relationships right. This forces the AI to stop ignoring the "long-tail" (rare) items and pay attention to them.
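Trick B can be sketched as a small scoring function. The synonym table, the 0.5 partial-credit weight, and the inverse-frequency rarity bonus below are illustrative assumptions of mine; the paper's actual reward formulation will differ in detail.

```python
# Hypothetical lookup tables for the sketch (not from the paper).
PREDICATE_SYNONYMS = {"wearing": {"dressed in"}, "standing on": {"on"}}
PREDICATE_FREQ = {"standing on": 900, "wearing": 80, "dressed in": 80}

def rarity_weight(pred, alpha=1.0):
    # Rare predicates earn a larger weight (inverse-frequency style),
    # so the model can't coast on common relationships alone.
    return 1.0 + alpha / PREDICATE_FREQ.get(pred, 1)

def triple_reward(pred_triple, gold_triple):
    (s1, p1, o1), (s2, p2, o2) = pred_triple, gold_triple
    if (s1, o1) != (s2, o2):
        return 0.0                      # wrong subject/object pair: no credit
    w = rarity_weight(p2)
    if p1 == p2:
        return 1.0 * w                  # exact match: full, rarity-weighted credit
    if p1 in PREDICATE_SYNONYMS.get(p2, set()):
        return 0.5 * w                  # semantic match: partial credit
    return 0.0

# "dressed in" isn't the exact gold predicate "wearing",
# but it still earns partial, rarity-weighted credit.
r = triple_reward(("man", "dressed in", "shirt"), ("man", "wearing", "shirt"))
print(r)
```

Note how the two granularities interact: the semantic tier stops the model from being punished for near-misses in wording, while the rarity weight makes a correct rare triple worth more than a correct common one.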

4. The Result: A Smarter, More Balanced AI

By combining these steps, SGG-R3 teaches the AI to:

  • Think step-by-step instead of guessing.
  • Learn from made-up examples to handle rare situations.
  • Get rewarded for being thorough, not just for being safe.

In a nutshell:
If traditional AI is like a student who memorizes the most common answers and guesses the rest, SGG-R3 is like a student who learns a strict study method, practices with a tutor who invents new scenarios, and gets graded fairly on both common and difficult questions. The result is a system that sees the whole picture, not just the obvious parts.