Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

This paper introduces Synthetic Visual Genome 2 (SVG2), a large-scale automated panoptic video scene graph dataset with over 636K videos, and presents TraSeR, a novel model that leverages trajectory-aligned token mechanisms to significantly outperform existing baselines in scene graph generation and downstream video question answering tasks.

Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna

Published 2026-03-09

Imagine you are watching a busy street scene in a video. A human brain doesn't just see "a car" and "a person." It instantly understands a complex story: "The red car is speeding past the tall man, who is looking at his phone while waiting for the green light."

For decades, computers have been terrible at this. They can recognize objects, but they struggle to understand the relationships, the timing, and the details all at once. They get lost in the chaos of a moving video.

This paper introduces SVG2 (Synthetic Visual Genome 2) and TraSeR, a new system designed to teach computers how to "read" a video like a human does. Here is the breakdown in simple terms.

1. The Problem: The "Blind" Computer

Think of existing video datasets as a library with only a few thousand books, and those books are missing half their pages.

  • Too Small: Old datasets had very few videos.
  • Too Lazy: They often only described the first second of a video, missing what happened later.
  • Too Expensive: To fix this, humans would have to watch millions of hours of video and write down every detail (e.g., "blue shirt," "running," "holding a cup"). This is too slow and expensive to do manually.

2. The Solution: The "Robot Librarian" (SVG2)

The authors built a fully automated pipeline to create a massive new library called SVG2. Instead of hiring humans, they built a team of AI "robots" that work together to watch videos and write the story.

  • The Tracker (The Eye): First, they use a robot (SAM2) that acts like a super-accurate highlighter. It draws a mask around every single object in every frame of the video, even if the object disappears behind a wall and comes back. It keeps a perfect "ID card" for every object so it knows, "That's still the same red car."
  • The Describer (The Voice): Next, another robot (DAM) looks at each highlighted object and writes a detailed description. Instead of just "dog," it writes "a small, fluffy, golden dog."
  • The Storyteller (The Brain): Finally, a powerful AI (GPT-5) looks at the whole scene and figures out the relationships. It connects the dots: "The dog is chasing the ball," or "The ball is under the bench."

The Result: They created a dataset with 636,000 videos and millions of details. It's like turning a small pamphlet into a 100-volume encyclopedia of video life. They even had a few humans double-check the work, and the robots were right 93% of the time.

3. The New Model: TraSeR (The "Smart Reader")

Now that they have this massive library, they needed a student to learn from it. They built a new AI model called TraSeR.

Most AI models try to watch a video like a human watches a movie—frame by frame, getting overwhelmed by the sheer amount of data. TraSeR is different. It uses a clever trick called "Token Resampling."

  • The Analogy: Imagine trying to read a 10-hour movie script. If you read every single word, you'll get tired and forget the beginning by the time you reach the end.
  • TraSeR's Trick: Instead of reading every word, TraSeR creates a "summary card" for each character.
    • The Global Card: It summarizes the whole life of an object (e.g., "The man was wearing a hat the whole time").
    • The Moment Card: It also zooms in on short, specific moments to catch quick actions (e.g., "At second 14, the man dropped the hat").

By organizing the video data this way, TraSeR can process the whole story in one go without getting confused.
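The "summary card" idea can be sketched numerically: for each tracked object, compress its per-frame features into one global token (mean over the whole trajectory) plus a handful of short-window "moment" tokens. This is a hedged toy version of the concept, not the paper's actual resampling module; the function name and window size are assumptions.

```python
# Toy sketch of trajectory-level token resampling (illustrative, not TraSeR's code):
# each object's frame features become 1 "global" token + a few "moment" tokens,
# instead of one token per frame.
import numpy as np

def resample_trajectory(frame_feats: np.ndarray, window: int = 8) -> np.ndarray:
    """frame_feats: (T, D) features for one tracked object across T frames.
    Returns (1 + ceil(T / window), D): a whole-life token plus per-window tokens."""
    global_token = frame_feats.mean(axis=0, keepdims=True)       # the "Global Card"
    n_windows = -(-len(frame_feats) // window)                   # ceil division
    moment_tokens = [frame_feats[i * window:(i + 1) * window].mean(axis=0)
                     for i in range(n_windows)]                  # the "Moment Cards"
    return np.concatenate([global_token, np.stack(moment_tokens)], axis=0)

feats = np.random.randn(64, 16)      # 64 frames, 16-dim features for one object
tokens = resample_trajectory(feats)
print(tokens.shape)                  # (9, 16): 1 global + 8 moment tokens
```

A 64-frame trajectory collapses from 64 tokens to 9, which is why the model can hold many objects' full histories in context at once.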

4. Why This Matters: The "Superpower"

The paper tested TraSeR on several tasks, and the results were impressive:

  • Better Detection: It found objects and relationships much better than previous open-source models (improving by 15–40%).
  • Beating the Giants: It even beat the most advanced commercial AI models (like GPT-5) at predicting what objects are doing and what they look like.
  • The "Cheat Sheet" Effect: The most exciting part? When they fed TraSeR's "story notes" (the scene graph) to another AI to answer questions, that AI got smarter.
    • Without notes: Asked "What is the man doing?", the model has to guess: "I think he's walking." (Maybe wrong.)
    • With notes: Given the scene graph entry "The man is walking while holding a coffee cup," it answers: "He is walking and holding a coffee cup." (Correct, and grounded in what was actually seen.)
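The "cheat sheet" trick amounts to serializing the scene graph into text and prepending it to the question before handing both to a question-answering model. A minimal sketch, with assumed prompt wording (the paper's actual prompt format may differ):

```python
# Hypothetical sketch of feeding scene-graph "notes" to a QA model as context.
# The prompt template and relation tuples here are illustrative only.
def graph_to_notes(relations):
    """Turn (subject, predicate, object, start, end) tuples into bullet notes."""
    return "\n".join(f"- {s} {p} {o} (frames {t0}-{t1})"
                     for s, p, o, t0, t1 in relations)

relations = [
    ("man", "walking on", "sidewalk", 0, 120),
    ("man", "holding", "coffee cup", 0, 120),
]
question = "What is the man doing?"
prompt = f"Scene notes:\n{graph_to_notes(relations)}\n\nQuestion: {question}"
print(prompt)
```

The downstream model never has to re-watch the video; it reasons over the compact, time-stamped notes instead.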

Summary

Think of SVG2 as a massive, perfectly organized library of video stories written by robots. TraSeR is the brilliant student who learned to read that library so well that it can now understand videos better than almost any other computer.

This is a huge step forward because it moves computers from just "seeing" pixels to truly understanding the dynamic, moving world around them. It's the difference between a security camera that just records footage and a security guard who can tell you exactly what happened, who did it, and why.