UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

This paper introduces UDVideoQA, a large-scale, privacy-preserving benchmark derived from real-world urban traffic footage, featuring 28K question-answer pairs for evaluating multi-object spatio-temporal reasoning. Evaluations on the benchmark reveal a significant perception-reasoning gap in current models, while fine-tuning smaller open-source models achieves performance comparable to proprietary systems.

Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi

Published 2026-02-25

Imagine you are trying to teach a robot how to drive a car through a busy city. You don't just want the robot to know what a stop sign looks like; you want it to understand why a car stopped, what the pedestrian was doing three seconds ago, and what would have happened if the light had turned green instead of red.

This paper introduces a new, massive "training gym" for these robot brains, called UDVideoQA.

Here is the breakdown of what they did, using simple analogies:

1. The Problem: The "Textbook" vs. The "Real World"

Most AI models today are like students who only study from perfect, clean textbooks. They are trained on short, curated clips where everything is clear, the lighting is perfect, and the actors know their lines.

  • The Reality: Real city traffic is messy. It's raining, people are jaywalking, cars are honking, and the lighting changes from bright noon to dark night.
  • The Gap: When you put these "textbook-trained" robots on a real street, they get confused. They might hallucinate (imagine things that aren't there) or miss simple details because they've never seen a messy intersection before.

2. The Solution: The "Urban Dynamics" Gym

The researchers built UDVideoQA, a giant dataset of 16 hours of real, unscripted traffic video from city intersections.

  • The Volume: That works out to roughly 1.7 million frames of video. That's a lot of traffic!
  • The Privacy Shield: Since they filmed real people, they had to protect their identities. Instead of just blurring faces (which can look weird and block the view), they used a special "motion-blur" technique.
    • Analogy: Imagine a painter who only paints over the parts of the canvas that are moving (like a walking person or a driving car) but leaves the static background (like the road, trees, and signs) perfectly clear. This keeps the scene looking real while protecting privacy.
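
To make the painter analogy concrete, here is a minimal Python sketch of motion-selective blurring using OpenCV background subtraction. The paper describes the idea (blur the moving foreground, keep the static background sharp); the exact pipeline and every parameter below are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of motion-selective anonymization, assuming an
# OpenCV-style background subtractor. Illustrative only -- not the
# authors' actual pipeline or parameters.
import cv2
import numpy as np

def anonymize_moving_regions(video_path: str, out_path: str) -> None:
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Foreground mask: pixels that differ from the learned static background.
        mask = subtractor.apply(frame)
        # Widen the mask so the blur fully covers each moving object.
        mask = cv2.dilate(mask, np.ones((15, 15), np.uint8))
        blurred = cv2.GaussianBlur(frame, (31, 31), 0)
        # Blur only where motion was detected; roads, trees, and signs stay sharp.
        out = np.where(mask[..., None] > 0, blurred, frame)
        if writer is None:
            h, w = out.shape[:2]
            writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), 30, (w, h))
        writer.write(out)
    cap.release()
    if writer is not None:
        writer.release()
```

Because the subtractor learns what the empty scene looks like, only moving pedestrians and vehicles get painted over, which is exactly the selective-painter behavior described above.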

3. The Test: 5 Levels of "Brain Power"

They didn't just ask simple questions like "Is there a car?" They created a hierarchy of questions to test how smart the AI really is. Think of it like a video game with five levels:

  1. Level 1: The Observer (Attribution): "What color is that car?" or "Is it raining?" (Basic facts).
  2. Level 2: The Narrator (Basic Understanding): "What is happening in this scene overall?" (Summarizing the vibe).
  3. Level 3: The Detective (Event Reasoning): "Why did the silver car brake?" (Connecting cause and effect).
  4. Level 4: The Time Traveler (Reverse Reasoning): "The pedestrian is halfway across the street. What was the traffic light doing before they stepped off the curb?" (Working backward).
  5. Level 5: The Philosopher (Counterfactual Inference): "If the light had been green, would the motorcycle have crashed?" (Testing "What if" scenarios without hallucinating).
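
To show how such a hierarchy might be organized in practice, here is a hypothetical record layout. The field names, schema, and sample answer are illustrative assumptions, not the dataset's published format; the sample question is taken from the levels above.

```python
# Hypothetical record layout for the five-level question hierarchy.
# Field names and the sample answer are illustrative assumptions,
# not the dataset's published schema.
from dataclasses import dataclass

LEVELS = [
    "attribution",               # Level 1: basic facts ("What color is that car?")
    "basic_understanding",       # Level 2: scene summary
    "event_reasoning",           # Level 3: cause and effect
    "reverse_reasoning",         # Level 4: inferring an earlier state
    "counterfactual_inference",  # Level 5: "what if" scenarios
]

@dataclass
class QARecord:
    video_id: str
    level: str       # one of LEVELS
    question: str
    answer: str

sample = QARecord(
    video_id="clip_0001",        # hypothetical identifier
    level="event_reasoning",
    question="Why did the silver car brake?",
    answer="A pedestrian stepped into the crosswalk.",  # illustrative answer
)
```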

4. The Results: The "Smart but Blind" Paradox

The researchers put 10 of the world's most advanced AI models (like Gemini, GPT-4, and Qwen) through this new gym. The results were surprising:

  • The "Big Brain" Trap: The biggest, most expensive models (like Gemini Pro) were great at the "Philosopher" level. They could guess what might happen in a hypothetical scenario. However, they were terrible at the "Observer" level. They often couldn't tell if a car was silver or grey, or if the road was wet.
    • Analogy: It's like a brilliant professor who can write a thesis on traffic theory but fails to notice that the person standing right in front of them is wearing a red hat. They are smart, but they aren't "seeing" the video.
  • The "Small but Trained" Hero: A smaller, open-source model (Qwen 2.5-VL) started out average. But when the researchers gave it a "crash course" (fine-tuning) specifically on this messy traffic data, it became a superstar. It learned to see the details and reason about them, eventually beating the giant models in low-light conditions.

5. The New Challenge: "Ask the Right Question"

The paper also introduced a side challenge called VideoQGen. Instead of answering questions, the AI has to create them.

  • The Result: The best models could ask deep, complex questions. But many others just asked boring, repetitive questions like "Is it raining?" over and over. It showed that while AI is getting better at talking, it still struggles to be truly creative and diverse.
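
As a rough illustration of how that repetition failure could be caught, here is a simple uniqueness check. This is a hypothetical metric for intuition, not necessarily the one used in the paper.

```python
# A crude diversity metric: the fraction of distinct questions among all
# generated ones. Hypothetical illustration, not the paper's metric.
def question_diversity(questions: list[str]) -> float:
    """Return 1.0 when every question is unique, lower when they repeat."""
    normalized = {q.strip().lower() for q in questions}
    return len(normalized) / max(len(questions), 1)

generated = ["Is it raining?", "Is it raining?", "Why did the bus stop?"]
print(f"{question_diversity(generated):.2f}")  # 0.67 -> heavy repetition
```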

The Big Takeaway

This paper tells us that to make AI truly safe and useful for real-world tasks (like self-driving cars or city monitoring), we can't just feed them more data. We need to teach them to look closely (grounding) before they try to think deeply (reasoning).

UDVideoQA is the new standard tool to help developers fix this "blindness" and build AI that doesn't just guess, but actually sees and understands the chaotic, beautiful mess of our cities.
