Can Vision-Language Models Solve the Shell Game?

This paper introduces VET-Bench, a diagnostic benchmark revealing that current Vision-Language Models fail at tracking visually identical objects due to an over-reliance on static features, and proposes Spatiotemporal Grounded Chain-of-Thought (SGCoT) to achieve over 90% accuracy by explicitly generating object trajectories as intermediate reasoning steps.

Tiedong Liu, Wee Sun Lee

Published Tue, 10 Ma

Imagine you are watching a street magician perform the classic "Shell Game." He places a ball under one of three identical cups, shuffles them around quickly, and asks, "Where is the ball?"

For a human, this is easy. We instinctively follow the ball's journey, keeping a mental note of its location as it moves. But for today's most advanced AI models (Vision-Language Models, or VLMs), this simple trick is a nightmare. They are essentially guessing.

This paper, titled "Can Vision-Language Models Solve the Shell Game?", investigates why AI fails at this task and how the researchers fixed it.

Here is the breakdown in simple terms:

1. The Problem: AI is Cheating (and Getting Caught)

The researchers found that previous tests for AI video understanding were "rigged" in a way that let the AI cheat.

  • The Cheat: In many old videos, the cups weren't actually identical. One might be slightly shiny, or the ball might peek out for a split second. The AI didn't need to track the ball; it just needed to look at the final frame, spot the unique cup, and guess.
  • The Reality Check: The researchers built a new, stricter test called VET-Bench. They used perfectly identical cups and balls, ensuring the only way to solve the puzzle was to watch the movement from start to finish.
  • The Result: When top-tier models (like Gemini-3 and Qwen) were run on this strict test, they performed no better than random guessing. They were essentially picking a cup at random.

2. The Theory: Why AI Struggles

The authors asked: Why can't these super-smart computers do this?

They compared the task to a complex puzzle that requires holding a changing state in your head.

  • The Analogy: Imagine trying to remember a phone number while someone keeps swapping the digits around in front of you. If you only look at the final result, you can't know the original order. You have to track every single swap.
  • The Limit: The paper proves mathematically that current AI models (which use a specific architecture called "Transformers") are like people with very short-term memory. They are great at looking at a single photo, but they struggle to connect the dots across a long sequence of events without help. They lack the "mental muscle" to track indistinguishable objects over time.
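To make the "track every single swap" requirement concrete, here is a minimal sketch in Python (my own illustration, not code from the paper). With identical cups, the final frame carries no information at all; the ball's location is determined only by replaying the full swap sequence, which is exactly the state-tracking the models lack:

```python
import random

def shell_game(num_cups=3, num_swaps=10, seed=0):
    """Simulate the shell game: the ball starts under cup 0,
    then random pairs of cups swap positions."""
    rng = random.Random(seed)
    ball = 0  # index of the cup currently hiding the ball
    swaps = []
    for _ in range(num_swaps):
        a, b = rng.sample(range(num_cups), 2)
        swaps.append((a, b))
        # State update: if the ball's cup moves, its index changes.
        if ball == a:
            ball = b
        elif ball == b:
            ball = a
    return swaps, ball

swaps, final_position = shell_game()
```

Drop or reorder any one swap involving the ball and `final_position` changes, which is why a model that only inspects the last frame can do no better than chance.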

3. The Solution: Teaching AI to "Talk Through" the Problem

The researchers realized that asking the AI, "Where is the ball?" and expecting a direct answer was too hard. It's like asking a student to solve a math problem without showing their work.

They introduced a new method called SGCoT (Spatiotemporal Grounded Chain-of-Thought).

  • The Metaphor: Instead of just giving the answer, they forced the AI to act like a security guard with a clipboard.
    • Old Way: AI looks at the video and guesses "Left Cup." (Result: Wrong).
    • New Way (SGCoT): The AI is trained to first write down a step-by-step log: "At 0:01, the ball is under the middle cup. At 0:02, the middle cup swaps with the left cup. Now the ball is under the left cup..."
  • By forcing the AI to generate this "tracking log" (the Chain of Thought) before giving the final answer, it creates a bridge between the visual motion and the logic.
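The "security guard with a clipboard" idea can be sketched in a few lines of Python (an illustration of the replay logic, not the paper's actual implementation). Once the model has written out a log of timestamped swap events, the final answer follows mechanically:

```python
def track_from_log(initial_cup, events):
    """Replay an SGCoT-style tracking log: start from the cup
    hiding the ball, apply each recorded swap in order."""
    ball = initial_cup
    for timestamp, (a, b) in events:
        if ball == a:
            ball = b
        elif ball == b:
            ball = a
        # Each iteration is one "clipboard entry":
        # "At {timestamp}, the ball is under cup {ball}."
    return ball

# Mirroring the log in the example above (cups numbered 0=left,
# 1=middle, 2=right): ball starts under the middle cup, then the
# middle cup swaps with the left cup, then left swaps with right.
events = [("0:02", (1, 0)), ("0:03", (0, 2))]
answer = track_from_log(initial_cup=1, events=events)
# answer == 2: the ball ends under the right cup
```

The point of SGCoT is that the hard part, producing a faithful `events` log from raw video, is delegated to the model's generated chain of thought; once that log exists, the answer is a trivial deduction.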

4. The Result: From Guessing to Mastering

By training an AI model (Molmo2) to generate these detailed tracking logs, the results were dramatic:

  • Before: The AI got it right about 33% of the time (random guessing).
  • After: The AI got it right over 90% of the time.

The Big Picture

This paper teaches us two important lessons:

  1. Current AI is "Myopic": Today's video models are very good at recognizing what things are (a cup, a ball) but very bad at tracking where those things go over time. They rely on visual shortcuts rather than true tracking.
  2. Thinking Helps Seeing: Just like humans, AI needs to "think out loud" (generate intermediate steps) to solve complex tracking problems. If you force the AI to break the problem down into small, logical steps, it can master tasks that previously seemed impossible.

In short: The AI didn't get smarter at seeing; it got smarter at keeping track. And that's a huge step forward for robots that need to navigate the real world.