Can Vision-Language Models Solve the Shell Game?

This paper introduces VET-Bench, a diagnostic benchmark revealing that current Vision-Language Models fail at tracking visually identical objects due to an over-reliance on static features, and proposes Spatiotemporal Grounded Chain-of-Thought (SGCoT) to achieve over 90% accuracy by explicitly generating object trajectories as intermediate reasoning steps.

Tiedong Liu, Wee Sun Lee

Published Tue, 10 Ma

Imagine you are watching a street magician perform the classic "Shell Game." He places a ball under one of three identical cups, shuffles them around quickly, and asks, "Where is the ball?"

For a human, this is easy. We instinctively follow the ball's journey, keeping a mental note of its location as it moves. But for today's most advanced AI models (Vision-Language Models, or VLMs), this simple trick is a nightmare. They are essentially guessing.

This paper, titled "Can Vision-Language Models Solve the Shell Game?", investigates why AI fails at this task and how the researchers fixed it.

Here is the breakdown in simple terms:

1. The Problem: AI is Cheating (and Getting Caught)

The researchers found that previous tests for AI video understanding were "rigged" in a way that let the AI cheat.

  • The Cheat: In many old videos, the cups weren't actually identical. One might be slightly shiny, or the ball might peek out for a split second. The AI didn't need to track the ball; it just needed to look at the final frame, spot the unique cup, and guess.
  • The Reality Check: The researchers built a new, stricter test called VET-Bench. They used perfectly identical cups and balls, ensuring the only way to solve the puzzle was to watch the movement from start to finish.
  • The Result: When top-tier models (like Gemini-3 and Qwen) were run on this strict test, they performed no better than random guessing. They were essentially picking a cup at random.

2. The Theory: Why AI Struggles

The authors asked: Why can't these super-smart computers do this?

They compared the task to a complex puzzle that requires holding a changing state in your head.

  • The Analogy: Imagine trying to remember a phone number while someone keeps swapping the digits around in front of you. If you only look at the final result, you can't know the original order. You have to track every single swap.
  • The Limit: The paper proves mathematically that current AI models (which use a specific architecture called "Transformers") are like people with very short-term memory. They are great at looking at a single photo, but they struggle to connect the dots across a long sequence of events without help. They lack the "mental muscle" to track indistinguishable objects over time.
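To make the "track every single swap" requirement concrete, here is a minimal sketch in Python (my own illustration, not code from the paper). With identical cups, the final frame carries no information at all; the ball's location is determined only by replaying the full swap sequence, which is exactly the state-tracking the models lack:

```python
import random

def shell_game(num_cups=3, num_swaps=10, seed=0):
    """Simulate the shell game: the ball starts under cup 0,
    then random pairs of cups swap positions."""
    rng = random.Random(seed)
    ball = 0  # index of the cup currently hiding the ball
    swaps = []
    for _ in range(num_swaps):
        a, b = rng.sample(range(num_cups), 2)
        swaps.append((a, b))
        # State update: if the ball's cup moves, its index changes.
        if ball == a:
            ball = b
        elif ball == b:
            ball = a
    return swaps, ball

swaps, final_position = shell_game()
```

Drop or reorder any one swap involving the ball and `final_position` changes, which is why a model that only inspects the last frame can do no better than chance.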

3. The Solution: Teaching AI to "Talk Through" the Problem

The researchers realized that asking the AI, "Where is the ball?" and expecting a direct answer was too hard. It's like asking a student to solve a math problem without showing their work.

They introduced a new method called SGCoT (Spatiotemporal Grounded Chain-of-Thought).

  • The Metaphor: Instead of just giving the answer, they forced the AI to act like a security guard with a clipboard.
    • Old Way: AI looks at the video and guesses "Left Cup." (Result: Wrong).
    • New Way (SGCoT): The AI is trained to first write down a step-by-step log: "At 0:01, the ball is under the middle cup. At 0:02, the middle cup swaps with the left cup. Now the ball is under the left cup..."
  • By forcing the AI to generate this "tracking log" (the Chain of Thought) before giving the final answer, it creates a bridge between the visual motion and the logic.
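The "security guard with a clipboard" idea can be sketched in a few lines of Python (an illustration of the replay logic, not the paper's actual implementation). Once the model has written out a log of timestamped swap events, the final answer follows mechanically:

```python
def track_from_log(initial_cup, events):
    """Replay an SGCoT-style tracking log: start from the cup
    hiding the ball, apply each recorded swap in order."""
    ball = initial_cup
    for timestamp, (a, b) in events:
        if ball == a:
            ball = b
        elif ball == b:
            ball = a
        # Each iteration is one "clipboard entry":
        # "At {timestamp}, the ball is under cup {ball}."
    return ball

# Mirroring the log in the example above (cups numbered 0=left,
# 1=middle, 2=right): ball starts under the middle cup, then the
# middle cup swaps with the left cup, then left swaps with right.
events = [("0:02", (1, 0)), ("0:03", (0, 2))]
answer = track_from_log(initial_cup=1, events=events)
# answer == 2: the ball ends under the right cup
```

The point of SGCoT is that the hard part, producing a faithful `events` log from raw video, is delegated to the model's generated chain of thought; once that log exists, the answer is a trivial deduction.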

4. The Result: From Guessing to Mastering

By training an AI model (Molmo2) to generate these detailed tracking logs, the results were dramatic:

  • Before: The AI got it right about 33% of the time (random guessing).
  • After: The AI got it right over 90% of the time.

The Big Picture

This paper teaches us two important lessons:

  1. Current AI is "Myopic": Today's video models are very good at recognizing what things are (a cup, a ball) but very bad at tracking where those things go over time. They rely on visual shortcuts rather than true tracking.
  2. Thinking Helps Seeing: Just like humans, AI needs to "think out loud" (generate intermediate steps) to solve complex tracking problems. If you force the AI to break the problem down into small, logical steps, it can master tasks that previously seemed impossible.

In short: The AI didn't get smarter at seeing; it got smarter at keeping track. And that's a huge step forward for robots that need to navigate the real world.