Imagine you are teaching a brilliant but slightly clumsy robot to play video games. This robot has a super-powered brain (a Vision-Language Model) that can describe what it sees in great detail, but it's terrible at actually doing things based on what it sees. It's like a person who can write a beautiful poem about a soccer ball but can't kick it into the goal.
The paper "See, Symbolize, Act" asks a simple question: If we give this robot a cheat sheet with exact numbers (like "the ball is at X, Y"), will it play better?
Here is the breakdown of their findings using some everyday analogies.
1. The Four Ways to Play
The researchers tested the robots (AI models like Claude, GPT-4o, and Gemini) using four different "playbooks":
- The "Blind" Player (Frame Only): The robot just looks at the screen. It has to guess where everything is.
  - Analogy: Trying to drive a car while wearing foggy glasses. You can see shapes, but you can't be sure where the curb is.
- The "Cheat Sheet" Player (Ground Truth Symbols): The robot gets the screen plus a perfect list of coordinates pulled from the game's internal memory.
  - Analogy: Driving with foggy glasses, but with a GPS voice in your ear screaming, "Turn left in 10 feet! Obstacle at 5 feet!"
- The "Self-Describer" Player (Self-Extracted Symbols): The robot looks at the screen, tries to write down its own list of coordinates, and then uses that list to play.
  - Analogy: You look at the road, squint, write down "Car is 20 feet away," and then drive based on your own shaky handwriting.
- The "Math-Only" Player (Symbol Only): The robot gets only the list of numbers, with no screen at all.
  - Analogy: Driving a car blindfolded, relying only on the GPS voice.
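The four playbooks differ only in which inputs the model receives. Here is a minimal sketch of that setup in Python; the field names (`frame`, `symbols`) and the `describe` stub are hypothetical placeholders, not the paper's actual harness:

```python
def describe(frame):
    """Stub standing in for the model writing down its own coordinates
    (hypothetical -- in the paper, the VLM itself does this step)."""
    return [{"name": "object", "x": 0, "y": 0}]  # placeholder guess

def build_input(condition, frame=None, symbols=None):
    """Assemble the observation the model sees under each playbook."""
    if condition == "frame_only":       # the "Blind" player
        return {"frame": frame}
    if condition == "ground_truth":     # the "Cheat Sheet" player
        return {"frame": frame, "symbols": symbols}
    if condition == "self_extracted":   # the "Self-Describer" player
        return {"frame": frame, "symbols": describe(frame)}
    if condition == "symbol_only":      # the "Math-Only" player
        return {"symbols": symbols}
    raise ValueError(f"unknown condition: {condition}")
```

The key design point: only the `ground_truth` condition gives the model both the picture and perfect numbers; every other condition drops one or produces the numbers itself.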
2. The Big Discovery: "Garbage In, Garbage Out"
The most important finding is that symbols are only helpful if they are accurate.
- When the robot is good at seeing (like Claude-4-Sonnet):
  - If it looks at the screen and writes down the coordinates correctly, giving it that list makes it a champion. It plays almost as well as if it had the perfect cheat sheet.
  - Analogy: If your GPS is accurate, it helps you drive perfectly.
- When the robot is bad at seeing (like GPT-4o and Gemini in complex games):
  - If the robot looks at the screen, gets confused, and writes down the wrong coordinates (e.g., "The alien is at the top" when it's actually at the bottom), giving it that list hurts its performance. It trusts the wrong numbers and crashes.
  - Analogy: If your GPS is broken and tells you to drive off a cliff, you are better off ignoring it and just looking out the window.
3. The "Blindfold" Problem
The researchers tried the "Math-Only" approach (giving the robot numbers but no picture).
- Result: The robots failed miserably.
- Why? Even with perfect numbers, the robot didn't know what the numbers meant. It didn't know if the "ball" was moving fast or slow, or if the "paddle" was broken.
- Analogy: Knowing the coordinates of a fire doesn't help you put it out if you can't see the fire to know which way the wind is blowing. You need the picture to understand the numbers.
4. The Resolution Fix
They also tested if making the picture bigger helped.
- Finding: Yes! When they zoomed in and made the game screen much higher resolution, the robots got much better at writing down the correct coordinates.
- Analogy: It's the difference between trying to read a tiny, blurry street sign from 100 yards away versus walking right up to it. If the sign is clear, you can read it; if it's blurry, you'll guess wrong.
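"Making the picture bigger" is just upscaling the frame before the model ever sees it. Here is a toy nearest-neighbor version on a 2D pixel grid; a real pipeline would use a proper image library, and the 2x factor is illustrative, not the paper's setting:

```python
def upscale(pixels, factor=2):
    """Nearest-neighbor upscale: each pixel becomes a factor x factor block."""
    out = []
    for row in pixels:
        big_row = [p for p in row for _ in range(factor)]  # stretch horizontally
        for _ in range(factor):                            # stretch vertically
            out.append(list(big_row))
    return out

tiny = [[0, 1],
        [1, 0]]
print(upscale(tiny))
# [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
```

Nearest-neighbor adds no new information; it just gives the model more pixels per object, which is exactly why it helps a model that was squinting at a tiny sprite.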
5. The Noise Experiment
They simulated "noise" by intentionally corrupting the coordinates (for example, telling the robot the ball is 20 pixels to the left of where it actually is).
- Finding: Even a tiny bit of error made the robots perform much worse.
- Analogy: If you are trying to thread a needle and your guide tells you the eye is 1 millimeter to the left of where it really is, you will miss every time. Precision matters.
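The noise experiment amounts to nudging the ground-truth coordinates before handing them over. A sketch of that perturbation; the +/-20-pixel range mirrors the example above, and the exact scheme (uniform offsets on x and y) is an assumption:

```python
import random

def perturb(symbols, max_offset=20, seed=0):
    """Shift each object's (x, y) by a random offset within +/-max_offset pixels."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    noisy = []
    for obj in symbols:
        noisy.append({
            **obj,  # keep name and any other fields unchanged
            "x": obj["x"] + rng.randint(-max_offset, max_offset),
            "y": obj["y"] + rng.randint(-max_offset, max_offset),
        })
    return noisy

clean = [{"name": "ball", "x": 84, "y": 120}]
print(perturb(clean))  # same ball, coordinates nudged off target
```

With `max_offset=0` this returns the clean list unchanged, which is the control condition; the finding is that performance degrades quickly as `max_offset` grows.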
The Bottom Line
The paper concludes that Vision-Language Models are currently stuck in a "Perception Bottleneck."
Giving an AI a list of facts (symbols) is a great idea in theory, but in practice, it only works if the AI is smart enough to read the world correctly first.
- If the AI can see clearly and write down the facts accurately? Symbols make it a superhero.
- If the AI is confused and writes down the wrong facts? Symbols make it a disaster.
The takeaway for the future: Before we can build perfect AI agents that use "cheat sheets" to play games or drive cars, we first need to fix their eyesight so they can read the world accurately.