Imagine you are teaching a robot to solve a sliding tile puzzle (like the classic 15-puzzle or 8-puzzle). In the old days, you might have given the robot a digital map of where every tile was. But in the real world, robots don't get maps; they get cameras. They have to look at a picture, figure out where the pieces are, and then decide how to move them.
This paper introduces a new "gym" (a training playground) called SPGym to test exactly how good robots are at this visual task.
Here is the breakdown of what they did, why it matters, and what they found, using some everyday analogies.
1. The Problem: The "Blindfolded Chef"
Imagine you are teaching a chef to cook a specific dish.
- Old Benchmarks: You give the chef the recipe, the ingredients, and the exact temperature. If the dish tastes bad, you don't know if it's because the chef can't read the recipe (bad representation) or because they can't control the stove (bad policy).
- The Issue: In AI, we often test robots on complex games (like Atari). If the robot fails, we don't know if it's failing because it can't see the game well, or because it can't plan its moves. It's a messy mix.
2. The Solution: The "Sliding Puzzles Gym" (SPGym)
The authors built a special training ground that isolates the "seeing" part from the "thinking" part.
- The Setup: They took a standard sliding puzzle. Instead of numbered tiles (1, 2, 3), they replaced them with random pictures (like a cat, a car, a flower).
- The Twist: The rules of the game never change. The tiles always slide the same way. The only thing that changes is how many different pictures the robot has to deal with.
- Level 1: The robot only sees a picture of a cat on every tile. (Easy to learn the pattern).
- Level 100: The robot sees a cat, a car, a flower, a dog, a toaster, a cloud... and 94 other random images, 100 in total. Every time it plays, the tiles are made of different pictures.
The Analogy: Imagine you are learning to drive.
- Level 1: You only drive on a straight road with a blue sky.
- Level 100: You drive on the same straight road, but the sky changes color, the trees change shape, and the road texture changes every single second.
- The Goal: Can you still drive straight? If you crash, it's not because the road is harder; it's because your eyes can't process the changing scenery fast enough.
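To make the setup concrete, here is a minimal sketch of the idea behind SPGym. The class and method names are hypothetical (the real library's API will differ): a 3x3 sliding puzzle whose tiles show patches of an image drawn from a pool, where the sliding rules are fixed and only the size of the image pool changes.

```python
# Hypothetical sketch of an SPGym-style environment (not the real API):
# fixed puzzle dynamics, variable visual pool.
import random

import numpy as np


class ImageSlidingPuzzle:
    """3x3 sliding puzzle; each tile shows a patch of one image from a pool."""

    def __init__(self, image_pool, seed=0):
        self.image_pool = image_pool  # list of HxWx3 uint8 arrays
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Pick a new image each episode: the larger the pool, the more
        # visual variety the agent must cope with.
        self.image = self.rng.choice(self.image_pool)
        self.tiles = list(range(9))  # tile 8 is the blank
        # (A real environment would guarantee a solvable shuffle.)
        self.rng.shuffle(self.tiles)
        return self._observe()

    def _observe(self):
        # Render the board by rearranging patches of the chosen image.
        h, w, _ = self.image.shape
        ph, pw = h // 3, w // 3
        obs = np.zeros_like(self.image)
        for pos, tile in enumerate(self.tiles):
            if tile == 8:
                continue  # blank tile stays black
            r, c = divmod(pos, 3)
            sr, sc = divmod(tile, 3)
            obs[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw] = \
                self.image[sr * ph:(sr + 1) * ph, sc * pw:(sc + 1) * pw]
        return obs

    def step(self, action):
        # Actions 0-3 try to slide the tile above/below/left/right of the
        # blank into the blank; out-of-bounds moves are no-ops.
        blank = self.tiles.index(8)
        r, c = divmod(blank, 3)
        nr, nc = {0: (r - 1, c), 1: (r + 1, c),
                  2: (r, c - 1), 3: (r, c + 1)}[action]
        if 0 <= nr < 3 and 0 <= nc < 3:
            other = nr * 3 + nc
            self.tiles[blank], self.tiles[other] = \
                self.tiles[other], self.tiles[blank]
        done = self.tiles == list(range(9))
        return self._observe(), float(done), done
```

The point of the design: the transition function (`step`) never touches the image pool, so any drop in performance as the pool grows can only come from the "seeing" side, not the "thinking" side.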
3. What They Tested
They took the smartest AI "students" (algorithms like SAC, PPO, and DreamerV3) and put them through this gym. They wanted to see:
- Sample Efficiency: How many tries does it take to learn?
- Generalization: If the robot learns with 5 pictures, can it handle 50? Can it handle a completely new picture it has never seen before?
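The generalization question boils down to a simple protocol, sketched below with hypothetical helper names (not the paper's actual evaluation code): split the image pool into a training set and a held-out set, then compare success rates on each.

```python
# Hypothetical sketch of the train/held-out evaluation split.

def split_pool(images, n_train):
    """Split an image pool into a training set and a held-out set."""
    return images[:n_train], images[n_train:]


def success_rate(episode_outcomes):
    """Fraction of evaluation episodes that ended in a solved puzzle."""
    return sum(episode_outcomes) / len(episode_outcomes)


# E.g. train on 5 images, hold out the other 95 for the "never seen" test.
train_pool, heldout_pool = split_pool(list(range(100)), n_train=5)
assert len(train_pool) == 5 and len(heldout_pool) == 95

print(success_rate([True, True, False, True]))  # 0.75
```

The paper's headline result, restated in these terms: success on `train_pool` can look high while success on `heldout_pool` collapses to near zero for most methods.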
4. The Shocking Results
The results were a bit like a reality check for the AI world.
The "Memorizers" vs. The "Understanders":
Most of the advanced AI methods failed when the number of pictures increased. It turned out they weren't really "learning" to solve the puzzle visually. They were memorizing specific patterns.
- Analogy: Imagine a student who memorizes the answer key for a math test with 5 questions. If you give them a test with 50 questions, they fail. They didn't learn math; they just learned those 5 answers.
The Simple Wins:
Surprisingly, the simplest method, data augmentation (basically, showing the robot slightly altered versions of the same picture, like turning it black-and-white or flipping the colors), worked better than the fancy, complex methods.
- Analogy: It's like telling the student, "Don't just memorize the answer; practice with the lights dimmed and the paper upside down." This forces them to understand the shape of the problem, not just the specific numbers.
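Observation-level augmentation of this kind is easy to sketch. The transforms below (grayscale, color inversion) match the examples given above, but the paper's exact set of augmentations may differ:

```python
# Sketch of simple observation augmentations; the paper's exact
# transforms may differ.
import random

import numpy as np


def to_grayscale(obs):
    """Replace color with per-pixel luminance, broadcast back to 3 channels."""
    gray = obs.mean(axis=-1, keepdims=True)
    return np.repeat(gray, 3, axis=-1).astype(obs.dtype)


def invert_colors(obs):
    """Flip every channel: 255 becomes 0 and vice versa."""
    return 255 - obs


def augment(obs):
    """Randomly apply one transform (or none) before the agent sees the frame."""
    transform = random.choice([to_grayscale, invert_colors, lambda x: x])
    return transform(obs)
```

Because the tile layout survives every transform while the surface colors do not, the agent is pushed to key on the puzzle's structure rather than on specific pixel values.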
The "Hard" Test (The Ultimate Fail):
When they tested the robots on pictures they had never seen before (e.g., trained on cats, tested on a picture of a toaster), almost all of them failed completely (0% success).
- The Takeaway: The robots didn't actually understand the puzzle. They just memorized the specific images they saw during training. They couldn't transfer that knowledge to a new visual world.
The Star Performer:
One algorithm, DreamerV3, did the best. It builds a "world model" (it tries to predict what will happen next). It was the only one that could handle a decent amount of visual variety without completely falling apart.
5. Why This Matters
This paper is a wake-up call. It tells us that while AI is getting better at playing games, it is still terrible at generalizing what it sees.
- Current AI: "I know how to solve this puzzle because I saw these specific pictures 1,000 times."
- What We Need: "I know how to solve this puzzle because I understand how tiles slide, regardless of what picture is on them."
The Bottom Line
The authors created a "stress test" for robot eyes. They found that current AI is fragile. If you change the visual environment too much, the robot forgets how to think. To build truly intelligent robots that can work in the messy, changing real world, we need to teach them to understand the structure of the world, not just memorize the pictures of it.
In short: We are teaching robots to drive, but right now, they only know how to drive on one specific street with one specific weather pattern. SPGym is the tool we need to teach them how to drive in a blizzard, a sandstorm, and a neon-lit city all at once.