Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning

This paper introduces OC-STORM, an object-centric model-based reinforcement learning framework that leverages few-shot annotated frames and pretrained segmentation to improve sample efficiency and dynamics prediction in complex visual environments, outperforming existing baselines on Atari 100k and Hollow Knight benchmarks.

Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey, Gang Wang

Published 2026-02-26

Imagine you are teaching a robot to play a video game, like Hollow Knight or an old-school Atari game.

The Problem: The Robot is "Blind" to What Matters

Most modern AI robots learn by staring at the screen and trying to guess what will happen next. They are like students trying to learn a subject by reading the entire textbook, including the massive, boring background chapters, just to find the one crucial formula on page 50.

In video games, the screen is full of things: the sky, the ground, the clouds, and the enemies.

  • The Old Way: The robot tries to recreate the entire picture perfectly. It spends 99% of its brainpower remembering the color of the clouds and the texture of the dirt.
  • The Result: It forgets the important stuff. It doesn't notice the tiny boss monster jumping at it because the boss is small compared to the huge background. It learns slowly and makes mistakes because it's overwhelmed by "visual noise."

The Solution: OC-STORM (The "Spotlight" Robot)

The authors of this paper created a new method called OC-STORM. Think of it as giving the robot a magic spotlight and a smart assistant.

Instead of trying to memorize the whole screen, the robot is taught to ignore the background and focus only on the "actors" in the scene: the player character, the enemies, and (in Atari) the ball.

Here is how it works, using a simple analogy:

1. The "Few-Shot" Introduction (The Cheat Sheet)

Usually, teaching a robot to recognize objects requires showing it thousands of labeled pictures. That's like hiring a teacher to spend years drawing circles around every car in a photo album.

OC-STORM is different. It uses a "cheat sheet." You only need to show the robot 6 to 12 annotated frames (just a handful of screenshots) and say, "Hey, that is the player, and that is the boss."

  • The Magic: The robot uses a pre-trained "vision assistant" (like a super-smart camera app we already have) to instantly recognize these objects in every future frame, even if they move, change size, or get hit. It doesn't need to relearn what a "boss" looks like; it just remembers the cheat sheet.
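The idea above can be sketched in a few lines. This is a toy illustration, not the paper's actual pipeline: assume a pretrained vision model has already produced an embedding vector for each candidate object mask in a new frame, and for each object in the handful of annotated frames. The function name `label_masks` and the data shapes are hypothetical; matching here is simple nearest-neighbor by cosine similarity.

```python
import numpy as np

def label_masks(mask_features, exemplar_features, exemplar_labels):
    """Assign each detected mask the label of its nearest few-shot exemplar.

    mask_features:     (N, D) embeddings of candidate masks in a new frame
    exemplar_features: (K, D) embeddings from the few annotated frames
    exemplar_labels:   list of K labels such as "player" or "boss"
    """
    # Normalize so the dot product becomes cosine similarity.
    m = mask_features / np.linalg.norm(mask_features, axis=1, keepdims=True)
    e = exemplar_features / np.linalg.norm(exemplar_features, axis=1, keepdims=True)
    sim = m @ e.T                 # (N, K) similarity of every mask to every exemplar
    nearest = sim.argmax(axis=1)  # index of the best-matching exemplar per mask
    return [exemplar_labels[i] for i in nearest]

# Toy embeddings: two exemplars, two new masks.
exemplars = np.array([[1.0, 0.0],   # "player" exemplar
                      [0.0, 1.0]])  # "boss" exemplar
masks = np.array([[0.9, 0.1],       # looks like the player
                  [0.1, 0.8]])      # looks like the boss
labels = label_masks(masks, exemplars, ["player", "boss"])
```

Because the heavy lifting (producing good embeddings) is done by a frozen pretrained model, the "cheat sheet" never needs retraining: new frames are labeled by comparison alone.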

2. The "Imagination" Engine (The Simulator)

Once the robot knows who the important characters are, it builds a mental model of the game.

  • Old Robot: "If I move left, the whole screen shifts left, including the clouds and the dirt." (Too much data to process).
  • OC-STORM: "If I move left, the player moves left, and the boss might jump. The clouds don't matter."

It creates a simplified, fast-forward simulation in its head. It predicts: "If I attack now, the boss will dodge, and I will take damage." Because it's only tracking the important actors, it can simulate thousands of scenarios in the time it takes the old robot to simulate one.
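A minimal sketch of this "imagination" loop, under toy assumptions: the world state is just a short list of object positions (player and boss), and a hand-written `predict_next` stands in for the learned dynamics model. The point is that the rollout loop touches only these few object states, never a pixel.

```python
def predict_next(objects, action):
    """Toy dynamics model: predicts how the tracked objects move.
    The real model is learned, but the interface is the same idea:
    object states in, next object states out. Background is never modeled."""
    player, boss = objects
    dx = {"left": -1, "right": 1, "attack": 0}[action]
    player = (player[0] + dx, player[1])
    boss = (boss[0], boss[1])  # boss stands still in this toy version
    return [player, boss]

def rollout(objects, actions):
    """Imagine a whole trajectory in the model's head, one action at a time."""
    trajectory = [objects]
    for a in actions:
        objects = predict_next(objects, a)
        trajectory.append(objects)
    return trajectory

# Imagine three moves starting with the player at (0, 0) and the boss at (5, 0).
traj = rollout([(0, 0), (5, 0)], ["left", "left", "right"])
```

Simulating a step costs a handful of arithmetic operations on object states instead of rendering a full image, which is why the model can try out many imagined futures cheaply.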

3. The Result: Learning at Light Speed

Because the robot isn't wasting energy on the background, it learns much faster.

  • In the Atari 100k benchmark: It learned to play better than previous methods using only 100,000 interactions (a tiny amount of data).
  • In Hollow Knight: This is a hard game with complex, messy graphics. Older methods failed to beat the bosses because they got confused by the visual chaos. OC-STORM, by focusing only on the fighters, learned to dodge and attack effectively, beating the bosses with far fewer attempts than previous methods.

The Big Picture

Think of the old method as trying to learn to drive a car by memorizing the color of every tree you pass.
OC-STORM is like putting on a pair of glasses that highlights only the road, the other cars, and the traffic lights. It ignores the trees.

By focusing on the objects that matter (the "actors") rather than the pixels that don't (the "background"), this new method allows AI to learn complex tasks with very little practice, making it a huge step forward for real-world applications like self-driving cars or robot assistants.
