Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning

This paper introduces OC-STORM, an object-centric model-based reinforcement learning framework that leverages few-shot annotated frames and pretrained segmentation to improve sample efficiency and dynamics prediction in complex visual environments, outperforming existing baselines on Atari 100k and Hollow Knight benchmarks.

Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey, Gang Wang

Published 2026-02-26

Imagine you are teaching a robot to play a video game, like Hollow Knight or an old-school Atari game.

The Problem: The Robot is "Blind" to What Matters

Most modern AI robots learn by staring at the screen and trying to guess what will happen next. They are like students trying to learn a subject by reading the entire textbook, including the massive, boring background chapters, just to find the one crucial formula on page 50.

In video games, the screen is full of things: the sky, the ground, the clouds, and the enemies.

  • The Old Way: The robot tries to recreate the entire picture perfectly. It spends 99% of its brainpower remembering the color of the clouds and the texture of the dirt.
  • The Result: It forgets the important stuff. It doesn't notice the tiny boss monster jumping at it because the boss is small compared to the huge background. It learns slowly and makes mistakes because it's overwhelmed by "visual noise."

The Solution: OC-STORM (The "Spotlight" Robot)

The authors of this paper created a new method called OC-STORM. Think of it as giving the robot a magic spotlight and a smart assistant.

Instead of trying to memorize the whole screen, the robot is taught to ignore the background and focus only on the "actors" in the scene: the player character, the enemies, and (in Atari) the ball.

Here is how it works, using a simple analogy:

1. The "Few-Shot" Introduction (The Cheat Sheet)

Usually, teaching a robot to recognize objects requires showing it thousands of labeled pictures. That's like hiring a teacher to spend years drawing circles around every car in a photo album.

OC-STORM is different. It uses a "cheat sheet." You only need to show the robot 6 to 12 annotated frames (just a handful of screenshots) and say, "Hey, that is the player, and that is the boss."

  • The Magic: The robot uses a pre-trained "vision assistant" (like a super-smart camera app we already have) to instantly recognize these objects in every future frame, even if they move, change size, or get hit. It doesn't need to relearn what a "boss" looks like; it just remembers the cheat sheet.
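The idea above can be sketched in a few lines. This is a toy illustration, not the paper's actual pipeline: assume a pretrained vision model has already produced an embedding vector for each candidate object mask in a new frame, and for each object in the handful of annotated frames. The function name `label_masks` and the data shapes are hypothetical; matching here is simple nearest-neighbor by cosine similarity.

```python
import numpy as np

def label_masks(mask_features, exemplar_features, exemplar_labels):
    """Assign each detected mask the label of its nearest few-shot exemplar.

    mask_features:     (N, D) embeddings of candidate masks in a new frame
    exemplar_features: (K, D) embeddings from the few annotated frames
    exemplar_labels:   list of K labels such as "player" or "boss"
    """
    # Normalize so the dot product becomes cosine similarity.
    m = mask_features / np.linalg.norm(mask_features, axis=1, keepdims=True)
    e = exemplar_features / np.linalg.norm(exemplar_features, axis=1, keepdims=True)
    sim = m @ e.T                 # (N, K) similarity of every mask to every exemplar
    nearest = sim.argmax(axis=1)  # index of the best-matching exemplar per mask
    return [exemplar_labels[i] for i in nearest]

# Toy embeddings: two exemplars, two new masks.
exemplars = np.array([[1.0, 0.0],   # "player" exemplar
                      [0.0, 1.0]])  # "boss" exemplar
masks = np.array([[0.9, 0.1],       # looks like the player
                  [0.1, 0.8]])      # looks like the boss
labels = label_masks(masks, exemplars, ["player", "boss"])
```

Because the heavy lifting (producing good embeddings) is done by a frozen pretrained model, the "cheat sheet" never needs retraining: new frames are labeled by comparison alone.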

2. The "Imagination" Engine (The Simulator)

Once the robot knows who the important characters are, it builds a mental model of the game.

  • Old Robot: "If I move left, the whole screen shifts left, including the clouds and the dirt." (Too much data to process).
  • OC-STORM: "If I move left, the player moves left, and the boss might jump. The clouds don't matter."

It creates a simplified, fast-forward simulation in its head. It predicts: "If I attack now, the boss will dodge, and I will take damage." Because it's only tracking the important actors, it can simulate thousands of scenarios in the time it takes the old robot to simulate one.
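A minimal sketch of this "imagination" loop, under toy assumptions: the world state is just a short list of object positions (player and boss), and a hand-written `predict_next` stands in for the learned dynamics model. The point is that the rollout loop touches only these few object states, never a pixel.

```python
def predict_next(objects, action):
    """Toy dynamics model: predicts how the tracked objects move.
    The real model is learned, but the interface is the same idea:
    object states in, next object states out. Background is never modeled."""
    player, boss = objects
    dx = {"left": -1, "right": 1, "attack": 0}[action]
    player = (player[0] + dx, player[1])
    boss = (boss[0], boss[1])  # boss stands still in this toy version
    return [player, boss]

def rollout(objects, actions):
    """Imagine a whole trajectory in the model's head, one action at a time."""
    trajectory = [objects]
    for a in actions:
        objects = predict_next(objects, a)
        trajectory.append(objects)
    return trajectory

# Imagine three moves starting with the player at (0, 0) and the boss at (5, 0).
traj = rollout([(0, 0), (5, 0)], ["left", "left", "right"])
```

Simulating a step costs a handful of arithmetic operations on object states instead of rendering a full image, which is why the model can try out many imagined futures cheaply.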

3. The Result: Learning at Light Speed

Because the robot isn't wasting energy on the background, it learns much faster.

  • In the Atari 100k benchmark: It learned to play better than previous methods using only 100,000 interactions (a tiny amount of data).
  • In Hollow Knight: This is a hard game with complex, messy graphics. Older methods failed to beat the bosses because they got confused by the visual chaos. OC-STORM, by focusing only on the fighters, learned to dodge and attack effectively, beating the bosses with far fewer attempts than previous methods.

The Big Picture

Think of the old method as trying to learn to drive a car by memorizing the color of every tree you pass.
OC-STORM is like putting on a pair of glasses that highlights only the road, the other cars, and the traffic lights. It ignores the trees.

By focusing on the objects that matter (the "actors") rather than the pixels that don't (the "background"), this new method allows AI to learn complex tasks with very little practice, making it a huge step forward for real-world applications like self-driving cars or robot assistants.
