GeoWorld: Geometric World Models

GeoWorld introduces a geometric world model that leverages Hyperbolic JEPA and Geometric Reinforcement Learning to preserve latent structural hierarchies and enable stable long-horizon visual planning, achieving state-of-the-art performance on multi-step tasks.

Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley

Published 2026-02-27

Imagine you are trying to teach a robot how to change a memory chip in a computer. The robot needs to figure out a sequence of steps: take off the case, remove the old chip, put in the new one, and snap the case back on.

Many current AI planners try to solve this by imagining the future as video: they generate what the next second looks like, then the next, then the next. But this is like trying to walk across a room by guessing what every single tile looks like before you step on it. If you make a tiny mistake guessing the color of the first tile, your guess for the second tile is wrong, and by the time you get to the tenth tile, you're completely lost. The robot gets confused, and the plan falls apart.

GeoWorld is a new way of teaching robots to plan. Instead of trying to "see" the future pixel-by-pixel, it learns to "feel" the shape of the future. Here is how it works, using some simple analogies:

1. The Flat Map vs. The Mountain Range (Euclidean vs. Hyperbolic)

Imagine you are trying to navigate a city.

  • Old AI (Euclidean): It uses a flat, square grid map. On this map, every direction feels the same. If you want to go from your house to a friend's house, it just draws a straight line. But real life isn't a flat grid; it has layers. There are neighborhoods, then streets, then buildings, then rooms. A flat map doesn't understand that "Room A" is inside "Building B."
  • GeoWorld (Hyperbolic): It uses a mountain range map. In this world, the "distance" between two things isn't just how far apart they are; it's about how they are related.
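    • Think of a family tree. You are close to your parents, but very far from your great-great-grandparents. On a flat map, everyone is just a dot, and a tree that doubles at every generation quickly runs out of room. Hyperbolic space (the mountain map) gets roomier the farther out you go, so each branch has space of its own and the "valleys" naturally group related things together.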
    • GeoWorld maps the robot's tasks onto this mountain range. It knows that "taking off the case" is a big step away from "putting the chip in," but "putting the chip in" is right next to "snapping the case back on." This shape helps the robot understand the hierarchy of the task naturally.
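To make the mountain-map idea concrete, here is a minimal sketch of distance in one common model of hyperbolic space, the Poincaré ball. This summary does not say which model GeoWorld uses, so the choice of the Poincaré ball, the function names, and the sample points are all assumptions for illustration.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points strictly inside the unit ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def euclidean_distance(u, v):
    """Ordinary flat-map distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two illustrative pairs with the same flat-map gap (0.1):
# one near the center of the ball, one near its edge.
near_center = ((0.0, 0.0), (0.1, 0.0))
near_edge = ((0.85, 0.0), (0.95, 0.0))

flat_center = euclidean_distance(*near_center)
flat_edge = euclidean_distance(*near_edge)
hyper_center = poincare_distance(*near_center)
hyper_edge = poincare_distance(*near_edge)

print("flat:", flat_center, flat_edge)
print("hyperbolic:", hyper_center, hyper_edge)
```

On the flat map both pairs are equally far apart. In the Poincaré ball, the pair near the edge is several times farther apart, which is the extra room that lets deep branches of a hierarchy spread out.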

2. The Energy Landscape (The Gravity Hill)

Imagine the robot is a marble rolling on a surface.

  • The Goal: The robot wants to get from "Start" to "Finish."
  • The Energy: In this world, "Energy" is like height. High energy means "this is a bad, difficult, or impossible path." Low energy means "this is an easy, natural path."
  • The Old Way: The robot tries to guess the path from a flat (Euclidean) map. Because that map shows no hills or drops, the marble might roll off a cliff it never saw coming.
  • The GeoWorld Way: The robot learns the shape of the hills and valleys (the Energy Landscape). It knows that the "good" path is a smooth valley (a geodesic, which is the shortest path on a curved surface). It doesn't need to guess every pixel; it just needs to roll down the valley toward the goal. Because the valley is shaped correctly (thanks to the mountain map), the robot doesn't get lost, even if the trip is long.
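The marble picture can be sketched as a toy planner that always takes the step with the lowest energy. Everything here is an assumption made for illustration: the energy is taken to be Poincaré-ball distance to the goal, `predict_next` is a hypothetical stand-in for a learned predictor, and greedy descent is a simple substitute for whatever planner the paper actually uses.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two points inside the unit Poincare disk."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def energy(state, goal):
    """Toy energy: low when the state is hyperbolically close to the goal."""
    return poincare_distance(state, goal)


def predict_next(state, action):
    """Hypothetical stand-in for a learned latent predictor: adds the action."""
    return (state[0] + action[0], state[1] + action[1])


# Illustrative action set: small moves along each axis of the latent space.
ACTIONS = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]


def roll_downhill(start, goal, max_steps=20):
    """Greedily take the action that lowers the energy most; stop when none does."""
    path = [start]
    state = start
    for _ in range(max_steps):
        candidates = [predict_next(state, a) for a in ACTIONS]
        # Discard predictions that would leave the unit disk.
        candidates = [c for c in candidates if c[0] ** 2 + c[1] ** 2 < 1.0]
        if not candidates:
            break
        best = min(candidates, key=lambda c: energy(c, goal))
        if energy(best, goal) >= energy(state, goal):
            break  # no candidate is lower: the marble is at the bottom
        state = best
        path.append(state)
    return path


goal = (0.3, 0.2)
path = roll_downhill(start=(0.0, 0.0), goal=goal)
energies = [energy(s, goal) for s in path]
print(path[-1], energies)
```

The planner never renders a frame. It only compares energies of candidate next states, so the energy drops at every step until the marble settles at the goal.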

3. The Coach (Geometric Reinforcement Learning)

Even with a great map, a robot might still take a weird shortcut that looks okay for a second but fails later.

  • GeoWorld adds a Coach: This is called Geometric Reinforcement Learning.
  • Imagine the robot is practicing a dance routine. The Coach doesn't just say "Good job" or "Bad job." The Coach says, "You took a step that broke the rhythm of the dance."
  • The Coach uses a rule called the Triangle Inequality. In simple terms: "If you go from Point A to Point B, and then on to Point C, the total distance can never be shorter than going from A to C directly."
  • If the robot tries to take a "shortcut" that breaks the natural flow of the task, the Coach pushes it back onto the smooth valley path. This stops the robot from making small mistakes that pile up into a disaster later on.
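The Coach's rule can be written as a small check. This is a sketch under stated assumptions: this summary does not give the paper's actual reward, so `triangle_violation` and the two example distance functions are hypothetical names chosen for illustration.

```python
def triangle_violation(dist, a, b, c):
    """Amount by which dist(a, c) exceeds dist(a, b) + dist(b, c).

    Returns 0.0 when the triangle inequality holds for this triple.
    """
    return max(0.0, dist(a, c) - (dist(a, b) + dist(b, c)))


def absolute_gap(x, y):
    """A true distance on numbers: it always obeys the triangle inequality."""
    return abs(x - y)


def squared_gap(x, y):
    """Not a true distance: squaring makes one long hop cost more than two short ones."""
    return (x - y) ** 2


# Illustrative points A, B, C on a line.
a, b, c = 0.0, 1.0, 2.0

ok = triangle_violation(absolute_gap, a, b, c)
broken = triangle_violation(squared_gap, a, b, c)
print(ok, broken)
```

A coach built along these lines would leave the first case alone and penalize the second, nudging the learned distances back toward ones that behave like a real geometry.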

Why Does This Matter?

If you ask an old AI to plan 3 steps ahead, it might get it right. But if you ask it to plan 6 or 8 steps, it usually fails because the tiny errors add up.
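A back-of-the-envelope sketch shows why longer plans are harder. It assumes, purely for illustration, that every step succeeds independently with the same probability. The 90% figure is made up and is not a result from the paper.

```python
def plan_success_probability(per_step_success, num_steps):
    """Chance that every step of a plan succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    return per_step_success ** num_steps


# Illustrative per-step success rate of 90%.
for steps in (3, 6, 8):
    print(steps, round(plan_success_probability(0.9, steps), 3))
```

Under that assumption a planner that is right 90% of the time per step finishes a 3-step plan about 73% of the time, but an 8-step plan only about 43% of the time.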

GeoWorld is like a GPS that understands the terrain.

  • It doesn't get lost in the details.
  • It understands that some steps are "big jumps" and others are "small steps."
  • It keeps the robot on the smoothest, most logical path, even for very long and complex tasks.

In a nutshell: GeoWorld stops trying to predict every single frame of a movie and instead learns the shape of the story. By understanding the "curved geometry" of how tasks relate to one another, it can plan complex, multi-step actions without getting confused or losing its way.
