GeoWorld: Geometric World Models

Imagine you are trying to teach a robot how to change a memory chip in a computer. The robot needs to figure out a sequence of steps: take off the case, remove the old chip, put in the new one, and snap the case back on.

Most current AI robots try to solve this by imagining the future. They try to generate a video of what the next second looks like, then the next, then the next. But this is like trying to walk across a room by guessing what every single tile looks like before you step on it. If you make a tiny mistake guessing the color of the first tile, your guess for the second tile is wrong, and by the time you get to the tenth tile, you're completely lost. The robot gets confused, and the plan falls apart.

GeoWorld is a new way of teaching robots to plan. Instead of trying to "see" the future pixel-by-pixel, it learns to "feel" the shape of the future. Here is how it works, using some simple analogies:

1. The Flat Map vs. The Mountain Range (Euclidean vs. Hyperbolic)

Imagine you are trying to navigate a city.

Old AI (Euclidean): It uses a flat, square grid map. On this map, every direction feels the same. If you want to go from your house to a friend's house, it just draws a straight line. But real life isn't a flat grid; it has layers. There are neighborhoods, then streets, then buildings, then rooms. A flat map doesn't understand that "Room A" is inside "Building B."
GeoWorld (Hyperbolic): It uses a mountain range map. In this world, the "distance" between two things isn't just how far apart they are; it's about how they are related.
- Think of a family tree. You are close to your parents, but very far from your great-great-grandparents. On a flat map, everyone is just a dot. On a mountain map (Hyperbolic space), the "valleys" naturally group related things together.
- GeoWorld maps the robot's tasks onto this mountain range. It knows that "taking off the case" is a big step away from "putting the chip in," but "putting the chip in" is right next to "snapping the case back on." This shape helps the robot understand the hierarchy of the task naturally.

2. The Energy Landscape (The Gravity Hill)

Imagine the robot is a marble rolling on a surface.

The Goal: The robot wants to get from "Start" to "Finish."
The Energy: In this world, "Energy" is like height. High energy means "this is a bad, difficult, or impossible path." Low energy means "this is an easy, natural path."
The Old Way: The robot tries to guess the path by looking at the ground. If the ground is flat (Euclidean), the marble might roll off a cliff because the map didn't show the drop.
The GeoWorld Way: The robot learns the shape of the hills and valleys (the Energy Landscape). It knows that the "good" path is a smooth valley (a geodesic, which is the shortest path on a curved surface). It doesn't need to guess every pixel; it just needs to roll down the valley toward the goal. Because the valley is shaped correctly (thanks to the mountain map), the robot doesn't get lost, even if the trip is long.

3. The Coach (Geometric Reinforcement Learning)

Even with a great map, a robot might still take a weird shortcut that looks okay for a second but fails later.

GeoWorld adds a Coach: This is called Geometric Reinforcement Learning.
Imagine the robot is practicing a dance routine. The Coach doesn't just say "Good job" or "Bad job." The Coach says, "You took a step that broke the rhythm of the dance."
The Coach uses a rule called the Triangle Inequality. In simple terms: "If you go from Point A to Point B, and then to Point C, the total distance shouldn't be weirdly shorter than going A to C directly."
If the robot tries to take a "shortcut" that breaks the natural flow of the task, the Coach pushes it back onto the smooth valley path. This stops the robot from making small mistakes that pile up into a disaster later on.

Why Does This Matter?

If you ask an old AI to plan 3 steps ahead, it might get it right. But if you ask it to plan 6 or 8 steps, it usually fails because the tiny errors add up.

GeoWorld is like a GPS that understands the terrain.

It doesn't get lost in the details.
It understands that some steps are "big jumps" and others are "small steps."
It keeps the robot on the smoothest, most logical path, even for very long and complex tasks.

In a nutshell: GeoWorld stops trying to predict every single frame of a movie and instead learns the shape of the story. By understanding the "curved geometry" of how tasks relate to one another, it can plan complex, multi-step actions without getting confused or losing its way.

1. Problem Statement

The paper addresses two critical limitations in existing Energy-Based Predictive World Models (such as V-JEPA 2) used for multi-step visual planning:

Geometric Neglect (Euclidean Limitations): Current models learn latent representations in Euclidean space. This fails to capture the inherent hierarchical and geometric structure of state transitions. In real-world tasks, the number of possible future trajectories branches exponentially with the planning horizon. Euclidean space cannot naturally encode these hierarchical relationships, leading to a lack of "geodesic awareness" (the shortest path between states) and poor long-horizon planning.
Multi-Step Shortcomings (Error Accumulation): Existing models are primarily trained on single-step transitions. When applied to long-horizon planning (predicting $T$ steps ahead), they suffer from rapid performance degradation due to accumulated errors and the inability to model long-term temporal dependencies effectively.
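The branching argument above can be made concrete with a standard fact from hyperbolic geometry (not taken from the paper itself). In two dimensions, the circumference of a circle of radius $r$ grows linearly in flat space but exponentially in a hyperbolic plane of curvature $-c$:

$C_{\mathbb{R}^2}(r) = 2\pi r, \qquad C_{\mathcal{H}^2}(r) = \frac{2\pi}{\sqrt{c}} \sinh(\sqrt{c}\, r) \approx \frac{\pi}{\sqrt{c}}\, e^{\sqrt{c}\, r} \text{ for large } r$

As an illustrative assumption, suppose each state has $b$ possible successors, so there are $b^T$ distinct trajectories at horizon $T$. Placing them at roughly equal spacing on a circle needs a radius proportional to $b^T$ in the Euclidean plane, but only $r \approx T \ln b / \sqrt{c}$ in the hyperbolic plane, which grows linearly with the horizon.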

2. Methodology: GeoWorld

The authors propose GeoWorld, a geometric world model that integrates Hyperbolic Geometry and Geometric Reinforcement Learning (GRL) to preserve structural relationships in the latent space.

A. Hyperbolic JEPA (H-JEPA)

Instead of learning in Euclidean space ($\mathbb{R}^n$), GeoWorld maps latent representations onto a Hyperbolic Manifold ($\mathcal{H}^n$), specifically using the Poincaré ball model.

Mapping: The encoder output $s_t \in \mathbb{R}^n$ is treated as a tangent vector at the origin and projected onto the hyperbolic manifold using the Exponential Map ($\text{exp}_0$).
Dynamics: The predictor $P_\phi$ learns to predict future states along hyperbolic geodesics. In hyperbolic space, geodesic distances naturally encode hierarchical relations (tree-like structures), where states at different levels of abstraction are separated by distances that reflect their structural depth.
Training Objective: The model minimizes the Poincaré-ball hyperbolic distance ($d_H$) between predicted and ground-truth latent states.
- Teacher Forcing Loss: Aligns one-step predictions with ground truth.
- Rollout Loss: Feeds predictions back as inputs to enforce consistency over multiple steps ($T=2$).
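The mapping and the two losses above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the formulas for `exp0` and `poincare_dist` are the standard Poincaré-ball expressions for curvature $-c$, while `toy_dynamics`, `biased_predictor`, the step size `0.05`, and the choice to average the rollout loss over every intermediate step are hypothetical stand-ins chosen only so the example runs.

```python
import numpy as np

EPS = 1e-6

def project(x, c):
    """Keep points strictly inside the ball of radius 1/sqrt(c)."""
    max_norm = (1.0 - 1e-5) / np.sqrt(c)
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), EPS)
    return np.where(norm > max_norm, x * max_norm / norm, x)

def exp0(v, c):
    """Exponential map at the origin: tangent vector -> point on the ball."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), EPS)
    return project(np.tanh(sqrt_c * norm) * v / (sqrt_c * norm), c)

def poincare_dist(x, y, c):
    """d_H(x, y) on the Poincare ball with curvature -c."""
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x * x, axis=-1)) * (1.0 - c * np.sum(y * y, axis=-1))
    arg = 1.0 + 2.0 * c * sq / np.maximum(denom, EPS)
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def teacher_forcing_loss(predictor, states, actions, c):
    """One-step loss: every prediction starts from a ground-truth state."""
    preds = predictor(states[:-1], actions)
    return float(poincare_dist(preds, states[1:], c).mean())

def rollout_loss(predictor, states, actions, horizon, c):
    """Multi-step loss: each prediction is fed back as the next input."""
    losses = []
    for start in range(len(states) - horizon):
        z = states[start]
        for k in range(horizon):
            z = predictor(z, actions[start + k])
            losses.append(poincare_dist(z, states[start + k + 1], c))
    return float(np.mean(losses))

# --- Toy usage (all names below are hypothetical) ---
c = 0.3  # illustrative curvature value
rng = np.random.default_rng(0)
actions = rng.normal(size=(5, 4))

def toy_dynamics(z, a):
    # Assumed ground-truth latent dynamics, used only to build example data.
    return project(z + 0.05 * a, c)

states = [exp0(0.1 * rng.normal(size=4), c)]
for a in actions:
    states.append(toy_dynamics(states[-1], a))
states = np.stack(states)

def biased_predictor(z, a):
    # An imperfect predictor: the true dynamics plus a small constant bias.
    return project(toy_dynamics(z, a) + 0.01, c)

tf_perfect = teacher_forcing_loss(toy_dynamics, states, actions, c)
tf_biased = teacher_forcing_loss(biased_predictor, states, actions, c)
ro_biased = rollout_loss(biased_predictor, states, actions, horizon=2, c=c)
print(tf_perfect, tf_biased, ro_biased)
```

A predictor that matches the data-generating dynamics gives zero teacher-forcing loss, while the biased one is penalized by both losses. In the rollout loss its second-step error also includes the bias carried over from the first step.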

B. Geometric Reinforcement Learning (GRL)

To further stabilize long-horizon planning, the authors introduce GRL, which treats planning as the optimization of an energy-based value function directly on the latent manifold.

Energy-Reward Mapping: The "cost" of a transition is defined as the hyperbolic distance between the predicted and target states. The reward is the negative of this energy cost.
Value Function: The goal is to maximize cumulative reward (minimize total hyperbolic distance) from the current state to the goal state.
Triangle Inequality Regularization: A key innovation is the addition of a regularization term ($L_\Delta$) that enforces the triangle inequality on the predicted trajectory:
$d_H(\hat{s}_t, \hat{s}_{t+2}) \leq d_H(\hat{s}_t, \hat{s}_{t+1}) + d_H(\hat{s}_{t+1}, \hat{s}_{t+2})$
This is meant to keep the predicted trajectory close to a geodesic path, preventing the model from taking "shortcuts" that violate the geometric structure of the latent space.
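A small sketch may help make the reward and the regularizer concrete. The exact form of $L_\Delta$ is not given in this summary, so the version below is an assumption: it penalizes the triangle slack $d_H(\hat{s}_t, \hat{s}_{t+1}) + d_H(\hat{s}_{t+1}, \hat{s}_{t+2}) - d_H(\hat{s}_t, \hat{s}_{t+2})$. Because any true metric already satisfies the triangle inequality, a penalty on violations alone would be zero. The slack is zero exactly when the middle state lies on the geodesic between its neighbors, which is the behavior the text describes. The function names are hypothetical.

```python
import numpy as np

def poincare_dist(x, y, c=1.0):
    """d_H(x, y) on the Poincare ball with curvature -c."""
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x * x, axis=-1)) * (1.0 - c * np.sum(y * y, axis=-1))
    arg = 1.0 + 2.0 * c * sq / np.maximum(denom, 1e-6)
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def transition_rewards(pred_states, target_states, c=1.0):
    """Reward is the negative energy, where energy = d_H(predicted, target)."""
    return -poincare_dist(pred_states, target_states, c)

def trajectory_return(pred_states, target_states, c=1.0):
    """Cumulative reward along a trajectory (higher is better)."""
    return float(transition_rewards(pred_states, target_states, c).sum())

def triangle_slack_penalty(traj, c=1.0):
    """Assumed form of L_Delta: mean detour of each consecutive triple."""
    a, b, d = traj[:-2], traj[1:-1], traj[2:]
    slack = poincare_dist(a, b, c) + poincare_dist(b, d, c) - poincare_dist(a, d, c)
    return float(np.maximum(slack, 0.0).mean())

# Diameters of the ball are geodesics, so three points on one have zero slack.
on_geodesic = np.array([[-0.4, 0.0], [0.0, 0.0], [0.4, 0.0]])
# Moving the middle point off the diameter creates a detour.
detour = np.array([[-0.4, 0.0], [0.0, 0.5], [0.4, 0.0]])

penalty_geodesic = triangle_slack_penalty(on_geodesic)
penalty_detour = triangle_slack_penalty(detour)
ret_exact = trajectory_return(on_geodesic, on_geodesic)
ret_off = trajectory_return(detour, on_geodesic)
print(penalty_geodesic, penalty_detour, ret_exact, ret_off)
```

The geodesic trajectory incurs essentially no penalty and the detour incurs a positive one. The return is zero when predictions match their targets and negative otherwise.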

C. Energy-Based Planning

During inference, planning is performed using the Cross-Entropy Method (CEM).

The frozen encoder and predictor act as a world model.
CEM searches for an action sequence that minimizes the hyperbolic energy cost (distance) between the imagined future latent state and the goal latent state.
This allows for efficient trajectory optimization without generating pixels, relying instead on the structured energy landscape of the hyperbolic manifold.
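The planning loop can be sketched as follows. This is a generic Cross-Entropy Method over action sequences with a hyperbolic energy, written under assumptions: `toy_predictor` is a hypothetical additive stand-in for the frozen predictor, states are kept in tangent coordinates and mapped to the ball only to score them, and the population size, elite count, and iteration count are illustrative values rather than the paper's settings.

```python
import numpy as np

def exp0(v, c):
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    sqrt_c = np.sqrt(c)
    norm = max(float(np.linalg.norm(v)), 1e-6)
    scale = min(np.tanh(sqrt_c * norm), 1.0 - 1e-5)  # stay inside the ball
    return scale * v / (sqrt_c * norm)

def poincare_dist(x, y, c):
    """d_H(x, y) on the Poincare ball with curvature -c."""
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - c * np.sum(x * x)) * (1.0 - c * np.sum(y * y))
    arg = 1.0 + 2.0 * c * sq / max(denom, 1e-6)
    return float(np.arccosh(max(arg, 1.0)) / np.sqrt(c))

def toy_predictor(u, a):
    # Hypothetical world model: additive dynamics in tangent coordinates.
    return u + a

def energy(action_seq, u_start, z_goal, c):
    """Hyperbolic distance between the imagined final state and the goal."""
    u = u_start
    for a in action_seq:
        u = toy_predictor(u, a)
    return poincare_dist(exp0(u, c), z_goal, c)

def cem_plan(u_start, z_goal, horizon=3, action_dim=2, c=0.3,
             pop=64, n_elite=8, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.full((horizon, action_dim), 0.5)
    best_actions = mean.copy()
    best_energy = energy(mean, u_start, z_goal, c)
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        samples = mean + std * rng.normal(size=(pop, horizon, action_dim))
        energies = np.array([energy(s, u_start, z_goal, c) for s in samples])
        elite_idx = np.argsort(energies)[:n_elite]
        elites = samples[elite_idx]
        # Refit the sampling distribution to the lowest-energy candidates.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
        if energies[elite_idx[0]] < best_energy:
            best_actions = samples[elite_idx[0]]
            best_energy = float(energies[elite_idx[0]])
    return best_actions, best_energy

c = 0.3
u_start = np.zeros(2)
z_goal = exp0(np.array([0.6, -0.3]), c)
initial_energy = energy(np.zeros((3, 2)), u_start, z_goal, c)
best_actions, best_energy = cem_plan(u_start, z_goal, c=c)
print(initial_energy, best_energy)
```

No pixels are generated at any point. Each candidate plan is scored by a single distance computation in latent space, which is the efficiency argument made above.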

3. Key Contributions

Hyperbolic JEPA (H-JEPA): A novel architecture that maps latent states to hyperbolic manifolds, enabling the model to naturally encode hierarchical state relationships and preserve geometric structure during multi-step prediction.
Geometric Reinforcement Learning (GRL): An optimization framework that refines the world model predictor by minimizing hyperbolic energy and enforcing triangle inequality constraints, ensuring geodesic-consistent rollouts.
State-of-the-Art Performance: Demonstrated significant improvements in long-horizon visual planning tasks compared to the previous SOTA (V-JEPA 2).

4. Experimental Results

The method was evaluated on two standard benchmarks for goal-conditioned visual planning: CrossTask and COIN.

Metrics: Success Rate (SR), Mean Accuracy (mAcc), and Mean Intersection over Union (mIoU).
Procedural Planning (Image-to-Image):
- GeoWorld outperformed V-JEPA 2 across all model scales (ViT-L, ViT-H, ViT-g, ViT-g384).
- Improvements: ~3% SR improvement in 3-step planning and ~2% SR improvement in 4-step planning.
Visual Planning (Video-to-Video):
- GeoWorld surpassed both generative models (e.g., VideoWorld) and LLM-based planners (e.g., GPT-5, Gemini 2.5 Pro) in specific long-horizon metrics.
Long-Horizon Stability (T=3 to T=6):
- As the planning horizon increased, Euclidean models (V-JEPA 2) suffered rapid performance degradation due to error accumulation.
- GeoWorld maintained significantly higher stability. For example, at $T=6$, GeoWorld achieved a Success Rate of 18.26% compared to 16.88% for V-JEPA 2, with the gap widening as $T$ increased further (up to $T=8$ in ablation studies).
Ablation Studies:
- Curvature: The model learns an optimal curvature ($c \approx 0.3$) that balances hierarchical representation with stability.
- GRL: Applying GRL on top of Supervised Fine-Tuning (SFT) provided consistent gains, indicating that geometric regularization matters for long-horizon consistency.
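For readers unfamiliar with the three metrics listed above, the sketch below computes them for predicted action sequences. The definitions follow common usage in procedure-planning evaluation and are an assumption here, since this summary does not spell them out: SR counts a plan as correct only if every step matches, mAcc is per-step accuracy, and mIoU compares the sets of predicted and ground-truth actions regardless of order. Some works compute mIoU over a batch rather than per sequence. The function name and the example action IDs are hypothetical.

```python
import numpy as np

def planning_metrics(pred, gt):
    """Return SR, mAcc and mIoU (in percent) for equal-length action plans."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    match = pred == gt
    sr = match.all(axis=1).mean()   # whole plan must match step by step
    macc = match.mean()             # fraction of individual steps that match
    ious = []
    for p, g in zip(pred, gt):
        p_set, g_set = set(p.tolist()), set(g.tolist())
        ious.append(len(p_set & g_set) / len(p_set | g_set))
    return {"SR": 100 * sr, "mAcc": 100 * macc, "mIoU": 100 * float(np.mean(ious))}

# Two 3-step plans: the first is exact, the second swaps its last two steps.
pred = [[1, 2, 3], [4, 5, 6]]
gt = [[1, 2, 3], [4, 6, 5]]
metrics = planning_metrics(pred, gt)
print(metrics)
```

The swapped plan fails SR and loses two steps of mAcc but keeps a perfect IoU. This is why SR is the strictest of the three, and the one that falls fastest as the horizon grows.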

5. Significance

Bridging Geometry and Planning: The paper argues that the geometric properties of the latent space are not just a mathematical curiosity but functionally important for long-horizon planning. By moving from Euclidean to Hyperbolic space, the model is better matched to the exponential branching of future possibilities.
Efficiency: Unlike generative models that decode pixels (computationally expensive and noisy), GeoWorld operates purely in latent space using energy minimization, making it more efficient and robust.
Generalization: The approach offers a new paradigm for building world models that can reason about complex, hierarchical tasks (like robot manipulation or procedural tasks) by leveraging the natural geometry of state transitions, potentially paving the way for more robust embodied AI agents.

