Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning

This paper proposes Eik-HiQRL, a hierarchical reinforcement learning framework that leverages Eikonal PDEs to reformulate quasimetric goal-reaching as a trajectory-free process, thereby achieving state-of-the-art performance in offline navigation and manipulation tasks while improving out-of-distribution generalization.

Vittorio Giammarino, Ahmed H. Qureshi

Published 2026-03-03

The Big Picture: Teaching a Robot to Find Its Way

Imagine you are trying to teach a robot to navigate a giant, complex maze. In the old days of robotics, you had to be a "reward engineer." You'd have to manually tell the robot, "Good job if you move left," "Bad job if you hit a wall," "Bonus points for turning right." This is tedious, prone to errors, and often leads to the robot finding weird loopholes (like spinning in circles to get points) instead of actually solving the problem.

Goal-Conditioned RL (GCRL) is a smarter way. Instead of giving the robot a checklist of rewards, you just say, "Go to that spot." The robot's job is simply to figure out how to get there.

This paper introduces a new, super-smart way to teach the robot how to calculate the best path to any goal, using a mix of geometry, physics, and hierarchy.


1. The Map Maker: Quasimetrics (The "Distance" Rule)

First, the authors look at how the robot thinks about distance.

  • The Old Way: The robot learns by trial and error, step-by-step. It's like walking through a dark room and bumping into things to learn where the walls are.
  • The New Way (Quasimetrics): The authors realized that the "value" of a state (how good it is to be there) is actually just the shortest distance to the goal.
    • Analogy: Imagine the robot has a magical map. On this map, the "value" of a location isn't a number; it's the length of the shortest path to the finish line.
  • The Catch: In the real world, you can't always go in a straight line (walls, obstacles). So, the distance from A to B might be different from the distance from B to A. This is called a Quasimetric. It's like a one-way street system where the distance depends on the direction you are traveling.

The previous method (QRL) tried to learn this map by looking at specific steps the robot took (e.g., "I moved from here to there"). It was like learning a city by only looking at the specific streets you drove on yesterday.
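The one-way-street analogy can be made concrete with a toy example (illustrative, not from the paper): shortest-path distances on a directed graph form a quasimetric, so d(A, B) and d(B, A) can differ while the triangle inequality still holds.

```python
import math

# Toy quasimetric: shortest-path distances on a directed graph
# ("one-way streets"). d(A, B) and d(B, A) can differ, but
# d(a, c) <= d(a, b) + d(b, c) always holds for shortest paths.

nodes = ["A", "B", "C"]
# Directed edge weights; note the cheap one-way street from A to B.
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "A"): 1.0,
         ("B", "A"): 5.0}

# Floyd-Warshall all-pairs shortest paths.
d = {(u, v): (0.0 if u == v else edges.get((u, v), math.inf))
     for u in nodes for v in nodes}
for k in nodes:
    for i in nodes:
        for j in nodes:
            d[i, j] = min(d[i, j], d[i, k] + d[k, j])

print(d["A", "B"])  # 1.0  (direct one-way street)
print(d["B", "A"])  # 2.0  (cheaper to detour B -> C -> A than pay 5.0)
```

The asymmetry (1.0 one way, 2.0 the other) is exactly what an ordinary symmetric metric cannot represent, and why the value function needs a quasimetric.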

2. The Physics Upgrade: Eikonal Constraints (The "Speed Limit" Rule)

This is the paper's main innovation. The authors asked: Why do we need to look at specific steps? Can we just use the laws of physics to learn the map?

They used a famous equation from physics called the Eikonal Equation.

  • The Analogy: Think of a forest fire spreading. The fire spreads at a constant speed in all directions. The "Eikonal Equation" describes the shape of the fire front.
  • The Application: The authors treat the robot's movement like that fire. They assume the robot moves at a "unit speed" (it takes 1 second to move 1 meter).
  • The Magic: Instead of needing a video of the robot walking (trajectories), they just need a list of random points in the room. They tell the AI: "The slope of your map must always equal 1."
    • If you are 10 meters away, the map value should be 10.
    • If you are 5 meters away, the map value should be 5.
    • The "slope" (how fast the value changes as you move) must equal exactly 1 everywhere, not just stay constant.

Why is this cool?

  • No Trajectories Needed: You don't need to watch the robot walk. You can just throw random darts at a map of the room, and the AI learns the whole map at once.
  • Better Generalization: Because it's learning the laws of the space (like a physicist), it can guess the path to a goal it has never seen before, even if it's in a weird spot.
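The "slope must equal 1" rule can be sketched as a training penalty. The following is a minimal illustration (assuming a PyTorch value network; the names and the exact loss form are illustrative, not the paper's implementation): sample random state-goal pairs, compute the gradient of the value with respect to the state, and penalize its norm for deviating from 1.

```python
import torch

# Illustrative Eikonal penalty (not the paper's exact objective):
# enforce |grad_s V(s, g)| = 1 at randomly sampled states, which is
# the unit-speed Eikonal constraint described above.

value_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)  # input: a 2-D state concatenated with a 2-D goal

def eikonal_loss(states, goals):
    states = states.requires_grad_(True)   # track gradients w.r.t. the state
    v = value_net(torch.cat([states, goals], dim=-1))
    grad = torch.autograd.grad(v.sum(), states, create_graph=True)[0]
    # Squared deviation of the gradient norm from unit speed.
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

# "Throw random darts": no trajectories needed, just random points.
states = torch.rand(128, 2)
goals = torch.rand(128, 2)
loss = eikonal_loss(states, goals)
loss.backward()  # an optimizer step would follow in a real training loop
```

Note that nothing here consumes a trajectory: every batch is just random points in the state space, which is the source of the data-efficiency claim.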

3. The Problem: The "Flat" Map Breaks Down

There's a catch. The "Unit Speed" assumption works great in a simple, empty room (like a point-maze). But what if the robot is a complex ant-like machine with many joints, or a human-like robot?

  • The Reality: Real robots have joints, friction, and complex physics. They can't move at a constant speed in every direction. Sometimes they get stuck; sometimes they slide.
  • The Result: If you try to force a complex robot to follow the simple "Unit Speed" rule, the math breaks, and the robot gets confused. The "flat" map becomes inaccurate.

4. The Solution: Hierarchy (The "General and the Sergeant")

To fix the complexity problem, the authors introduced Hierarchy. They split the robot's brain into two levels:

  1. The General (High-Level):

    • Job: Looks at the big picture. It doesn't care about the robot's knee joints or wheel friction. It only cares about the abstract location (e.g., "I need to get to the kitchen").
    • The Trick: The General uses the Eikonal method (the simple physics rule) because, in this abstract world, the rules are simple. It breaks the big journey into smaller "sub-goals" (e.g., "Go to the hallway," then "Go to the kitchen").
    • Analogy: The General draws a straight line on a map from Start to Finish and says, "Head North for 5 miles, then turn East."
  2. The Sergeant (Low-Level):

    • Job: Handles the messy details. It takes the General's order ("Go North") and figures out how to actually move the robot's legs to do it, dealing with friction, slipping, and obstacles.
    • The Trick: The Sergeant uses standard, proven methods (Temporal Difference learning) that are good at handling complex, messy physics.

The Result: The General uses the super-efficient "Physics Map" to plan the route, while the Sergeant uses "Street Smarts" to actually drive the car.
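The General/Sergeant loop can be sketched in a few lines (a toy illustration, not the paper's code: the learned quasimetric is stood in for by a Euclidean distance, and the TD-trained low-level policy by a bounded step toward the subgoal).

```python
import numpy as np

# Toy two-level control loop: every K steps the high-level "General"
# emits a subgoal that reduces the (stand-in) learned distance-to-goal;
# the low-level "Sergeant" then takes small primitive actions toward it.

K = 5  # subgoal horizon: how long the Sergeant follows one order

def learned_distance(s, g):
    # Stand-in for the Eikonal-trained quasimetric; Euclidean here.
    return np.linalg.norm(g - s)

def high_level_subgoal(s, g, step=1.0):
    # General: step a fixed amount along the direction that shrinks
    # the learned distance (a crude gradient-style re-plan).
    direction = (g - s) / (learned_distance(s, g) + 1e-8)
    return s + step * direction

def low_level_action(s, subgoal, max_speed=0.3):
    # Sergeant: TD-trained policy in the paper; here just a bounded
    # move toward the current subgoal.
    return np.clip(subgoal - s, -max_speed, max_speed)

state, goal = np.zeros(2), np.array([4.0, 3.0])
for t in range(50):
    if t % K == 0:                                    # General re-plans
        subgoal = high_level_subgoal(state, goal)
    state = state + low_level_action(state, subgoal)  # Sergeant acts
    if learned_distance(state, goal) < 0.1:
        break

print(np.round(state, 2))  # ends at the goal, [4. 3.]
```

The division of labor is the point: the General only ever reasons about abstract positions and the learned distance, while all the messy per-step actuation lives inside `low_level_action`.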

Summary of the Win

  • Old Way: Learn by walking every step (slow, needs lots of data).
  • Middle Way (QRL): Learn by looking at steps and forcing them to fit a distance rule (better, but still needs step data).
  • New Way (Eik-HiQRL):
    1. Use Physics Laws (Eikonal) to learn the map instantly from random points (no walking needed!).
    2. Use Hierarchy to separate the "Simple Planning" from the "Complex Driving."

The Outcome: The robot learns faster, makes fewer mistakes (collisions), and can navigate huge, complex environments (like a giant maze or a robot arm moving a box) better than any previous method. It's like giving the robot a GPS that understands the laws of physics, rather than just a list of turn-by-turn directions.
