Understanding and Improving Hyperbolic Deep Reinforcement Learning

This paper addresses the optimization challenges in hyperbolic deep reinforcement learning. It identifies the destabilizing effect of large-norm embeddings and introduces HYPER++, a new agent that combines feature regularization, a categorical value loss, and improved layer formulations to achieve stable, faster training and superior performance compared to existing Euclidean and hyperbolic baselines.

Timo Klein, Thomas Lang, Andrii Shkabrii, Alexander Sturm, Kevin Sidak, Lukas Miklautz, Claudia Plant, Yllka Velaj, Sebastian Tschiatschek

Published 2026-03-09

Imagine you are teaching a robot to play a complex strategy game, like chess or a video game where it has to eat smaller fish to grow bigger. Every move the robot makes branches out into thousands of new possibilities, creating a massive, ever-expanding tree of "what could happen next."

The Problem: The Wrong Map

For a long time, AI researchers have tried to teach robots using Euclidean geometry. Think of this like drawing a map on a flat sheet of graph paper.

  • The Issue: On a flat sheet, space grows slowly (polynomially). But the game tree grows explosively (exponentially).
  • The Analogy: It's like trying to fit a giant, sprawling city with millions of streets onto a single, flat postcard. To make it fit, you have to squish and stretch the streets until the map is distorted. The robot gets confused because the "distance" between two related moves looks wrong on this flat map. This leads to the robot learning slowly or getting stuck.
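The mismatch above can be made concrete with a toy calculation (the branching factor of 3 is an arbitrary choice for illustration): the number of game states at each depth grows exponentially, while the area available on a flat map grows only polynomially with distance from the center.

```python
import math

branching = 3  # hypothetical branching factor of the game tree
for depth in range(1, 6):
    tree_states = branching ** depth   # exponential: 3, 9, 27, 81, 243
    flat_area = math.pi * depth ** 2   # polynomial: ~3.1, 12.6, 28.3, ...
    print(f"depth {depth}: {tree_states} states vs ~{flat_area:.0f} units of flat area")
```

By depth 5 the tree already has 243 states but the flat map has gained only ~79 units of area, so the "streets" must be squished together more and more.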

The Solution: A Better Map (Hyperbolic Geometry)

The authors of this paper suggest using Hyperbolic geometry.

  • The Analogy: Imagine a crinkly lettuce leaf or a piece of coral. As you move away from the center, the surface expands incredibly fast (exponentially). This shape naturally fits the "tree-like" structure of decision-making: you can map the entire game tree onto it with almost no squishing or distortion.
  • The Promise: If the robot uses this "coral reef" map, it should understand the game's hierarchy much better and learn faster.

The Catch: The Map is Unstable

Here is the twist: While the "coral reef" map is theoretically perfect, it's incredibly hard to use.

  • The Problem: When the robot tries to learn on this curved map, the math gets messy. The numbers representing the robot's knowledge (called "embeddings") tend to grow too large, like a balloon inflating until it pops.
  • The Consequence: When these numbers get too big, the robot's "brain" (the neural network) starts to glitch. The training signal becomes noisy, the robot forgets what it learned, and the whole process crashes. Previous attempts to fix this were like trying to hold the balloon down with a heavy weight (SpectralNorm), which stopped it from popping but also stopped it from growing big enough to be useful.

The Fix: HYPER++

The authors introduce a new system called HYPER++. They didn't just try to patch the old map; they redesigned the whole driving system to handle the unique terrain. They used three main tricks:

  1. The Speed Governor (RMSNorm & Scaling):
    Instead of using a heavy weight to stop the balloon, they installed a smart "speed governor." This keeps the robot's knowledge numbers within a safe, healthy range. It prevents the numbers from exploding (which causes crashes) but still lets them grow enough to be useful. It's like cruise control that keeps the car fast but safe, rather than slamming on the brakes.

  2. Switching the Vehicle (Hyperboloid Model):
    They realized that the specific type of "coral reef" they were using (the Poincaré Ball) was too slippery and unstable for high-speed learning. They switched to a slightly different shape called the Hyperboloid.

    • The Analogy: It's like switching from a bouncy, unstable trampoline to a solid, curved slide. The slide still has the same "expanding space" benefits, but the math is much smoother and less prone to glitches.
  3. Changing the Scorecard (Categorical Value Loss):
    In the old system, the robot tried to guess a single exact number for "how good is this move?" (like guessing a temperature). On a curved map, this is hard.

    • The Fix: They changed the game. Instead of guessing one exact number, the robot now predicts probabilities over a set of buckets (like "Is the temperature Cold, Warm, or Hot?"). This "categorical" approach is much more stable and plays well with the curved geometry of the map.
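Trick 1, the "speed governor," can be sketched in a few lines. This is a generic RMSNorm on a feature vector, not the paper's exact formulation (which couples it with learned scaling inside the network); the function name and input values here are illustrative.

```python
import numpy as np

def rms_norm(x, gain=1.0, eps=1e-8):
    """Rescale a feature vector so its root-mean-square activation is ~gain.

    Acts like a speed governor: the embedding can still point in any
    direction and carry information, but its overall size stays in a
    healthy range instead of ballooning as training goes on.
    """
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return gain * x / rms

ballooning = np.array([120.0, -60.0, 30.0])  # an embedding growing too large
governed = rms_norm(ballooning)
print(np.sqrt(np.mean(governed ** 2)))       # root-mean-square is now ~1.0
```

Unlike a hard clamp (the "heavy weight"), this rescaling preserves the direction of the embedding, so the useful information survives while the magnitude stays bounded.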
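Trick 2 can also be made concrete. A minimal sketch of the hyperboloid (Lorentz) model with curvature -1 follows; the helper names are hypothetical, but the lift and distance formulas are the standard ones for this model.

```python
import numpy as np

def lift_to_hyperboloid(v):
    """Lift a Euclidean vector v onto the unit hyperboloid: the extra
    'time' coordinate x0 is chosen so that -x0**2 + ||v||**2 = -1."""
    x0 = np.sqrt(1.0 + np.dot(v, v))
    return np.concatenate(([x0], v))

def hyperboloid_distance(x, y):
    """Geodesic distance: arccosh of the negative Minkowski inner product
    <x, y>_L = -x0*y0 + <x_rest, y_rest>. Note there is no division by
    (1 - ||x||**2), the term that makes the Poincare ball numerically
    'slippery' for points near its rim."""
    inner = -x[0] * y[0] + np.dot(x[1:], y[1:])
    return np.arccosh(np.clip(-inner, 1.0, None))  # clip guards float error

a = lift_to_hyperboloid(np.array([0.5, 0.0]))
b = lift_to_hyperboloid(np.array([-0.5, 0.0]))
print(hyperboloid_distance(a, b))
```

The space is the same "coral reef" either way; the hyperboloid is simply a parameterization of it whose formulas behave better in floating-point arithmetic.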
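Trick 3, turning a single guessed number into buckets, is often implemented with a "two-hot" encoding. This is a generic sketch of that idea, not necessarily the paper's exact loss; the bin range and function name are illustrative.

```python
import numpy as np

def two_hot(value, bins):
    """Encode a scalar target as a probability distribution over fixed
    buckets: all the mass goes to the two bins nearest the value.
    The value network is then trained with cross-entropy against this
    target instead of regressing the raw number."""
    value = float(np.clip(value, bins[0], bins[-1]))
    probs = np.zeros(len(bins))
    upper = int(np.searchsorted(bins, value))
    if upper == 0 or bins[upper] == value:  # value sits exactly on a bin
        probs[upper] = 1.0
        return probs
    lower = upper - 1
    w = (value - bins[lower]) / (bins[upper] - bins[lower])
    probs[lower], probs[upper] = 1.0 - w, w
    return probs

bins = np.linspace(-1.0, 1.0, 5)  # buckets at -1, -0.5, 0, 0.5, 1
print(two_hot(0.25, bins))        # mass split between the 0 and 0.5 bins
```

A nice property of this encoding is that no information is lost: taking the expectation over the buckets recovers the original value exactly.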

The Results

When they tested this new system:

  • Faster Learning: The new agent trained about 30% faster in wall-clock time, because it no longer crashed or got stuck.
  • Better Performance: It beat all previous attempts at using curved maps and even outperformed standard flat-map robots in many games.
  • Versatility: It worked not just with one learning algorithm but with several different ones, showing it is a robust, general fix.

Summary

The paper is about realizing that while curved maps are the perfect way to understand complex, branching decisions, they are notoriously difficult to drive on. The authors built a new HYPER++ vehicle with better suspension (regularization), a smoother road (Hyperboloid model), and a better navigation system (categorical loss) to finally make hyperbolic deep reinforcement learning practical and powerful.