Latent Poincaré Shaping for Agentic Reinforcement Learning

The paper introduces LaPha, a method that trains AlphaZero-like LLM agents in a hyperbolic Poincaré latent space to leverage negative curvature for efficient search and dense process rewards, significantly boosting mathematical reasoning performance on benchmarks like MATH-500 and AIME.

Hanchen Xia, Baoyou Chen, Zelin Zang, Yutang Ge, Guojiang Zhao, Siyu Zhu

Published 2026-03-09

The Big Idea: Teaching AI to "Think" Like a Tree

Imagine you are trying to teach a robot to solve a very hard math problem.

  • Old Way: You ask the robot, "What's the answer?" It thinks for a second and spits out one long string of text. If it's wrong, you tell it "No," and it tries again from scratch. It's like asking a student to write an essay in one breath; if they stumble on the third word, the whole thing is ruined.
  • The New Way (LaPha): Instead of one long breath, we tell the robot: "Think step-by-step. If you get stuck, branch off and try a different path." This creates a Tree of Thoughts. The robot explores many different possibilities (branches) before picking the best one.

The problem is that this "Tree" gets huge, very fast. If the robot has 10 choices at every step, after just 5 steps, it has 100,000 paths to check. Most of them are dead ends. Checking them all is a waste of time and energy.
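The blow-up described above is just exponentiation: with a branching factor b and depth d, the tree holds b**d complete paths. A minimal sketch (the function name is my own, for illustration):

```python
def num_paths(branching_factor: int, depth: int) -> int:
    # A tree with b choices at every step has b**d distinct
    # root-to-leaf paths after d steps.
    return branching_factor ** depth

print(num_paths(10, 5))  # -> 100000, as in the example above
```

Even a modest branching factor makes exhaustive search hopeless a few steps in, which is why the search needs guidance.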

The Problem: The "Crowded Room" of Ideas

Current AI models live in a "flat" (Euclidean) mental space, like a flat sheet of paper. When you try to map a giant, branching tree of thoughts onto a flat sheet, everything gets squished together at the edges.

  • The Analogy: Imagine trying to hang 1,000 different coats in a small, flat closet. Eventually, they all pile up on top of each other. You can't tell the "Red Coat" from the "Blue Coat" anymore because they are all crammed in the same spot.
  • The Result: The AI gets confused. It can't tell the difference between a "good" thought and a "bad" thought because they look too similar in its flat mental space.

The Solution: The "Poincaré Ball" (The Infinite Funnel)

The authors of this paper, LaPha, decided to stop using a flat closet and move the AI's thinking into a Hyperbolic Space (specifically, a Poincaré ball).

  • The Analogy: Imagine a magical, infinite funnel.
    • The center of the funnel is where the robot starts (the prompt).
    • As the robot thinks deeper and deeper (moving toward the edge of the funnel), the space expands exponentially.
    • Near the center, there is little room. But near the edge, there is infinite room.
  • Why this helps: In this magical funnel, you can hang 1,000,000 coats, and they will all have their own perfect spot. The "Red Coat" and the "Blue Coat" are far apart, even if they look similar. The AI can clearly see the difference between a good path and a bad path because the geometry of the space naturally separates them.
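The "infinite room near the edge" intuition comes straight from the standard geodesic distance formula on the Poincaré ball. This is a generic sketch of that textbook formula, not the paper's implementation:

```python
import math

def poincare_distance(u, v):
    # Geodesic distance between two points inside the unit Poincare ball:
    # d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    sq = lambda x: sum(xi * xi for xi in x)
    diff = [ui - vi for ui, vi in zip(u, v)]
    return math.acosh(1 + 2 * sq(diff) / ((1 - sq(u)) * (1 - sq(v))))

# Near the center, hyperbolic distance is close to Euclidean distance;
# near the boundary, the same coordinate separation becomes huge.
print(poincare_distance([0.1, 0.0], [0.0, 0.1]))    # small
print(poincare_distance([0.95, 0.0], [0.0, 0.95]))  # much larger
```

The denominators (1 - |u|^2) shrink toward zero at the boundary, so the space effectively expands without bound there. That is exactly what lets exponentially many tree branches stay well separated.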

The Magic Trick: "Shaping" the Rewards

In the past, when an AI tried to solve a problem, it only got a reward at the very end: "Correct!" or "Wrong!"

  • The Problem: If the AI takes 20 steps to solve a problem, and it gets "Wrong" at the end, it doesn't know which of the 20 steps was the mistake. It's like playing a game of "Hot and Cold" but only being told "You lost" at the very end.

LaPha's Fix:
Because the AI is now in this magical funnel, the authors created a GPS system for the AI.

  1. They measure the distance from the current thought to the "Goal" (the correct answer) using the geometry of the funnel.
  2. As the AI moves closer to the goal, it gets a tiny "Good job!" reward at every single step.
  3. This is called Reward Shaping. Instead of waiting for the final grade, the AI gets a gold star for every step that moves it closer to the solution.
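The steps above can be sketched as potential-based reward shaping in the style of Ng et al., with the potential set to the negative hyperbolic distance to the goal embedding. The paper's exact formulation may differ; this is an illustrative assumption, and the coordinates below are made up:

```python
import math

def poincare_distance(u, v):
    # Standard geodesic distance on the unit Poincare ball.
    sq = lambda x: sum(xi * xi for xi in x)
    diff = [ui - vi for ui, vi in zip(u, v)]
    return math.acosh(1 + 2 * sq(diff) / ((1 - sq(u)) * (1 - sq(v))))

def shaped_reward(base_reward, state, next_state, goal, gamma=0.99):
    # Potential-based shaping: r' = r + gamma * phi(s') - phi(s),
    # where phi(s) = -distance(s, goal). A step that moves closer to
    # the goal earns a small positive bonus; moving away is penalized.
    phi = lambda s: -poincare_distance(s, goal)
    return base_reward + gamma * phi(next_state) - phi(state)

goal = [0.8, 0.0]
print(shaped_reward(0.0, [0.1, 0.0], [0.4, 0.0], goal))  # positive: moved closer
print(shaped_reward(0.0, [0.4, 0.0], [0.1, 0.0], goal))  # negative: moved away
```

Potential-based shaping has the nice property that it densifies the reward signal without changing which policy is optimal, so the "gold star at every step" does not distort the final objective.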

The "Lightweight" Brain

Usually, to guide this search, you need a second, massive AI brain to tell the first one which path is good. This is slow and expensive.

  • LaPha's Trick: They attached a tiny, simple "value head" (a small calculator) to the main AI. Because the AI is already thinking in this magical funnel, this tiny calculator can easily see which paths are "close" to the goal.
  • Result: The AI can now guide its own search at test time (when solving new problems) without needing a massive, slow external brain. It scales up its own intelligence just by thinking a bit longer and checking more branches.
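A value head of this kind is typically just a tiny trainable layer reading the frozen model's hidden state. The class below is a hypothetical sketch of that idea (names, dimensions, and the sigmoid output are my assumptions, not the paper's architecture):

```python
import math
import random

class ValueHead:
    # Hypothetical lightweight value head: one linear layer mapping a
    # hidden-state vector to a scalar score in (0, 1). The base model's
    # weights stay frozen; only these few parameters would be trained.
    def __init__(self, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.gauss(0, 1 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
        self.b = 0.0

    def score(self, hidden_state):
        z = sum(wi * hi for wi, hi in zip(self.w, hidden_state)) + self.b
        return 1 / (1 + math.exp(-z))  # sigmoid: interpretable as "promise"

# At test time, score each candidate branch and expand the most promising:
head = ValueHead(hidden_dim=4)
branches = {"A": [0.2, -0.1, 0.5, 0.3], "B": [-0.4, 0.9, 0.1, -0.2]}
best = max(branches, key=lambda name: head.score(branches[name]))
```

Because scoring a branch is one dot product rather than a second forward pass through a large model, the search can afford to evaluate many branches per step.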

The Results: Supercharging Math Skills

The paper tested this on some of the hardest math competitions (like AIME and MATH-500).

  • Before: A small AI model (1.5 billion parameters) got about 30% on hard math problems.
  • After LaPha: That same small model jumped to 56%.
  • With Self-Guided Search: When the model was allowed to "think longer" (run more branches of the tree) using its new GPS, it hit 88% on standard math tests and even beat some of the world's most advanced models (like OpenAI's o1-mini) on specific hard challenges.

Summary

LaPha is like giving an AI a magic, expanding map (the Poincaré ball) instead of a flat piece of paper. This map prevents the AI's thoughts from getting crowded. It also gives the AI a GPS that rewards it for every step closer to the answer, rather than just waiting for the final result. This allows even small, cheap AI models to solve incredibly hard problems by "thinking" smarter and deeper.