Imagine you are teaching a robot to navigate a massive, complex maze. The old way of doing this was to give the robot a specific destination (like "go to the red door") and let it figure out every single step to get there. But what if you wanted the robot to learn how to handle any kind of reward, not just finding a door? Maybe you want it to collect coins, avoid traps, or find a specific pattern of colors.

This paper introduces a new way to teach robots called Switching Successor Measures. Here is the simple breakdown of how it works, using everyday analogies.

The Problem: The "Fixed Step" Trap

Previous methods tried to break big problems into smaller ones by saying, "Take exactly 10 steps, then stop and pick a new goal."

The Flaw: Imagine trying to walk across a room. If you force yourself to take exactly 10 steps every time you change your mind, you might end up in the middle of a wall or a puddle. Real life isn't about fixed steps; it's about reaching a specific spot (like a chair) and then deciding what to do next. The old methods were too rigid and only worked well for simple "find the goal" tasks.

The Solution: The "Smart Switch"

The authors propose a system where the robot learns two things at the same time from a single "map" of the world:

The High-Level Plan: "I need to get to that chair first."
The Low-Level Action: "Okay, I'm walking toward the chair."

The magic trick is called Switching Successor Measures. Think of it like a GPS that doesn't just show you the route to the final destination, but also understands the "value" of stopping at any intermediate point.

The Analogy: Imagine you are hiking.
- Old Way: You have a map that only tells you how to get to the summit. If you want to stop at a waterfall halfway up, you have to re-calculate the whole map from scratch.
- New Way (This Paper): You have a "Super Map" that knows the terrain. It tells you: "If you head toward the waterfall, you will get there in 5 minutes. Once you are there, you can instantly switch your plan to head toward the summit." The robot learns to "switch" its focus from one sub-goal to another seamlessly, without needing a new map or a teacher to tell it exactly when to switch.

How It Works (The "FB π-Switch" Algorithm)

The paper calls their method FB π-Switch. Here is the process in plain English:

Learning the "Feel" of the World: First, the robot looks at a bunch of old videos of itself (or others) moving around. It learns a "successor measure."
- Analogy: This is like learning the "vibe" of every room in a house. You know that if you are in the kitchen, you are likely to end up in the dining room soon. You don't need to know the exact path every time; you just know the probability of where you'll be.
The "Switch" Moment: The robot learns that it can follow a path to a sub-goal (like the kitchen), and the moment it gets there, it can "switch" its internal logic to start heading toward the final goal (the dining room).
No Extra Training: The best part is that the robot figures out how to break the big task into small pieces all by itself. It doesn't need a human to say, "Stop here and pick a new goal." The structure of the math naturally creates these sub-goals.

Why It Matters

The researchers tested this on two types of tasks:

Goal-Conditioned: "Go to the red flag." (Like a standard video game level).
General Rewards: "Collect as many coins as possible while avoiding spikes." (A much harder, more complex task).

The Results:

The new method performed comparably to HIQL, a leading hierarchical method, on the simple "go to the flag" tasks.
On the more complex "collect coins" tasks, it achieved the best average performance. Because it wasn't stuck using fixed steps, it could adapt to complex reward landscapes where the best path wasn't a straight line.

The Bottom Line

This paper shows that you don't need to manually design complex hierarchies or tell a robot exactly when to switch tasks. By using a specific mathematical framework (Switching Successor Measures), a robot can learn a single, flexible "understanding" of the world that naturally allows it to break big problems into smaller, manageable steps on its own. It's like giving the robot a brain that can naturally see the "big picture" and the "small steps" at the same time.

Technical Summary: Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

Problem Statement

Hierarchical Reinforcement Learning (HRL) aims to improve generalization by decomposing long-horizon decision-making into simpler subproblems. However, existing approaches often rely on restrictive design choices, such as fixed temporal abstractions or goal-conditioned objectives, which limit their applicability to general reward functions. Furthermore, methods like HIQL enforce locality through fixed subgoal horizons rather than allowing it to emerge from learning.

Concurrently, Successor Measures (SM), particularly via Forward-Backward (FB) representations, offer a framework for zero-shot adaptation to arbitrary reward functions by representing value functions in a shared embedding space. Yet, these methods typically assume a strong global factorization ( $F(s, a, z)^\top B(g)$ ) that may be difficult to learn in complex environments. Recent findings suggest that successor representations are most reliable locally, capturing short-range transitions effectively while accuracy degrades over long horizons.
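To make the FB factorization concrete, here is a minimal sketch of zero-shot reward inference under the standard FB recipe: a reward is turned into a task embedding $z_r$ by averaging $r(s) B(s)$ over data states, and action values are read off as $F(s, a, z_r)^\top z_r$. The arrays `B` and `W` and the function `forward` are hypothetical stand-ins for trained networks (random here), and the tiny discrete state space is an assumption made for illustration; this is not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 6, 3, 4

# Hypothetical stand-ins for trained models: random arrays, not learned networks.
B = rng.normal(size=(n_states, d))                 # backward embedding B(g), one row per state
W = rng.normal(size=(n_states, n_actions, d, d))   # parameters of the toy forward map

def forward(z):
    """Toy F(s, a, z), returned for all (s, a) at once with shape (S, A, d)."""
    return np.tanh(W @ z)

def infer_task_embedding(reward, B):
    """z_r = E_s[r(s) B(s)], with a uniform distribution over states assumed."""
    return (reward[:, None] * B).mean(axis=0)

def q_values(z):
    """Q(s, a) = F(s, a, z)^T z for every state-action pair, shape (S, A)."""
    return forward(z) @ z

# A goal-reaching reward is just one special case of an arbitrary reward vector.
reward = np.zeros(n_states)
reward[4] = 1.0

z_r = infer_task_embedding(reward, B)
Q = q_values(z_r)
greedy_actions = Q.argmax(axis=1)   # one greedy action per state, no further training
```

Swapping `reward` for any other vector changes only `z_r`; the representations themselves are reused, which is what "zero-shot" refers to here.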

A critical gap exists: there is no unified approach that leverages successor representations to derive hierarchical policies directly from the learned representation while retaining the ability to generalize across arbitrary (non-goal-reaching) reward functions. Current pipelines often separate representation learning from policy learning, failing to exploit the structural encoding of successor representations for joint planning and control.

Methodology: Switching Successor Measures and FB $\pi$ -Switch

The authors introduce Switching Successor Measures, a framework that enables hierarchical control in zero-shot RL without additional supervision, fixed horizons, or manually designed subgoals.

Theoretical Foundation

The core insight is that the switching successor measures required for high-level planning can be derived directly from a single classical successor measure.

Switching Advantage: The authors define a $k$ -step advantage function where an agent follows a subgoal-conditioned policy $\pi_w$ for $k$ steps and then switches to a globally efficient policy $\pi$ .
Hitting-Time Switching: To address the bias introduced by fixed horizons (where $k$ steps may not align with reaching a subgoal), the framework replaces fixed $k$ with the hitting time $H^{\pi_w}_s(w)$ .
Theorem 1: The paper establishes a key identity relating the switching successor measure $M^{\pi_w \to \pi}_s$ to standard successor measures:
$M^{\pi_w \to \pi}_s(s') = M^{\pi_w}_s(s') + \frac{M^{\pi_w}_s(w)}{M^{\pi_w}_w(w)} \left( M^{\pi}_w(s') - M^{\pi_w}_w(s') \right)$
This theorem demonstrates that hierarchy is implicitly encoded in standard successor representations and can be recovered without additional learning.
Corollary 1: The switching advantage function is derived as:
$A^{\pi_w \to \pi}_s(r) = V^{\pi_w}(s; r) + \frac{M^{\pi_w}_s(w)}{M^{\pi_w}_w(w)} \left( V^{\pi}(w; r) - V^{\pi_w}(w; r) \right) - V^{\pi}(s; r)$
This serves as the objective for the high-level policy to select subgoals $w$ .
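The identity in Theorem 1 and the advantage in Corollary 1 can be checked numerically in a small tabular setting. The sketch below assumes finite state spaces, unnormalized discounted successor measures $M = (I - \gamma P)^{-1}$, and randomly generated state-transition matrices `P_w` and `P_pi` standing in for the chains induced by $\pi_w$ and $\pi$; all function names are hypothetical. In this setting the ratio $M^{\pi_w}_s(w) / M^{\pi_w}_w(w)$ equals the expected discount accumulated until $w$ is first reached.

```python
import numpy as np

def successor_measure(P, gamma):
    """Discounted successor measure sum_t gamma^t P^t = (I - gamma P)^{-1}."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

def switching_measure_identity(M_w, M_pi, w):
    """Right-hand side of the Theorem 1 identity, for every start state s."""
    ratio = M_w[:, w] / M_w[w, w]          # expected discount at the hitting time of w
    return M_w + ratio[:, None] * (M_pi[w] - M_w[w])[None, :]

def switching_measure_direct(P_w, P_pi, w, gamma):
    """Brute-force reference using an augmented (state, mode) Markov chain."""
    n = P_w.shape[0]
    P_aug = np.zeros((2 * n, 2 * n))
    # Mode 0 (indices 0..n-1): following pi_w, subgoal w not reached yet.
    P_aug[:n, :n] = P_w
    P_aug[:n, n + w] = P_w[:, w]           # arriving at w triggers the switch
    P_aug[:n, w] = 0.0
    # Mode 1 (indices n..2n-1): following pi forever after the switch.
    P_aug[n:, n:] = P_pi
    M_aug = np.linalg.inv(np.eye(2 * n) - gamma * P_aug)
    M = M_aug[:n, :n] + M_aug[:n, n:]      # occupancy of s' summed over both modes
    M[w] = M_aug[n + w, n:]                # starting at w means switching immediately
    return M

def random_chain(n, rng):
    """Random row-stochastic matrix."""
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, gamma, w = 5, 0.9, 2
P_w, P_pi = random_chain(n, rng), random_chain(n, rng)
M_w, M_pi = successor_measure(P_w, gamma), successor_measure(P_pi, gamma)

# Theorem 1: identity versus explicit simulation of the switch.
M_switch = switching_measure_identity(M_w, M_pi, w)
M_direct = switching_measure_direct(P_w, P_pi, w, gamma)
theorem_holds = np.allclose(M_switch, M_direct)

# Corollary 1: switching advantage for an arbitrary reward vector r.
r = rng.normal(size=n)
V_w, V_pi = M_w @ r, M_pi @ r
ratio = M_w[:, w] / M_w[w, w]
A = V_w + ratio * (V_pi[w] - V_w[w]) - V_pi
corollary_holds = np.allclose(A, M_switch @ r - V_pi)
```

Note that the advantage is zero when the start state is already the subgoal ($s = w$), since the switch then happens immediately and the agent simply follows $\pi$.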

Algorithm: FB $\pi$ -Switch

The authors propose FB $\pi$ -Switch, a three-stage offline learning algorithm:

State-Successor Representation Learning: The algorithm learns action-free, reward-conditioned state-successor representations ( $F(s, z)$ and $B(s)$ ) using an expectile regression objective. This step marginalizes over actions and avoids the coupled optimization of policy and representation found in standard FB, enabling a single-step learning procedure.
High-Level Policy Learning: A high-level policy $\pi_h$ is trained to select latent subgoals $z_w$ by maximizing the FB-approximation of the switching advantage function using Advantage-Weighted Regression (AWR).
Low-Level Policy Learning: A low-level policy $\pi_\ell$ is trained to execute primitive actions conditioned on the selected subgoal, also using AWR.
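The two generic building blocks named in these stages can be sketched as follows. This is a minimal NumPy illustration of a standard expectile regression loss and of AWR-style exponentiated-advantage weights, not the paper's exact objectives; the hyperparameters `tau`, `temperature`, and `max_weight` and all function names are assumptions chosen for illustration.

```python
import numpy as np

def expectile_loss(pred, target, tau=0.7):
    """Asymmetric squared error: under-predictions are weighted tau, over-predictions 1 - tau."""
    diff = np.asarray(target) - np.asarray(pred)
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return float(np.mean(weight * diff ** 2))

def awr_weights(advantages, temperature=1.0, max_weight=100.0):
    """Exponentiated advantages, clipped from above for numerical stability."""
    return np.minimum(np.exp(np.asarray(advantages) / temperature), max_weight)

def awr_loss(log_probs, advantages, temperature=1.0):
    """Advantage-weighted negative log-likelihood of the dataset choices.

    For the high-level policy the 'choices' would be subgoals; for the
    low-level policy they would be primitive actions.
    """
    weights = awr_weights(advantages, temperature)
    return float(-np.mean(weights * np.asarray(log_probs)))

# Usage on toy numbers: samples with higher advantage dominate the loss.
log_probs = np.array([-1.0, -2.0, -0.5])
advantages = np.array([0.0, 1.0, -1.0])
loss = awr_loss(log_probs, advantages)
```

With `tau` above 0.5, the expectile loss pushes the prediction toward the upper range of the targets, which is the usual reason it is used for value-like quantities in offline settings.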

The method allows for hierarchical post-tuning, where the high-level policy can be added to pretrained behavior foundation models (BFMs) without retraining the low-level controller or the base representations.

Key Contributions

Switching Successor Measures: A principled framework for extracting hierarchical structure from successor-based representations. The paper proves that the necessary measures for high-level planning are derivable from a single classical successor measure, showing hierarchy is implicitly encoded.
FB $\pi$ -Switch Algorithm: A hierarchical zero-shot RL algorithm where both high-level subgoal selection and low-level control are derived directly from FB representations. The method follows a three-stage training procedure, with the high-level stage being compatible with existing FB algorithms.
Empirical Validation: Evaluation on both goal-conditioned and general reward-based tasks demonstrates that FB $\pi$ -Switch improves over non-hierarchical baselines and matches state-of-the-art hierarchical methods in goal-conditioned settings.

Experimental Results

The authors evaluated FB $\pi$ -Switch on discrete mazes, AntMaze (goal-conditioned), and AntMaze with general reward functions.

Goal-Conditioned Tasks (AntMaze): FB $\pi$ -Switch achieved performance comparable to HIQL, a leading hierarchical method. Notably, adding a high-level policy consistently improved performance over non-hierarchical variants. Even without hierarchy, FB $\pi$ -Switch outperformed other non-hierarchical baselines (e.g., standard FB, ICVF).
General Reward Tasks: In environments with distributed reward landscapes (not single-goal reaching), FB $\pi$ -Switch achieved the best average performance. The hierarchical variant showed improved robustness across environments.
Ablation and Analysis:
- The high-level policy in FB $\pi$ -Switch induces subgoals that lie along coherent trajectories toward the goal, whereas HIQL's subgoals often lead to similar immediate actions without path consistency.
- Experiments combining high-level policies from FB $\pi$ -Switch with low-level policies from standard FB showed that the quality of the low-level policy is crucial for realizing hierarchical benefits.
- The method successfully handles tasks with trade-offs between local and global objectives, extending beyond the single-target focus of traditional GCRL.

Significance and Claims

The paper claims that structured successor representations provide a flexible foundation for hierarchical zero-shot reinforcement learning that extends beyond goal-reaching tasks. By introducing switching successor measures, the authors demonstrate that hierarchical behavior can emerge directly from learned representations without requiring:

Expert trajectories.
Hand-crafted temporal abstractions.
Separate supervised objectives for hierarchy.

The work bridges local and global decision-making, suggesting that successor representations can naturally support the composition of behaviors across regions. The authors note that while the method is effective, the quality of the base model remains a dependency, and future work could explore multi-subgoal planning and comparisons with generative model-based methods. The framework is presented as a unified approach to derive hierarchical policies from successor representations, addressing the gap between representation learning and hierarchical control in zero-shot settings.
