Imagine you are teaching a robot to pilot a ship through the world's busiest shipping lane (the Singapore Strait). You have a massive logbook of how expert captains have sailed in the past, but you can't let the robot practice by crashing into other ships in real life. You need to teach it using only that old logbook.
This is the challenge of Offline Safe Reinforcement Learning. The robot needs to learn to get to its destination (maximize reward) without hitting anything or running out of fuel (safety constraints).
The problem with most current methods is that they try to balance these two goals at the same time, like trying to walk a tightrope while juggling. They are unstable, often lead the robot to take dangerous risks, or rely on complex math that breaks easily.
This paper introduces a new method called BCRL (Budget-Conditioned Reachability). Here is how it works, explained with simple analogies:
1. The "Gas Tank" Analogy (The Budget)
Imagine the robot has a fuel tank (a safety budget) at the start of every trip.
- The Old Way: The robot tries to drive fast to get to the goal, hoping it doesn't run out of gas. If it runs out, it crashes. The math tries to guess the perfect balance between speed and fuel, which is very hard to get right.
- The New Way (BCRL): The robot is given a strict rule: "You can never drive unless you have enough fuel left to guarantee you can reach a safe harbor."
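The budget rule above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function name and arguments are made up, and the worst-case cost-to-safety is assumed to come from some reachability model.

```python
# Toy sketch of the "gas tank" rule: an action is permitted only if the
# budget left AFTER paying for the action still covers the worst-case
# cost of retreating to a safe harbor from where the action lands.

def action_allowed(budget, step_cost, cost_to_safety_after):
    """budget: remaining safety budget ("fuel")
    step_cost: safety cost this action would incur
    cost_to_safety_after: worst-case cost to reach safety from the state
        the action leads to (assumed given by a reachability estimate)
    """
    return budget - step_cost >= cost_to_safety_after
```

For example, with 10 units of budget, a move costing 2 is allowed if safety is then reachable for 7, but forbidden if it would take 9.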
2. The "Safety Map" (Reachability)
Before the robot even starts driving, the system draws a Safety Map.
- This map marks every spot on the ocean where, if the robot is there with its current amount of fuel, it is guaranteed to be able to reach safety without crashing, no matter what happens next.
- If the robot is in a spot where it might run out of fuel before reaching safety, that spot is marked "Danger Zone."
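To make the "Safety Map" idea concrete, here is a deliberately simplified sketch on a one-dimensional corridor of ocean cells. It is not the paper's algorithm; it just shows the principle: compute each cell's worst-case cumulative cost to reach the harbor, then mark cells whose cost exceeds the budget as the Danger Zone.

```python
# Toy safety map on a 1-D corridor: cost_to_safety[i] is the cumulative
# cost of sailing from cell i to the harbor cell; a cell is "safe" for a
# given budget only if that cost fits within the budget.

def safety_map(step_costs, harbor, budget):
    """step_costs[i]: cost of moving through cell i toward the harbor.
    harbor: index of the safe cell.
    Returns a list of booleans: True means the cell is on the safe map."""
    n = len(step_costs)
    cost_to_safety = [0.0] * n
    for i in range(harbor - 1, -1, -1):   # cells left of the harbor
        cost_to_safety[i] = cost_to_safety[i + 1] + step_costs[i]
    for i in range(harbor + 1, n):        # cells right of the harbor
        cost_to_safety[i] = cost_to_safety[i - 1] + step_costs[i]
    return [c <= budget for c in cost_to_safety]
```

Note that the map depends on the budget: the same spot can be safe with a full tank and a Danger Zone with a near-empty one, which is exactly why the method conditions on the budget.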
3. The "Dynamic Budget" (The Magic Trick)
Here is the clever part. The robot doesn't just have a fixed amount of fuel; it has a dynamic budget.
- Every time the robot takes a step, it "spends" a little bit of its budget (based on how risky that move was).
- The system constantly updates the map: "Okay, you spent some fuel. Do you still have enough to reach safety from your new position?"
- If the answer is No, the robot is instantly forbidden from taking that step. It's like a traffic cop who says, "You can't turn left there; you won't have enough gas to get out of that alley."
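The dynamic-budget loop in steps above amounts to two tiny operations, sketched here with hypothetical names: shrink the budget by what the last move spent, then filter out any candidate action whose landing spot can no longer reach safety within what is left.

```python
# Toy sketch of the dynamic budget: spend, then re-check the map.

def remaining_budget(budget, cost_incurred):
    """After each step, the safety budget shrinks by the cost just paid."""
    return budget - cost_incurred

def filter_actions(candidates, budget):
    """candidates: list of (action, step_cost, cost_to_safety_after) tuples,
    where cost_to_safety_after is assumed to come from the safety map.
    Returns only the actions the 'traffic cop' allows."""
    return [a for a, c, s in candidates if budget - c >= s]
```

With a budget of 4, a turn costing 1 that strands the ship 5 units from safety is forbidden, while one that leaves it 2 units away is allowed.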
4. Why This is Better (Decoupling)
Think of learning to drive as having two teachers:
- Teacher A (The Reward Teacher): "Drive as fast as possible to get to the goal!"
- Teacher B (The Safety Teacher): "Don't crash! Stay within the safe zone!"
Old Methods: These teachers argue with each other constantly. Teacher A pushes the pedal, Teacher B slams the brakes. The robot gets confused, and the math gets unstable (like a min-max game where they try to outsmart each other).
BCRL Method:
- Step 1: We let Teacher B work alone first. We calculate the Safety Map and the Fuel Rules completely independently. We figure out exactly which moves are safe for any amount of fuel.
- Step 2: We lock the robot's steering wheel to only allow moves that Teacher B approves.
- Step 3: Now, Teacher A can teach the robot to drive as fast as possible, but only within the safe lanes Teacher B drew.
Because the safety rules are set in stone before the robot tries to be fast, there is no arguing. The robot learns faster, stays safe, and doesn't need to guess.
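The two-stage split can be caricatured in code. This is a stand-in sketch with made-up helpers, not the learned reachability and policy networks the paper would use: stage 1 fixes the safety rules from the offline logbook alone; stage 2 then maximizes reward only over actions those fixed rules approve, so there is no min-max tug-of-war.

```python
# Stage 1 (Teacher B, alone): estimate a worst-case cost-to-safety for
# each (state, action) pair seen in the offline logbook -- a crude
# stand-in for a learned reachability model.

def stage1_safety(dataset):
    cost_to_safety = {}
    for state, action, cost in dataset:
        key = (state, action)
        cost_to_safety[key] = max(cost_to_safety.get(key, 0.0), cost)
    return cost_to_safety

# Stage 2 (Teacher A, inside the fence): pick the highest-reward action
# among those the frozen safety rules allow for the current budget.

def stage2_reward(state, actions, rewards, cost_to_safety, budget):
    safe = [a for a in actions
            if cost_to_safety.get((state, a), float("inf")) <= budget]
    return max(safe, key=lambda a: rewards[a]) if safe else None
```

Unseen actions default to infinite cost and are never chosen, which mirrors the cautious treatment of out-of-data moves that offline methods need.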
The Real-World Test
The authors tested this on a real-world simulation of ships in the Singapore Strait.
- The Result: Their robot (BCRL) learned to navigate safely, avoiding close calls with other ships, while still getting to its destination efficiently.
- Comparison: It performed better than other state-of-the-art methods, which either crashed too often or were too slow and cautious. It also ran much faster (training took minutes instead of hours).
In a Nutshell
Instead of trying to solve the puzzle of "Speed vs. Safety" simultaneously, BCRL solves the "Safety" puzzle first to create a safe playground. Then, it lets the robot run wild inside that playground to learn how to be fast. It's like building a fence around a playground first, and then letting the kids play as hard as they want inside it, knowing they can't get hurt.