Safety-critical Control Under Partial Observability: Reach-Avoid POMDP meets Belief Space Control

This paper proposes a real-time, layered control architecture for safety-critical partially observable systems that decouples goal reaching, information gathering, and safety into modular components using learnable Belief Control Lyapunov Functions and conformal prediction-based Belief Control Barrier Functions, enabling efficient quadratic programming solutions that outperform existing solvers in both simulation and space-robotics experiments.

Matti Vahs, Joris Verhagen, Jana Tumova

Published Thu, 12 Ma

Imagine you are driving a car in a thick fog. You can't see the road ahead, you don't know exactly where you are, and you have a strict rule: you must reach a specific destination (the goal) without ever hitting a wall (the safety constraint).

This is the exact problem robots face in the real world. They have noisy sensors, imperfect maps, and they can't see everything. This paper proposes a new, smarter way to drive this "foggy car" safely and efficiently.

Here is the breakdown of their solution using simple analogies:

The Problem: The "All-in-One" Driver vs. The Fog

Traditionally, robot programmers tried to solve this by giving the robot a single "brain" that had to do three things at once:

  1. Drive to the goal.
  2. Avoid hitting walls.
  3. Figure out where it is (by moving around to get better sensor readings).

The authors argue that trying to do all three at the exact same speed is like asking a race car driver to also be a mechanic and a tour guide simultaneously. It's too much!

  • Safety needs to happen instantly (like slamming the brakes).
  • Finding the goal needs long-term planning (like plotting a route).
  • Gathering info happens at its own pace (like stopping to look at a map).

When you mix these conflicting speeds into one big calculation, the robot gets confused, moves too slowly, or makes dangerous mistakes.

The Solution: The "Layered Team"

The authors propose splitting the robot's brain into a layered team, where each member has a specific job and operates at their own speed. Think of it like a construction crew:

1. The Navigator (The Reference Controller)

  • Job: "Go to the green zone!"
  • How it works: This is the standard driver. It looks at the robot's best guess of where it is and points the car toward the goal. It doesn't worry about safety or fog; it just wants to get there.
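In control terms, the Navigator is just a feedback controller acting on the belief mean. A minimal sketch, assuming a simple proportional law (the `navigator` function, gain, and coordinates are illustrative, not the paper's actual reference controller):

```python
import numpy as np

def navigator(belief_mean, goal, gain=1.0):
    """Reference controller: point the commanded velocity at the goal.

    It acts only on the robot's best guess of its position (the belief
    mean) and ignores uncertainty and obstacles -- those are handled by
    the other layers.
    """
    return gain * (goal - belief_mean)

# Illustrative numbers: robot believes it is at (0, 0), goal at (3, 4).
u_ref = navigator(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```

Because it never looks at the fog, this layer stays cheap and can run at whatever rate the low-level control loop needs.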

2. The Detective (The BCLF - Belief Control Lyapunov Function)

  • Job: "Let's get a better look!"
  • The Analogy: Imagine the robot is in the fog. The Navigator wants to drive straight, but the Detective says, "Wait, if we drive this way, we might bump into a wall and learn exactly where we are."
  • How it works: This is the "Information Gathering" module. It uses a mathematical tool called a Lyapunov Function (think of it as an "uncertainty meter"). The robot moves in a way that lowers this meter. It learns that to reach the goal safely, it sometimes needs to take a detour to bump into a wall or look at a landmark to clear up the fog.
  • The Magic: They taught this "Detective" using Reinforcement Learning (trial and error). The robot learned that "bumping into a wall" is actually a good thing because it clears up the fog, allowing it to drive faster later.
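In the paper the BCLF is learned with reinforcement learning; as a hand-crafted stand-in, the "uncertainty meter" below is simply the trace of the belief covariance, and the Lyapunov-style check asks whether a candidate action shrinks it (the function names and the decrease rate `alpha` are illustrative assumptions):

```python
import numpy as np

def uncertainty_meter(cov):
    """Hand-crafted stand-in for the learned BCLF: total position
    uncertainty, measured as the trace of the belief covariance."""
    return np.trace(cov)

def decreases_uncertainty(cov_now, cov_next, alpha=0.1):
    """Lyapunov-style decrease condition: a candidate action counts as
    'informative' if it shrinks the meter by at least a factor alpha."""
    return uncertainty_meter(cov_next) <= (1 - alpha) * uncertainty_meter(cov_now)

# Bumping into a wall collapses uncertainty along one axis:
cov_foggy  = np.diag([1.0, 1.0])   # lost in the fog
cov_bumped = np.diag([1.0, 0.05])  # wall contact pins down one coordinate
informative = decreases_uncertainty(cov_foggy, cov_bumped)
```

The learned BCLF plays the same role: it scores beliefs, and the controller prefers actions whose predicted next belief scores lower.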

3. The Safety Guard (The BCBF - Belief Control Barrier Function)

  • Job: "STOP! That's a cliff!"
  • The Analogy: This is the ultimate safety net. Even if the Navigator wants to drive fast and the Detective wants to explore, the Safety Guard has the final say.
  • How it works: It uses a tool called Conformal Prediction. Imagine the robot has 1,000 "ghosts" (particles) representing where it might be. The Safety Guard checks all 1,000 ghosts. If too many of them are about to hit a wall, the Guard instantly tweaks the steering wheel to keep (nearly) all of them safe, with a tunable probability guarantee. It doesn't just check "right now"; it guarantees safety for the entire trip ahead.
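A rough sketch of the conformal idea over the particle "ghosts": score each particle by its distance to the wall, then take a conservative low quantile of the scores as the safety margin that covers at least a (1 − eps) fraction of the hypotheses. The wall geometry, `eps`, and function names here are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def conformal_margin(particles, wall_x, eps=0.05):
    """Conformal-style safety margin over a particle belief.

    Each particle is one hypothesis of where the robot is. Its signed
    distance to the wall (at x = wall_x, safe side x < wall_x) is the
    'score'; the conservative eps-quantile of the scores is a margin
    valid for at least a (1 - eps) fraction of the hypotheses.
    """
    scores = wall_x - particles[:, 0]        # signed distance, > 0 means safe
    n = len(scores)
    k = int(np.ceil(eps * (n + 1)))          # conservative conformal rank
    return np.sort(scores)[max(k - 1, 0)]    # k-th smallest score

rng = np.random.default_rng(0)
ghosts = rng.normal([2.0, 0.0], 0.1, size=(1000, 2))  # 1000 position hypotheses
margin = conformal_margin(ghosts, wall_x=3.0)
is_safe = margin > 0.0
```

When this margin goes to zero, the Guard intervenes; as long as it stays positive, the other layers are free to act.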

How They Work Together

The system works like a relay race with a referee:

  1. The Navigator says, "Drive North!"
  2. The Detective says, "Actually, let's drive North-East to bump into that wall so we know where we are."
  3. The Safety Guard checks: "If we drive North-East, will any of our 1,000 ghosts hit a wall?"
    • If Yes: The Guard tweaks the steering slightly to keep everyone safe, but still lets the robot move.
    • If No: The robot drives exactly as the Detective suggested.
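The referee step above can be sketched as a minimum-norm safety filter: keep the Detective's command unless its wall-ward component would use up the conformal margin in one step, and then trim only the excess. The closed-form projection here stands in for the paper's quadratic program, and all names and numbers are illustrative:

```python
import numpy as np

def safety_filter(u_desired, direction_to_wall, margin, dt=0.1):
    """Minimal stand-in for the safety QP: stay as close as possible to
    the desired command while capping the approach speed toward the
    wall so the conformal margin cannot be consumed in one time step.

    `direction_to_wall` is a unit vector; `margin` is the worst-case
    distance to the wall over the particle cloud.
    """
    toward = float(u_desired @ direction_to_wall)  # speed toward the wall
    limit = margin / dt                            # max allowed approach speed
    if toward <= limit:
        return u_desired                           # already safe: no change
    # Minimum-norm correction: remove only the excess wall-ward component.
    return u_desired - (toward - limit) * direction_to_wall

u_detective = np.array([4.0, 3.0])   # "drive north-east to bump the wall"
wall_dir    = np.array([1.0, 0.0])   # the wall lies to the east
u_safe = safety_filter(u_detective, wall_dir, margin=0.2)
```

Note that the filter does not veto the plan; it slows the wall-ward component just enough, which is exactly the "tweaks the steering slightly, but still lets the robot move" behavior described above.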

Why This is a Big Deal

  • It's Fast: Instead of solving one giant, impossible math problem, they solve three small, easy problems. This allows the robot to make decisions in real-time, even with thousands of "ghosts" (particles) tracking its location.
  • It's Reusable: The "Detective" (the part that learns how to clear the fog) doesn't need to be retrained if the goal changes. If you move the green goal zone, you just tell the Navigator to go there; the Detective still knows how to clear the fog.
  • It Works in Real Life: They tested this on a real robot that floats on air cushions (simulating a space robot). The robot had to navigate a room by bumping into walls to find its way. In their experiments, the robot reached the goal safely even though it was essentially "blind" for most of the trip.

The Bottom Line

This paper teaches robots how to be smart about their own ignorance. Instead of panicking when they can't see, they have a structured plan:

  1. Detective: "Let's move to learn more."
  2. Navigator: "Let's move toward the goal."
  3. Safety Guard: "I'll make sure we don't crash while doing both."

By separating these jobs, the robot becomes faster, safer, and much better at navigating the unknown.