When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift

This paper proposes augmenting Proximal Policy Optimization (PPO) with temporal sequence models, particularly Transformers, so that reinforcement learning agents stay robust under sensor drift and partial observability by inferring missing information from history. The claim is backed by theoretical bounds on reward degradation and by empirical results on MuJoCo benchmarks.

Kevin Vogt-Lowell, Theodoros Tsiligkaridis, Rodney Lafuente-Mercado, Surabhi Ghatti, Shanghua Gao, Marinka Zitnik, Daniela Rus

Published 2026-03-06

Imagine you are teaching a robot dog to run a marathon. In a perfect video game, the robot has perfect eyes and ears; it sees every tree, feels every bump in the road, and knows exactly where it is. But in the real world, things go wrong. Maybe the camera lens gets smudged with mud, or the GPS signal drops out in a tunnel, or a sensor just decides to take a nap.

This paper is about teaching robots (specifically, AI agents using a method called PPO) how to keep running the marathon even when their "eyes" and "ears" start failing.

Here is the breakdown of their solution, using some everyday analogies:

1. The Problem: The "Amnesia" Robot

Most standard AI robots are like people with short-term memory loss. They only look at what is happening right now.

  • The Scenario: Imagine your robot is balancing on a tightrope. Suddenly, its left-eye camera goes black (sensor failure).
  • The Old Way: A standard robot (using an MLP, a memoryless feed-forward network) panics. It sees "black" and thinks, "I have no idea where I am!" It freezes or falls because it can't remember that it was leaning left five seconds ago.
  • The Reality: In the real world, sensors don't just fail once and fix themselves instantly. They often fail in clusters (like a whole group of sensors on a car losing power at once) and stay broken for a while. This is called "sensor drift."

2. The Solution: Giving the Robot a "Diary"

The authors decided to give the robot a memory. Instead of just looking at the current frame, the robot looks at a timeline of what happened in the last few seconds.

They tested three different ways to give the robot this memory:

  • The RNN/SSM (The "Recurrent" Memory): This is like a robot that tries to remember the past by whispering a summary of the last second to itself before looking at the next one. It's efficient, but if the whisper gets garbled (because a sensor failed), the whole chain of memory can get messed up.
  • The Transformer (The "Super-Searcher"): This is the star of the show. Imagine a robot that doesn't just whisper to itself. Instead, it has a giant whiteboard where it writes down everything that happened in the last minute. When it needs to make a decision, it doesn't just guess; it scans the whiteboard.
    • The Magic: If the "left eye" sensor is broken, the robot looks at the whiteboard, sees that the "left eye" was working fine 3 seconds ago, and says, "Okay, I know what my left eye saw back then, so I can guess what it's seeing now." It can skip over the broken parts and focus on the good data.
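The "look back at the whiteboard" idea can be illustrated with a toy numpy sketch. This is not the paper's architecture (a learned Transformer would discover this behavior through attention weights); it is a hand-written stand-in showing the core trick: when a sensor reading is invalid now, fill it in with a recency-weighted average of that sensor's past valid readings. The function name and weighting scheme are illustrative assumptions.

```python
import numpy as np

def impute_from_history(history, valid, temp=1.0):
    """Toy stand-in for attention over an observation history.

    For each sensor dimension, replace a failed reading at the
    current (last) step with a recency-weighted softmax average of
    that sensor's past valid readings.

    history: (T, D) array of raw sensor readings
    valid:   (T, D) boolean array, False where the sensor had failed
    """
    T, D = history.shape
    current = history[-1].copy()
    # recency scores: more recent frames get higher weight
    scores = np.arange(T, dtype=float) / temp
    for d in range(D):
        if valid[-1, d]:
            continue  # sensor working: keep the live reading
        ok = valid[:, d]
        if not ok.any():
            current[d] = 0.0  # never observed: fall back to zero
            continue
        # softmax over the recency scores of the valid frames only
        w = np.exp(scores[ok] - scores[ok].max())
        w /= w.sum()
        current[d] = float(w @ history[ok, d])
    return current
```

A trained Transformer does something richer, since its attention weights are learned jointly with the policy and can key on context rather than pure recency, but the masking logic is the same: broken entries are simply excluded from the pool the model attends over.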

3. The Experiment: The "Blindfold" Test

The researchers put these robots in a virtual gym (MuJoCo) with tasks like running, hopping, and walking.

  • The Setup: They simulated a disaster where up to 60% of the sensors were randomly broken or covered in mud.
  • The Results:
    • The Standard Robot (MLP) fell apart immediately. Without perfect vision, it couldn't figure out how to move.
    • The Whispering Robots (RNN/SSM) tried their best but often got confused when the "whisper" was interrupted by sensor failure.
    • The Super-Searcher (Transformer) kept running. Even with half its sensors broken, it used its "whiteboard" (history) to fill in the gaps. It was the only one that stayed upright and kept moving forward.
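The summary doesn't spell out the exact corruption protocol, but the key properties it describes are that failures arrive in clusters of sensors and persist over time rather than flickering on and off. A plausible sketch of such a mask generator (all parameter names and the geometric-recovery model are assumptions, not the paper's specification):

```python
import numpy as np

def sensor_failure_masks(n_steps, n_sensors, p_fail=0.02,
                         p_recover=0.1, cluster_size=3, rng=None):
    """Generate per-step boolean masks (True = sensor working).

    Failures hit contiguous clusters of sensors together and then
    persist, with each broken sensor recovering independently per
    step -- so outage lengths are geometrically distributed.
    """
    rng = np.random.default_rng(rng)
    broken = np.zeros(n_sensors, dtype=bool)
    masks = np.empty((n_steps, n_sensors), dtype=bool)
    for t in range(n_steps):
        # occasionally, a cluster of adjacent sensors fails at once
        if rng.random() < p_fail:
            start = rng.integers(0, n_sensors)
            broken[start:start + cluster_size] = True
        # each broken sensor independently recovers this step
        broken &= ~(rng.random(n_sensors) < p_recover)
        masks[t] = ~broken
    return masks
```

Applying such masks to the observation vector at every step (zeroing or freezing the masked entries) is what turns a standard MuJoCo task into the "blindfold" test: the memoryless MLP has no way to bridge an outage, while a history-conditioned policy can.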

4. The Math: Why It Works

The authors didn't just guess; they did the math. They proved a "safety guarantee."

  • Think of it like a structural safety rating: they derived a worst-case bound on how much reward the robot can lose when sensors fail.
  • They found that the size of that worst-case loss depends on two things:
    1. How smooth the robot's brain is: If the robot makes tiny, gentle adjustments rather than wild jumps, it's safer.
    2. How long the sensors stay broken: If sensors fail for a long time, the robot needs a better memory.
  • The math showed that the Transformer approach is the most robust way to handle these "bad weather" conditions.
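This summary does not reproduce the paper's actual theorem, but bounds of this flavor typically combine the two factors above multiplicatively. As a hedged sketch in generic notation (none of these symbols or constants are taken from the paper): if the policy is $L_\pi$-Lipschitz in its observations, failures perturb observations by at most $\epsilon$, and outages last at most $T$ consecutive steps, the drop in expected discounted return $J$ is bounded by something of the form

$$
\bigl|\, J_{\text{clean}}(\pi) - J_{\text{corrupted}}(\pi) \,\bigr|
\;\le\; C \cdot \frac{L_\pi \, \epsilon \, T}{1 - \gamma},
$$

where $\gamma$ is the discount factor and $C$ absorbs constants of the environment dynamics. Smoother policies (small $L_\pi$) and shorter outages (small $T$) both tighten the bound, matching the two-factor dependence described above.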

The Big Takeaway

In the real world, things break. Sensors get dirty, networks lag, and data gets lost.

  • Old AI: "I can't see, so I stop."
  • New AI (Transformer-based): "I can't see right now, but I remember what I saw a moment ago, and I can guess what's happening. I'll keep going."

This paper proves that giving AI agents a temporal sequence model (a way to reason about time and history) is the secret sauce for making them reliable in the messy, unpredictable real world. It's the difference between a robot that trips over a pebble and a robot that knows how to step over it, even if it can't see the pebble clearly.
