Dynamic Deep-Reinforcement-Learning Algorithm in Partially Observable Markov Decision Processes

This paper proposes three novel deep reinforcement learning architectures for Partially Observable Markov Decision Processes (POMDPs) that incorporate action trajectories into recurrent neural networks. It focuses on the H-TD3 algorithm, which reuses hidden states from the actor network to train the critic, improving computational efficiency while maintaining performance.

Saki Omi, Hyo-Sang Shin, Namhoon Cho, Antonios Tsourdos

Published 2026-03-04

Imagine you are trying to learn how to drive a car, but there's a catch: your windshield is foggy, your speedometer is broken, and sometimes the road signs are lying to you.

This is the real-world problem this paper tackles. In the world of Artificial Intelligence (AI), this is called a Partially Observable Markov Decision Process (POMDP). The AI agent (the driver) can't see the whole truth; it only sees a blurry, noisy version of reality.

Here is a simple breakdown of how the researchers at Cranfield University fixed this problem, using some creative analogies.

1. The Problem: The "Foggy Windshield"

Most AI training happens in a perfect world (like a video game) where the AI sees everything clearly. But in the real world, sensors fail, noise interferes, and data gets lost.

  • The Old Way: Previous AI methods tried to guess the truth by looking only at what they saw right now or a short history of what they saw. It's like trying to drive in fog by only looking at the bumper of the car in front of you.
  • The Missing Piece: The researchers realized that what you do (your actions) is just as important as what you see. If you turn the steering wheel (action) and the car doesn't move (observation), you know something is wrong with the road or the car, not your eyes.
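The "write down what you did, not just what you saw" idea can be made concrete with a tiny rolling buffer. This is an illustrative sketch, not the paper's code; the class name `HistoryBuffer` and the flat-vector encoding are our own choices.

```python
from collections import deque


class HistoryBuffer:
    """Rolling window of (action, observation) pairs.

    The key idea from the paper: the agent stores what it DID alongside
    what it SAW, so a mismatch between the two (steered, but didn't turn)
    can reveal hidden state that observations alone cannot.
    """

    def __init__(self, maxlen=8):
        self.pairs = deque(maxlen=maxlen)

    def record(self, action, observation):
        self.pairs.append((list(action), list(observation)))

    def as_input(self):
        # Flatten the window into one feature vector for a policy network.
        flat = []
        for action, observation in self.pairs:
            flat.extend(action)
            flat.extend(observation)
        return flat


buf = HistoryBuffer(maxlen=2)
buf.record(action=[0.5], observation=[1.0, 0.1])
buf.record(action=[-0.2], observation=[0.9, 0.0])
print(buf.as_input())  # [0.5, 1.0, 0.1, -0.2, 0.9, 0.0]
```

An observation-only agent would see `[1.0, 0.1, 0.9, 0.0]` and have no way to know whether the small change was caused by its own steering or by the environment.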

2. The Solution: The "Memory Notebook" (RNNs)

To handle the fog, the AI needs a memory. The paper uses a type of recurrent neural network called an LSTM (Long Short-Term Memory), which acts like a notebook where the agent writes down its history.

  • The Insight: The researchers found that if you only write down what you saw in your notebook, you miss the story. You need to write down what you did and what happened together.
  • The Analogy: Imagine a detective solving a crime.
    • Bad Detective: Only looks at the crime scene photos (Observations).
    • Good Detective: Looks at the photos AND writes down exactly what steps they took to get there (Actions).
    • Result: The "Good Detective" (the new AI) solves the mystery much faster and more accurately because it understands the cause and effect.
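A minimal way to see why actions belong in the memory: below is a toy one-number recurrent cell (a pure-Python stand-in for an LSTM; the weights are arbitrary constants we picked, not learned). Two agents see identical observations but take different actions, and end up with different memories.

```python
import math


def rnn_step(hidden, observation, action, w_h=0.5, w_o=0.3, w_a=0.3):
    """One step of a toy recurrent memory.

    The hidden state is updated from BOTH the latest observation and the
    latest action, mirroring the paper's point that action trajectories
    belong in the recurrent input.
    """
    return math.tanh(w_h * hidden + w_o * observation + w_a * action)


# Same observations, different actions -> different hidden states.
h1 = h2 = 0.0
for observation, action1, action2 in [(1.0, 1.0, -1.0), (1.0, 1.0, -1.0)]:
    h1 = rnn_step(h1, observation, action1)
    h2 = rnn_step(h2, observation, action2)

print(h1 != h2)  # True: the actions changed what the agent remembers
```

An observation-only cell (drop the `w_a * action` term) would leave `h1 == h2`, so the agent could never distinguish "I turned and nothing happened" from "I did nothing".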

3. Three New Architectures (Three Ways to Organize the Notebook)

The team tested three different ways to structure this "notebook" system to make the AI smarter and faster.

A. The "Unified Stream" (LSTM-TD3 1h1c / 1h2c)

  • The Old Way: The AI had two separate windows: one for the past history and one for the current moment. It was like reading a book where the past chapters were in a different language than the current chapter.
  • The New Way: They combined everything into one single stream. The AI reads the past and the present as one continuous story.
  • Why it works: It treats time as a smooth flow rather than a broken sequence. This helps the AI understand that "Action A at 10:00 AM" caused "Result B at 10:01 AM."
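The difference between the two input layouts can be sketched in a few lines. This is our schematic reading of the unified-stream idea, not the paper's implementation; the function names are hypothetical.

```python
def two_stream_input(history, current):
    """Old style: the past and the current step feed separate branches."""
    return history, [current]


def unified_stream_input(history, current):
    """Unified style: the current step is appended to the history so one
    recurrent pass reads a single continuous (action, observation) sequence."""
    return history + [current]


past = [("a0", "o0"), ("a1", "o1")]
now = ("a2", "o2")

print(unified_stream_input(past, now))
# [('a0', 'o0'), ('a1', 'o1'), ('a2', 'o2')] -- one unbroken story
```

With one stream, the recurrent network sees each action flow directly into the observation it caused, instead of having to stitch two separately processed inputs back together.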

B. The "H-TD3" (The Smart Shortcut)

This is the paper's most exciting invention.

  • The Problem: Usually, the AI has two brains:
    1. The Actor: Decides what to do.
    2. The Critic: Judges how good that decision was.
    In complex environments, both brains have to read the entire history notebook from scratch every time. This is slow and computationally expensive (like two people reading the same 500-page book separately to write a review).
  • The H-TD3 Fix: The "Actor" reads the book and summarizes the story into a short note (a "hidden state"). It then hands this note to the "Critic."
  • The Analogy: Instead of the Critic re-reading the whole book, the Actor says, "Here is the summary of what happened so far." The Critic just reads the summary and the current situation.
  • Benefit: It's much faster (saves time) and uses less computer power, while still making almost as good decisions as the slow method.
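The "summary handoff" can be sketched as follows. These are toy scalar stand-ins (the placeholder policy and value functions are ours); the point is only the data flow: the critic consumes the actor's hidden state instead of replaying the history itself.

```python
class Actor:
    """Toy actor: 'reads the book' and keeps a running summary (hidden state)."""

    def __init__(self):
        self.hidden = 0.0

    def act(self, observation):
        # Fold the new observation into the summary, then pick an action.
        self.hidden = 0.9 * self.hidden + 0.1 * observation
        action = -self.hidden  # placeholder policy, not a learned one
        return action, self.hidden


class Critic:
    """Toy critic: scores (summary, action) WITHOUT re-reading the history."""

    def value(self, hidden, action):
        return -((hidden + action) ** 2)  # placeholder value estimate


actor, critic = Actor(), Critic()
for observation in [1.0, 0.8, 1.2]:
    action, h = actor.act(observation)
    q = critic.value(h, action)  # the critic reuses the actor's hidden state
```

The computational saving is that the history is processed once, by the actor; a conventional recurrent critic would carry its own hidden state and process every step of the trajectory a second time.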

4. The Results: Driving Through the Storm

The researchers tested these new methods in a classic "Pendulum" control simulation (swinging a pole upright and keeping it balanced) under five different "storm" conditions:

  1. Constant Bias: The sensors are always lying by a fixed amount.
  2. Waves: The sensors lie in a rhythmic pattern.
  3. Random Waves: The sensors lie in unpredictable patterns.
  4. Static Noise: The sensors are just fuzzy (like TV static).
  5. Hidden Info: A key piece of data (speed) is completely missing.
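The five corruptions above can be sketched as transformations of the true pendulum state. Magnitudes, waveforms, and the state layout `[angle, angular_velocity]` here are illustrative choices of ours, not the paper's exact settings.

```python
import math
import random

_rng = random.Random(0)  # seeded for reproducibility


def corrupt(true_state, t, mode, rng=_rng):
    """Apply one of the five observation 'storms' to the true state.

    true_state is [angle, angular_velocity]; returns what the agent sees.
    """
    angle, velocity = true_state
    if mode == "constant_bias":   # sensors always off by a fixed amount
        return [angle + 0.2, velocity + 0.2]
    if mode == "waves":           # rhythmic, periodic bias
        return [angle + 0.2 * math.sin(t), velocity]
    if mode == "random_waves":    # periodic bias with random amplitude/phase
        return [angle + rng.uniform(0.1, 0.3) * math.sin(t + rng.random()),
                velocity]
    if mode == "static_noise":    # fuzzy, zero-mean Gaussian noise
        return [angle + rng.gauss(0, 0.1), velocity + rng.gauss(0, 0.1)]
    if mode == "hidden_info":     # velocity is simply never observed
        return [angle]
    raise ValueError(f"unknown mode: {mode}")


print(corrupt([0.5, 1.0], t=0, mode="hidden_info"))  # [0.5]
```

The last mode is the most telling for the paper's thesis: with velocity missing, an agent can only recover it by remembering how the angle responded to its own past torques.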

The Findings:

  • Action Matters: In almost every "storm," the AI that included actions in its memory learned much faster and drove better than the one that only looked at observations.
  • Robustness: The new methods could handle the "storms" that broke the old AI.
  • Speed: The H-TD3 algorithm was the fastest to train, proving you don't need to be slow to be smart.

Summary

Think of this paper as teaching an AI driver how to drive in a blizzard.

  1. Don't just look; remember what you did. (Include actions in the memory).
  2. Read the story as one continuous flow. (Unify the data streams).
  3. Don't make the judge re-read the whole book. (Let the actor summarize the history for the critic).

By doing this, the AI becomes more robust (handles bad data better) and more efficient (learns faster), bringing us one step closer to robots that can actually work in our messy, unpredictable real world.
