Dynamic Deep-Reinforcement-Learning Algorithm in Partially Observable Markov Decision Processes

This paper proposes three novel deep reinforcement learning architectures for Partially Observable Markov Decision Processes (POMDPs) that incorporate action trajectories into recurrent neural networks. It focuses on the H-TD3 algorithm, which reuses hidden states from the actor network to train the critic, improving computational efficiency while maintaining performance.

Saki Omi, Hyo-Sang Shin, Namhoon Cho, Antonios Tsourdos

Published 2026-03-04

Imagine you are trying to learn how to drive a car, but there's a catch: your windshield is foggy, your speedometer is broken, and sometimes the road signs are lying to you.

This is the real-world problem this paper tackles. In the world of Artificial Intelligence (AI), this is called a Partially Observable Markov Decision Process (POMDP). The AI agent (the driver) can't see the whole truth; it only sees a blurry, noisy version of reality.

Here is a simple breakdown of how the researchers at Cranfield University fixed this problem, using some creative analogies.

1. The Problem: The "Foggy Windshield"

Most AI training happens in a perfect world (like a video game) where the AI sees everything clearly. But in the real world, sensors fail, noise interferes, and data gets lost.

  • The Old Way: Previous AI methods tried to guess the truth by looking only at what they saw right now or a short history of what they saw. It's like trying to drive in fog by only looking at the bumper of the car in front of you.
  • The Missing Piece: The researchers realized that what you do (your actions) is just as important as what you see. If you turn the steering wheel (action) and the car doesn't move (observation), you know something is wrong with the road or the car, not your eyes.
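The "write down what you did, not just what you saw" idea can be made concrete with a tiny rolling buffer. This is an illustrative sketch, not the paper's code; the class name `HistoryBuffer` and the flat-vector encoding are our own choices.

```python
from collections import deque


class HistoryBuffer:
    """Rolling window of (action, observation) pairs.

    The key idea from the paper: the agent stores what it DID alongside
    what it SAW, so a mismatch between the two (steered, but didn't turn)
    can reveal hidden state that observations alone cannot.
    """

    def __init__(self, maxlen=8):
        self.pairs = deque(maxlen=maxlen)

    def record(self, action, observation):
        self.pairs.append((list(action), list(observation)))

    def as_input(self):
        # Flatten the window into one feature vector for a policy network.
        flat = []
        for action, observation in self.pairs:
            flat.extend(action)
            flat.extend(observation)
        return flat


buf = HistoryBuffer(maxlen=2)
buf.record(action=[0.5], observation=[1.0, 0.1])
buf.record(action=[-0.2], observation=[0.9, 0.0])
print(buf.as_input())  # [0.5, 1.0, 0.1, -0.2, 0.9, 0.0]
```

An observation-only agent would see `[1.0, 0.1, 0.9, 0.0]` and have no way to know whether the small change was caused by its own steering or by the environment.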

2. The Solution: The "Memory Notebook" (RNNs)

To handle the fog, the AI needs a memory. The paper uses a type of recurrent neural network called an LSTM (Long Short-Term Memory), which acts like a notebook where the agent writes down its history.

  • The Insight: The researchers found that if you only write down what you saw in your notebook, you miss the story. You need to write down what you did and what happened together.
  • The Analogy: Imagine a detective solving a crime.
    • Bad Detective: Only looks at the crime scene photos (Observations).
    • Good Detective: Looks at the photos AND writes down exactly what steps they took to get there (Actions).
    • Result: The "Good Detective" (the new AI) solves the mystery much faster and more accurately because it understands the cause and effect.
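A minimal way to see why actions belong in the memory: below is a toy one-number recurrent cell (a pure-Python stand-in for an LSTM; the weights are arbitrary constants we picked, not learned). Two agents see identical observations but take different actions, and end up with different memories.

```python
import math


def rnn_step(hidden, observation, action, w_h=0.5, w_o=0.3, w_a=0.3):
    """One step of a toy recurrent memory.

    The hidden state is updated from BOTH the latest observation and the
    latest action, mirroring the paper's point that action trajectories
    belong in the recurrent input.
    """
    return math.tanh(w_h * hidden + w_o * observation + w_a * action)


# Same observations, different actions -> different hidden states.
h1 = h2 = 0.0
for observation, action1, action2 in [(1.0, 1.0, -1.0), (1.0, 1.0, -1.0)]:
    h1 = rnn_step(h1, observation, action1)
    h2 = rnn_step(h2, observation, action2)

print(h1 != h2)  # True: the actions changed what the agent remembers
```

An observation-only cell (drop the `w_a * action` term) would leave `h1 == h2`, so the agent could never distinguish "I turned and nothing happened" from "I did nothing".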

3. Three New Architectures (Three Ways to Organize the Notebook)

The team tested three different ways to structure this "notebook" system to make the AI smarter and faster.

A. The "Unified Stream" (LSTM-TD3 1h1c / 1h2c)

  • The Old Way: The AI had two separate windows: one for the past history and one for the current moment. It was like reading a book where the past chapters were in a different language than the current chapter.
  • The New Way: They combined everything into one single stream. The AI reads the past and the present as one continuous story.
  • Why it works: It treats time as a smooth flow rather than a broken sequence. This helps the AI understand that "Action A at 10:00 AM" caused "Result B at 10:01 AM."
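The difference between the two input layouts can be sketched in a few lines. This is our schematic reading of the unified-stream idea, not the paper's implementation; the function names are hypothetical.

```python
def two_stream_input(history, current):
    """Old style: the past and the current step feed separate branches."""
    return history, [current]


def unified_stream_input(history, current):
    """Unified style: the current step is appended to the history so one
    recurrent pass reads a single continuous (action, observation) sequence."""
    return history + [current]


past = [("a0", "o0"), ("a1", "o1")]
now = ("a2", "o2")

print(unified_stream_input(past, now))
# [('a0', 'o0'), ('a1', 'o1'), ('a2', 'o2')] -- one unbroken story
```

With one stream, the recurrent network sees each action flow directly into the observation it caused, instead of having to stitch two separately processed inputs back together.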

B. The "H-TD3" (The Smart Shortcut)

This is the paper's most exciting invention.

  • The Problem: Usually, the AI has two brains:
    1. The Actor: Decides what to do.
    2. The Critic: Judges how good that decision was.
    In complex environments, both brains have to read the entire history notebook from scratch every time. This is slow and computationally expensive (like two people reading the same 500-page book separately to write a review).
  • The H-TD3 Fix: The "Actor" reads the book and summarizes the story into a short note (a "hidden state"). It then hands this note to the "Critic."
  • The Analogy: Instead of the Critic re-reading the whole book, the Actor says, "Here is the summary of what happened so far." The Critic just reads the summary and the current situation.
  • Benefit: It's much faster (saves time) and uses less computer power, while still making almost as good decisions as the slow method.
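The "summary handoff" can be sketched as follows. These are toy scalar stand-ins (the placeholder policy and value functions are ours); the point is only the data flow: the critic consumes the actor's hidden state instead of replaying the history itself.

```python
class Actor:
    """Toy actor: 'reads the book' and keeps a running summary (hidden state)."""

    def __init__(self):
        self.hidden = 0.0

    def act(self, observation):
        # Fold the new observation into the summary, then pick an action.
        self.hidden = 0.9 * self.hidden + 0.1 * observation
        action = -self.hidden  # placeholder policy, not a learned one
        return action, self.hidden


class Critic:
    """Toy critic: scores (summary, action) WITHOUT re-reading the history."""

    def value(self, hidden, action):
        return -((hidden + action) ** 2)  # placeholder value estimate


actor, critic = Actor(), Critic()
for observation in [1.0, 0.8, 1.2]:
    action, h = actor.act(observation)
    q = critic.value(h, action)  # the critic reuses the actor's hidden state
```

The computational saving is that the history is processed once, by the actor; a conventional recurrent critic would carry its own hidden state and process every step of the trajectory a second time.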

4. The Results: Driving Through the Storm

The researchers tested these new methods in a classic "Pendulum" control simulation (swinging a pole upright and keeping it balanced) under five different "storm" conditions:

  1. Constant Bias: The sensors are always lying by a fixed amount.
  2. Waves: The sensors lie in a rhythmic pattern.
  3. Random Waves: The sensors lie in unpredictable patterns.
  4. Static Noise: The sensors are just fuzzy (like TV static).
  5. Hidden Info: A key piece of data (speed) is completely missing.
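The five corruptions above can be sketched as transformations of the true pendulum state. Magnitudes, waveforms, and the state layout `[angle, angular_velocity]` here are illustrative choices of ours, not the paper's exact settings.

```python
import math
import random

_rng = random.Random(0)  # seeded for reproducibility


def corrupt(true_state, t, mode, rng=_rng):
    """Apply one of the five observation 'storms' to the true state.

    true_state is [angle, angular_velocity]; returns what the agent sees.
    """
    angle, velocity = true_state
    if mode == "constant_bias":   # sensors always off by a fixed amount
        return [angle + 0.2, velocity + 0.2]
    if mode == "waves":           # rhythmic, periodic bias
        return [angle + 0.2 * math.sin(t), velocity]
    if mode == "random_waves":    # periodic bias with random amplitude/phase
        return [angle + rng.uniform(0.1, 0.3) * math.sin(t + rng.random()),
                velocity]
    if mode == "static_noise":    # fuzzy, zero-mean Gaussian noise
        return [angle + rng.gauss(0, 0.1), velocity + rng.gauss(0, 0.1)]
    if mode == "hidden_info":     # velocity is simply never observed
        return [angle]
    raise ValueError(f"unknown mode: {mode}")


print(corrupt([0.5, 1.0], t=0, mode="hidden_info"))  # [0.5]
```

The last mode is the most telling for the paper's thesis: with velocity missing, an agent can only recover it by remembering how the angle responded to its own past torques.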

The Findings:

  • Action Matters: In almost every "storm," the AI that included actions in its memory learned much faster and drove better than the one that only looked at observations.
  • Robustness: The new methods could handle the "storms" that broke the old AI.
  • Speed: The H-TD3 algorithm was the fastest to train, proving you don't need to be slow to be smart.

Summary

Think of this paper as teaching an AI driver how to drive in a blizzard.

  1. Don't just look; remember what you did. (Include actions in the memory).
  2. Read the story as one continuous flow. (Unify the data streams).
  3. Don't make the judge re-read the whole book. (Let the actor summarize the history for the critic).

By doing this, the AI becomes more robust (handles bad data better) and more efficient (learns faster), bringing us one step closer to robots that can actually work in our messy, unpredictable real world.
