Quantum framework for Reinforcement Learning: Integrating Markov decision process, quantum arithmetic, and trajectory search

This paper proposes a fully quantum framework for reinforcement learning that integrates quantum Markov decision processes, arithmetic, and trajectory search to eliminate classical computations and demonstrate enhanced decision-making efficiency through quantum superposition.

Original authors: Thet Htar Su, Shaswot Shresthamali, Masaaki Kondo

Published 2026-04-23

CC BY 4.0

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to teach a robot how to navigate a giant, confusing maze to find the treasure. This is the basic idea of Reinforcement Learning (RL). The robot (the "agent") tries different paths, gets points (rewards) for good moves, and loses points for hitting walls. Over time, it learns the best route.

However, in the real world, mazes can be incredibly complex. If the maze is huge, a normal computer has to try every single path one by one, like a person walking through the maze, hitting a dead end, walking back, and trying the next door. This takes a long time and a lot of energy.

This paper proposes a radical new way to solve this problem using Quantum Computing. Instead of walking the maze one path at a time, the authors built a "Quantum Robot" that can walk every possible path at the exact same time.

Here is a simple breakdown of how they did it, using some creative analogies:

1. The Old Way vs. The Quantum Way

The Classical Way (The Single Hiker): Imagine a hiker trying to find the best route through a forest. They pick a path, walk it, see if it's good, go back, and try another. If there are a million paths, this takes forever.
The Quantum Way (The Ghost Hiker): In the quantum world, the robot isn't just one hiker. Thanks to a principle called Superposition, the robot becomes a "ghost" that exists in all possible paths simultaneously. It doesn't have to choose one path; it explores the entire forest at once.

2. The Three Magic Tricks

The authors built a complete system where the robot, the maze, and the rules of the game all exist inside a quantum computer. They used three main "magic tricks":

A. The Superposition Map (State Transitions)

In a normal computer, the robot is at one spot. In this quantum system, the robot is at every spot at once.

Analogy: Imagine a deck of cards. A normal computer looks at one card at a time. The quantum computer fans out the whole deck and looks at every card simultaneously. This allows the robot to see how every possible move affects the maze instantly.

B. The Quantum Calculator (Return Calculation)

The robot needs to add up all the points it gets along a path to see which one is the winner.

Analogy: Instead of a human adding numbers on a calculator one by one, the quantum computer uses Quantum Arithmetic. It's like having a magical abacus that adds up the scores for every single path in the forest at the exact same moment.

C. The Magic Compass (Grover's Search)

This is the most exciting part. Once the robot has explored all paths and calculated the scores, it needs to find the best one.

Analogy: Imagine you have a huge library with a million books, and only one book contains the secret to the treasure.
- Classical Search: You have to open every book one by one until you find it.
- Grover's Algorithm: This is like having a magical compass that gets you to the right book in far fewer looks: roughly a thousand checks instead of a million, because the number of steps grows with the square root of the number of books. The authors used this algorithm to "zoom in" on the best path among all the possibilities the robot explored.

3. What Did They Prove?

The researchers tested this on a simple "maze" (a mathematical model called a Markov Decision Process) with 4 rooms and 2 choices at each step.

They ran the quantum circuit on a simulator (a quantum computer emulated on a classical machine).
They found that their Quantum Robot found the exact same best path as a Classical Robot (using standard Q-learning).
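The Big Win: The Quantum Robot reached this answer in a single search phase instead of many rounds of trial and error, because it evaluated all the paths together. Since the test ran on a simulator, this shows the potential for a speedup, not a measured one on real quantum hardware.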

Why Does This Matter?

Currently, most "Quantum AI" is a hybrid: the brain is quantum, but the body is classical. This paper is special because it's a fully quantum system. The agent, the environment, and the decision-making are all inside the quantum realm.

Real-World Impact:

Self-Driving Cars: Instead of calculating one route at a time, a quantum car could instantly evaluate millions of traffic scenarios to find the safest, fastest route in a split second.
Medical Treatment: Doctors could simulate millions of different treatment plans for a patient simultaneously to find the one with the highest chance of success.
Stock Trading: Investors could instantly analyze every possible market movement to find the most profitable strategy.

The Bottom Line

This paper is a blueprint for a future where computers don't just solve problems faster; they solve them by looking at all possibilities at once. It's like upgrading from a flashlight that illuminates one step at a time to a floodlight that reveals the entire landscape instantly. While we aren't there yet with full-scale quantum hardware, this framework shows us exactly how to build the "Quantum Brain" for the next generation of intelligent machines.

1. Problem Statement

Reinforcement Learning (RL) faces significant scalability challenges in high-dimensional environments where state and action spaces grow exponentially. Classical RL algorithms (e.g., Q-learning, Policy Gradients) often require extensive computational resources and time to converge, particularly in stochastic environments. While hybrid quantum-classical approaches (using Variational Quantum Circuits or quantum-inspired heuristics) have been proposed, they suffer from bottlenecks:

Communication Overhead: They require frequent data exchange between classical and quantum systems, negating potential speedups.
Partial Quantumization: Often, only the agent (policy) is quantum, while the environment remains classical, limiting the exploitation of quantum parallelism.
Limited Scope: Previous "fully quantum" attempts were often restricted to simple single-state bandit problems rather than complex, multi-step Markov Decision Processes (MDPs).

The authors aim to design a fully quantum framework where the agent, the environment, their interactions, and the search for optimal policies occur entirely within the quantum domain, eliminating classical subroutines.

2. Methodology

The paper proposes a Quantum Markov Decision Process (QMDP) framework that maps classical RL components directly onto quantum operations.

A. Quantum Representation of MDP

State and Action Encoding: The MDP is defined with $N$ states and $M$ actions. States ($S$) and actions ($A$) are encoded as orthonormal basis vectors in a Hilbert space using qubits ($n = \lceil \log_2 N \rceil$ for states, $m = \lceil \log_2 M \rceil$ for actions; the ceiling is needed only when $N$ or $M$ is not a power of two).
Superposition Initialization: Instead of sampling a single state-action pair, the system initializes all states and actions into a uniform superposition using Hadamard gates. This allows the agent to explore all possible state-action pairs simultaneously.
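The encoding and initialization described above can be illustrated with a small state-vector sketch. This is not the authors' Qiskit circuit: it is a plain-Python toy that assumes the 4-state, 2-action example used later in the paper, and the helper names `apply_hadamard` and `uniform_superposition` are made up for illustration.

```python
import math

def apply_hadamard(state, qubit):
    """Apply a Hadamard gate to one qubit of a state vector (list of amplitudes)."""
    h = 1.0 / math.sqrt(2.0)
    new_state = list(state)
    stride = 1 << qubit
    for i in range(len(state)):
        if (i & stride) == 0:
            a, b = state[i], state[i | stride]
            new_state[i] = h * (a + b)           # component with this qubit = 0
            new_state[i | stride] = h * (a - b)  # component with this qubit = 1
    return new_state

def uniform_superposition(num_states, num_actions):
    """Return (n, m, amplitudes) after Hadamards on all state and action qubits."""
    n = math.ceil(math.log2(num_states))   # state qubits
    m = math.ceil(math.log2(num_actions))  # action qubits
    total = n + m
    state = [0.0] * (1 << total)
    state[0] = 1.0                         # start in |00...0>
    for q in range(total):
        state = apply_hadamard(state, q)
    return n, m, state

# Assumed example size: 4 states and 2 actions, as in the paper's test MDP.
n, m, psi = uniform_superposition(4, 2)
probs = [abs(a) ** 2 for a in psi]
print(n, m, probs)
```

With 4 states and 2 actions the register has 3 qubits, so each of the 8 state-action pairs ends up with the same measurement probability.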

B. Quantum State Transitions and Rewards

State Transitions: Classical transition probabilities $P(s'|s,a)$ are encoded into quantum amplitudes. The authors use controlled-$R_y(\theta)$ gates on ancillary qubits. The rotation angle $\theta$ is calculated as $2\arcsin(\sqrt{P(s'|s,a)})$. These gates are controlled by the current state and action qubits, ensuring the transition occurs only for specific pairs, effectively simulating the stochastic nature of the environment in superposition.
Reward Mechanism: Rewards are encoded using CNOT gates. If the resulting next state matches a condition for a reward, the reward qubit is flipped (e.g., $|0\rangle \to |1\rangle$). This conditionally entangles the reward with the trajectory.
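A quick numerical check of the angle formula may help. Since $R_y(\theta)|0\rangle = \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$, choosing $\theta = 2\arcsin(\sqrt{P})$ makes the probability of measuring $|1\rangle$ equal to $P$. The sketch below verifies this for a single ancilla qubit; the value 0.7 is a hypothetical transition probability, not one taken from the paper.

```python
import math

def ry_angle(p):
    """Rotation angle that encodes probability p in the |1> amplitude."""
    return 2.0 * math.asin(math.sqrt(p))

def ry_on_zero(theta):
    """Amplitudes of R_y(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>."""
    return math.cos(theta / 2.0), math.sin(theta / 2.0)

p_next = 0.7  # hypothetical P(s'|s,a), chosen only for illustration
theta = ry_angle(p_next)
amp0, amp1 = ry_on_zero(theta)
prob_one = amp1 ** 2  # probability that the ancilla reads 1
print(theta, prob_one)
```

In the full circuit this rotation is applied only when the control qubits hold the matching state-action pair; the sketch covers just the rotation itself.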

C. Multi-Step Interaction and Return Calculation

Temporal Evolution: To simulate $T$ time steps, the framework uses CNOT gates to propagate the "next state" qubits of time $t$ into the "current state" qubits of time $t+1$. This creates a sequential chain of interactions while preserving quantum superposition across all possible trajectories.
Quantum Return Calculation: The cumulative return (discounted sum of rewards) is calculated using quantum arithmetic. The reward qubits from each time step are added into a return register ($|g\rangle$) using a sequence of CNOT and Toffoli gates (quantum adders). Because the adder acts on the whole superposition, the return of every trajectory in the superposition is computed simultaneously.
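To make the adder idea concrete, the sketch below applies CNOT and Toffoli gates to classical bit strings, which is exactly how they act on computational basis states; by linearity, the same circuit handles a superposition of trajectories. It is a simplified illustration, not the paper's adder: it assumes one unit-valued, undiscounted reward bit per step over 3 steps and a 2-bit return register, and the register names (`r0`, `g0`, `g1`) are invented here.

```python
from itertools import product

def cnot(bits, control, target):
    """Flip target if control is 1."""
    if bits[control]:
        bits[target] ^= 1

def toffoli(bits, c1, c2, target):
    """Flip target if both controls are 1."""
    if bits[c1] and bits[c2]:
        bits[target] ^= 1

def add_reward_bit(bits, r, g0, g1):
    """Add reward bit r into the 2-bit register (g1 g0); valid for totals up to 3."""
    toffoli(bits, r, g0, g1)  # carry into the high bit
    cnot(bits, r, g0)         # sum into the low bit

def trajectory_return(rewards):
    """Accumulate a tuple of reward bits into the return register and read it out."""
    bits = {"g0": 0, "g1": 0}
    for t, r in enumerate(rewards):
        bits[f"r{t}"] = r
    for t in range(len(rewards)):
        add_reward_bit(bits, f"r{t}", "g0", "g1")
    return 2 * bits["g1"] + bits["g0"]

# Every possible 3-step reward pattern, i.e. every basis state of the reward qubits.
returns = {rs: trajectory_return(rs) for rs in product((0, 1), repeat=3)}
print(returns)
```

Weighted or discounted rewards, as in the paper's examples with returns of 8 and 9, would need a wider register and a multi-bit adder, but the gate pattern follows the same carry-then-sum structure.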

D. Quantum Trajectory Search (Grover's Algorithm)

Oracle Construction: A quantum oracle is designed to mark trajectories that yield the maximum return. The oracle flips the phase of the quantum state corresponding to high-return trajectories.
Amplitude Amplification: Grover's algorithm is applied to amplify the amplitudes of these marked optimal trajectories. This allows the system to identify the optimal policy (the sequence of actions leading to the highest return) with a quadratic speedup compared to classical search, requiring only $O(\sqrt{N})$ oracle calls instead of $O(N)$, where $N$ here denotes the number of candidate trajectories in the search space, not the number of states.
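The oracle and amplification steps can be sketched as a small classical simulation of Grover's algorithm over a list of trajectory returns. This is an illustrative toy under stated assumptions, not the authors' circuit: the `returns` list is hypothetical, the function name `grover_search` is invented, and the target return must be supplied in advance, which matches the limitation the authors note about oracles needing prior knowledge of the maximum return.

```python
import math

def grover_search(returns, target_return):
    """Simulate Grover amplitude amplification over indexed trajectories.

    returns: list of returns, one per candidate trajectory (hypothetical data).
    target_return: the return value the oracle marks (assumed known beforehand).
    """
    n_items = len(returns)
    marked = [g == target_return for g in returns]
    n_marked = sum(marked)
    if n_marked == 0:
        raise ValueError("no trajectory has the target return")
    # Start from the uniform superposition over all trajectories.
    amps = [1.0 / math.sqrt(n_items)] * n_items
    iterations = math.floor(math.pi / 4.0 * math.sqrt(n_items / n_marked))
    for _ in range(iterations):
        # Oracle: flip the phase of marked trajectories.
        amps = [-a if is_marked else a for a, is_marked in zip(amps, marked)]
        # Diffusion: reflect every amplitude about the mean.
        mean = sum(amps) / n_items
        amps = [2.0 * mean - a for a in amps]
    return iterations, [a * a for a in amps]

# Hypothetical returns for 8 trajectories; index 7 holds the maximum return of 3.
returns = [0, 1, 1, 2, 1, 2, 2, 3]
iterations, probs = grover_search(returns, target_return=3)
best = max(range(len(probs)), key=probs.__getitem__)
print(iterations, best, probs[best])
```

With 8 candidates and one marked item, two iterations raise the probability of measuring the marked trajectory from 1/8 to roughly 0.95, which is the amplification effect the text describes.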

3. Key Contributions

Fully Quantum MDP Implementation: The first framework to model the agent, environment, and their sequential interactions entirely within the quantum domain, removing the need for classical-quantum data conversion.
Quantum Arithmetic for Returns: A novel method for calculating cumulative rewards across multiple time steps using quantum adders, enabling parallel evaluation of trajectory returns.
Trajectory-Level Optimization: Unlike previous works that used Grover's search for single-step action selection, this framework applies Grover's algorithm to search for optimal full-length trajectories in a multi-state, multi-step MDP.
Parallel Exploration: Leveraging quantum superposition to evaluate numerous state-action sequences and their outcomes simultaneously, significantly reducing the number of interactions required to learn a policy.

4. Results and Demonstrations

The authors validated the framework using the IBM Qiskit Aer simulator (statevector simulation) on a classical MDP with 4 states ($s_0$ to $s_3$) and 2 actions ($a_0, a_1$) over 3 time steps.

Validation of Dynamics: The quantum circuit successfully reproduced the state transition probabilities and reward distributions of the classical MDP, confirmed via heat-maps and sample distributions.
Scenario 1 (Fixed Start): Starting from $s_0$ and terminating at $s_3$, Grover's search identified two optimal trajectories with a maximum return of 8. The most frequent optimal trajectory was sampled 20 times.
- Comparison: Classical Q-learning required iterative updates to converge to the same optimal policy (Action $a_0$ at $s_0$ , $a_1$ at $s_2, s_3$ ). The quantum approach found this solution in a single search phase.
Scenario 2 (Variable Start): Starting from any state with equal probability, Grover's search identified trajectories maximizing the return (9). The algorithm correctly identified that action $a_1$ is optimal for all states in this scenario.
Efficiency: The quantum method achieved the same optimal policies as classical Q-learning but demonstrated the potential for sample efficiency (evaluating all paths in parallel) and computational speedup via Grover's search (finding the maximum in $O(\sqrt{N})$ oracle calls, with $N$ the number of candidate trajectories).

5. Significance and Future Outlook

Theoretical Advancement: This work bridges the gap between theoretical quantum algorithms and practical RL by providing a complete, end-to-end quantum framework. It demonstrates that complex decision-making tasks can be solved without classical intervention.
Practical Applications: The authors highlight potential applications in:
- Autonomous Driving: Simultaneous evaluation of multiple driving trajectories for collision avoidance.
- Healthcare: Parallel evaluation of treatment plans to identify the most effective therapy.
- Finance: Real-time optimization of investment portfolios by rapidly identifying high-return paths.
Future Directions: The authors suggest future work should focus on scaling to larger state spaces (optimizing qubit usage), developing oracles that do not require prior knowledge of the maximum return, and exploring alternative search algorithms beyond Grover's.

In conclusion, this paper presents a robust proof-of-concept for Quantum Reinforcement Learning (QRL), demonstrating that a fully quantum approach can effectively model MDPs, calculate returns, and optimize policies with inherent advantages in parallelism and search efficiency over classical methods.