
Scalable Quantum Reinforcement Learning on NISQ Devices with Dynamic-Circuit Qubit Reuse and Grover Optimization

This paper presents a scalable, resource-efficient quantum reinforcement learning framework that utilizes dynamic-circuit qubit reuse and Grover-based amplitude amplification to reduce the qubit complexity of multi-step quantum Markov decision processes from linear to constant while maintaining trajectory fidelity on NISQ hardware.

Original authors: Thet Htar Su, Shaswot Shresthamali, Masaaki Kondo

Published 2026-04-23

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Problem: The "Too Many Rooms" Dilemma

Imagine you are trying to teach a robot how to navigate a maze. In the world of Quantum Reinforcement Learning (QRL), the robot doesn't just walk through the maze; it explores every possible path at the same time using the weird powers of quantum mechanics (like superposition).

However, there was a major bottleneck. In previous methods, if you wanted the robot to plan 10 steps ahead, you needed 10 separate sets of quantum "rooms" (qubits) to store the robot's position at each step. If you wanted to plan 1,000 steps, you'd need 1,000 sets of rooms.

The Analogy: Think of this like a movie set. In the old way, to film a scene where a character walks down a hallway for 10 seconds, you had to build 10 separate, identical hallways on the set, one for each second. If the movie was long, you'd run out of studio space immediately. This is called Linear Scaling. Since current quantum computers (called NISQ devices) are small and noisy, they simply don't have enough "studio space" (qubits) to film long movies.

The Solution: The "Recycling Room" Trick

The authors of this paper introduced a clever new way to film the movie. Instead of building 10 separate hallways, they built one hallway and used a "magic reset button."

The Action: The robot takes a step.
The Snapshot: The computer takes a picture of where the robot is (measurement).
The Reset: The robot is instantly teleported back to the starting line of that specific hallway, but the memory of where it ended up is saved in a notebook (classical memory).
The Reuse: The same hallway is now ready for the next step.

The Analogy: Imagine you are playing a board game. Instead of buying a new board for every turn you make, you play on one board. After you move your piece, you write down your new position on a scorecard, then you pick up your piece and put it back on the starting square to make your next move. You only need one board no matter how long the game lasts.

This is called Dynamic-Circuit Qubit Reuse. It changes the requirement from $N$ sets of rooms for $N$ steps to a single set of rooms, however many steps are planned.

The Secret Weapon: Grover's "Super Search"

Once the robot has played through the game and generated many possible paths (trajectories), the computer needs to find the best path (the one with the most points/rewards).

In a classical computer, you would have to check every single path one by one, like looking for a needle in a haystack.

The authors used Grover's Algorithm, which is like a magical metal detector.

Classical Search: You walk through the haystack, checking every piece of straw.
Grover's Search: Each wave of a magic wand makes the needle glow a little brighter, so after a number of waves that grows only with the square root of the haystack's size, the needle is easy to spot.

In this paper, they combined the "Recycling Room" trick with this "Magic Wand." They let the quantum computer generate all the paths using the recycled qubits, and then used Grover's algorithm to amplify the probability of the best path, making it much more likely to be found when they finally look.

What Did They Actually Do?

Built the Framework: They created a system where a quantum agent interacts with a quantum environment, but instead of using new qubits for every second of time, they measure, reset, and reuse the same 7 qubits over and over again.
Proved it Works: They simulated this on a computer and showed it produces the same results as the old, space-hogging method while using 66% fewer qubits (7 instead of 21 for a 3-step plan).
Tested on Real Hardware: They ran this on a real, noisy quantum computer (an IBM Heron processor). Despite the computer being "noisy" (prone to errors), the system successfully found the optimal path, proving this method works on real-world devices today.

Why Does This Matter?

Before this paper, fully quantum reinforcement learning was stuck in the "toy phase." You could only solve very simple, short problems because you ran out of qubits too fast.

This paper breaks that barrier. It shows that we can now teach quantum agents to plan for longer, more complex futures without needing a quantum computer the size of a city. It turns the "impossible" into the "doable" on the small, imperfect quantum computers we have right now.

In a nutshell: They figured out how to make a quantum computer play a long game of chess by reusing the same 7 squares on the board instead of needing a new board for every move, and then used a quantum search spell to make the winning strategy far more likely to turn up.

1. Problem Statement

The paper addresses a critical scalability bottleneck in Fully Quantum Reinforcement Learning (QRL). While previous fully quantum approaches (specifically the framework by Su et al., 2025) demonstrated the feasibility of Quantum Markov Decision Processes (QMDPs), they suffered from linear qubit scaling.

The Bottleneck: In static, unrolled QMDP architectures, modeling an interaction horizon of $T$ steps requires allocating independent quantum registers for every time step. If a single interaction requires $k$ qubits (e.g., 7 qubits for a 4-state, 2-action environment), a horizon of $T$ steps requires $k \times T$ qubits.
The Consequence: This linear growth ($O(T)$) makes multi-step QRL infeasible on current Noisy Intermediate-Scale Quantum (NISQ) devices, which have limited qubit counts and connectivity. For example, a 3-step interaction would require 21 qubits, quickly exhausting resources on near-term hardware.
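The scaling argument above can be checked with a few lines of arithmetic. This is a minimal sketch that counts only the $k$-qubit interaction register (any return register or ancillas are left out), and the function names are illustrative rather than taken from the paper.

```python
def static_qubits(k: int, horizon: int) -> int:
    """Unrolled circuit: one fresh k-qubit register per time step."""
    return k * horizon


def dynamic_qubits(k: int, horizon: int) -> int:
    """Measure-and-reset circuit: the same k qubits serve every step."""
    return k


k = 7  # qubits per interaction, the figure quoted for the 4-state, 2-action case
for horizon in (1, 3, 10):
    s, d = static_qubits(k, horizon), dynamic_qubits(k, horizon)
    saving = 1 - d / s  # fraction of interaction qubits saved by reuse
    print(f"T={horizon:>2}: static={s:>3}  dynamic={d}  saving={saving:.1%}")
```

For $T = 3$ this gives 21 versus 7 qubits, and the gap widens with every additional step.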

2. Methodology

The authors propose a Dynamic-Circuit QRL Framework that decouples the interaction horizon from the physical qubit count. The methodology integrates three core components:

A. Dynamic Circuit Execution & Qubit Reuse

Instead of unrolling the circuit statically, the framework employs mid-circuit measurement and reset to recycle a fixed set of physical qubits across sequential time steps.

Workflow:
1. Interaction: The agent and environment interact coherently within a single time step using a fixed register (State, Action, Next-State, Reward).
2. Measurement: The state, action, and reward registers are measured to collapse the trajectory for that specific step.
3. Reset & Propagation: The measured qubits are reset to $|0\rangle$. Crucially, the Next-State outcome is classically stored and then coherently propagated back to the State register (via CNOT gates) to serve as the input for the next time step.
4. Reuse: The same physical qubits are re-initialized (e.g., Action qubits re-superposed) for the next step.
Result: The physical qubit requirement becomes constant ($O(1)$), independent of the horizon length $T$.
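The four workflow steps can be traced with a purely classical emulation. The sketch below is not the authors' circuit: it replaces the quantum registers with a small dictionary, samples a uniform random action in place of measuring a superposed action qubit, and uses hypothetical transition and reward tables (`NEXT_STATE`, `REWARD`) invented for illustration. What it does show is the bookkeeping: one fixed register is reused at every step, and only the classical log grows with the horizon.

```python
import random

# Hypothetical 4-state, 2-action toy environment; these tables are
# illustrative and are NOT taken from the paper.
NEXT_STATE = [[1, 2], [3, 0], [3, 1], [3, 3]]  # NEXT_STATE[s][a]
REWARD = [[0, 1], [1, 0], [2, 0], [0, 0]]      # REWARD[s][a]


def run_episode(horizon, start_state=0, seed=None):
    """Classically emulate the measure / reset / reuse loop."""
    rng = random.Random(seed)
    register = {"state": start_state, "action": 0, "next_state": 0, "reward": 0}
    log = []  # classical memory: one (state, action, reward) record per step

    for _ in range(horizon):
        # 1. Interaction: a uniform random action stands in for measuring
        #    an action qubit prepared in an equal superposition.
        register["action"] = rng.randrange(2)
        s, a = register["state"], register["action"]
        register["next_state"] = NEXT_STATE[s][a]
        register["reward"] = REWARD[s][a]

        # 2. Measurement: copy this step's outcome into classical memory.
        log.append((s, a, register["reward"]))
        carried = register["next_state"]

        # 3. Reset and propagation: clear every field, then load the stored
        #    next state as the input state of the following step.
        register = dict.fromkeys(register, 0)
        register["state"] = carried
        # 4. Reuse: the same register enters the next loop iteration.

    return log, len(register)


trajectory, width = run_episode(horizon=3, seed=0)
print(trajectory, "register fields:", width)
```

The register width returned by `run_episode` is the same for any horizon, which is the classical counterpart of the $O(1)$ qubit requirement.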

B. Quantum Trajectory Accumulation

While interaction registers are measured and reset, a dedicated Return Register ($qReturn$) remains unmeasured throughout the entire $T$-step process.

After each step, the reward $r_t$ is coherently added to the return register using quantum arithmetic (controlled addition), accumulating the total discounted return $G = \sum_t \gamma^t r_t$.
This allows the system to maintain a quantum representation of the total trajectory reward without needing to store the entire history of states in qubits.
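As a classical reference for what the return register accumulates, the discounted sum can be computed step by step. This sketch assumes the time index starts at $t = 0$ and uses made-up rewards; the quantum version performs the additions coherently on a register rather than on a Python float.

```python
def discounted_return(rewards, gamma):
    """Return G = sum over t of gamma**t * rewards[t], with t starting at 0."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r  # add this step's discounted reward
    return total


rewards = [1, 0, 2]  # hypothetical per-step rewards for a 3-step trajectory
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
print(discounted_return(rewards, gamma=1.0))  # undiscounted sum: 3.0
```

Only the running total has to be kept, which is why the full state history never needs to occupy qubits.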

C. Grover-Based Policy Optimization

Once the trajectory generation is complete, the framework utilizes Grover's Algorithm for policy search.

Oracle Construction: An oracle marks basis states in the return register where the accumulated return equals the optimal value ($g^*$).
Amplitude Amplification: Grover iterations amplify the probability amplitudes of these optimal trajectories.
Outcome: A final measurement yields the optimal state-action sequence (policy) with high probability, providing a quadratic speedup over classical exhaustive search.
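The oracle-plus-amplification loop can be illustrated with a small amplitude-vector simulation. This is a textbook Grover sketch under simplifying assumptions, not the paper's circuit: it starts from a uniform superposition over `n_items` candidates (the real trajectory distribution need not be uniform), and the marked index is hypothetical.

```python
import math


def grover_success_probability(n_items, marked, iterations):
    """Simulate Grover iterations on a uniform superposition.

    Returns the total probability of measuring a marked index.
    """
    amps = [1 / math.sqrt(n_items)] * n_items
    for _ in range(iterations):
        # Oracle: flip the sign of every marked amplitude.
        for i in marked:
            amps[i] = -amps[i]
        # Diffusion: reflect all amplitudes about their mean.
        mean = sum(amps) / n_items
        amps = [2 * mean - a for a in amps]
    return sum(amps[i] ** 2 for i in marked)


def optimal_iterations(n_items, n_marked):
    """Iteration count that brings the success probability closest to 1."""
    theta = math.asin(math.sqrt(n_marked / n_items))
    return max(0, round(math.pi / (4 * theta) - 0.5))


n, best = 8, {5}  # 8 candidate trajectories, index 5 marked (hypothetical)
k = optimal_iterations(n, len(best))
print("iterations:", k)
print("before:", grover_success_probability(n, best, 0))
print("after: ", grover_success_probability(n, best, k))
```

With 8 candidates and one marked item, two iterations raise the success probability from 1/8 to about 0.945. The iteration count grows as $\sqrt{N/M}$ for $M$ marked items among $N$, which is the source of the quadratic speedup; running more iterations than the optimum lowers the probability again.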

3. Key Contributions

Reframing Resource Scaling: The paper demonstrates that the linear qubit scaling in QMDPs is an artifact of static circuit construction, not an inherent property of quantum MDPs. It reduces the physical-qubit requirement from $O(T)$ to $O(1)$ in the horizon length $T$.
Correctness-Preserving Qubit Reuse: The authors prove that dynamic execution with mid-circuit measurement and reset reproduces the exact trajectory distribution and optimal policy structure of the static unrolled formulation. It is not a heuristic approximation but a mathematically equivalent transformation.
Unified Quantum-Native Architecture: The framework integrates trajectory generation, return evaluation, and policy optimization (via Grover search) into a single quantum-native pipeline, eliminating the need for classical post-processing or intermediate data conversion.
NISQ Compatibility: The architecture is specifically designed for current hardware constraints, utilizing dynamic circuits to maximize utility on devices with limited qubit counts.

4. Results

The framework was validated through both ideal simulation and execution on real quantum hardware.

Simulation (IBM Qiskit Aer):
- Fidelity: The dynamic-circuit implementation produced trajectory statistics identical to the static baseline (Ref. [7]).
- Efficiency: Achieved a 66% reduction in qubit usage for a 3-step horizon (7 qubits vs. 21 qubits).
- Policy Recovery: Successfully identified the optimal policy (action $a_0$ for state $s_0$; action $a_1$ for states $s_2$ and $s_3$), matching the static baseline.
Hardware Execution (IBM Heron-class, ibm_torino):
- Feasibility: The 3-step dynamic QMDP was successfully executed on a 133-qubit processor.
- Noise Handling: The authors implemented a 2000 ns delay between measurement and reset to stabilize qubits, mitigating timing errors.
- Grover Search: Despite hardware noise, the algorithm successfully sampled the optimal trajectories (T-151 and T-143) with the highest returns, confirming that Grover-based amplification works within a dynamic-circuit environment.
- Comparison: The static formulation needs three times as many interaction qubits for this problem (21 vs. 7), and the gap widens with every added step; the dynamic approach demonstrated practical viability at a constant qubit count.

5. Significance

This work represents a pivotal step toward scalable, fully quantum reinforcement learning on near-term hardware.

Overcoming the NISQ Barrier: By breaking the linear dependence between planning horizon and qubit count, it enables the execution of multi-step decision processes on devices that previously could only handle single-step interactions.
Architectural Paradigm Shift: It establishes that dynamic circuits are not just for error correction but are essential for resource-efficient algorithm design in QRL.
Future Pathway: The framework provides a blueprint for scaling QRL to larger, more complex environments (longer horizons, larger state spaces) as quantum hardware improves, without the qubit count growing with the planning horizon.

In summary, the paper successfully transforms QRL from a resource-prohibitive theoretical concept into a practically executable architecture for NISQ devices, maintaining full algorithmic correctness while drastically reducing hardware requirements.