Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning

This paper introduces iterated Shared Q-Learning (iS-QL), a resource-efficient method that bridges the performance gap between target-free and target-based reinforcement learning. The idea is to share most network parameters between the online and target estimates, keeping only a copy of the last linear layer as the target, which achieves the stability of a target network without its memory overhead.

Théo Vincent, Yogesh Tripathi, Tim Faust, Abdullah Akgül, Yaniv Oren, Melih Kandemir, Jan Peters, Carlo D'Eramo

Published 2026-03-02

Imagine you are teaching a robot to play a video game, like Super Mario or Pong. The robot learns by trying things, making mistakes, and getting points (rewards). To get really good, it needs to predict: "If I jump now, how many points will I get later?"

In the world of Artificial Intelligence, there are two main ways to teach this robot to make those predictions:

The Old Way: The "Strict Teacher" (Target-Based)

Imagine the robot has a Strict Teacher standing next to it.

  • The robot tries to guess the score.
  • The Teacher says, "No, that's wrong. Here is the correct score based on what I know."
  • The robot learns from the Teacher's answer.
  • The Problem: Every time the robot learns something new, the Teacher has to stop, think, and update their own knowledge before they can teach again. This takes time and memory. Also, the robot has to carry two sets of notes: one for itself and one for the Teacher. This is heavy and slow.
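In standard reinforcement-learning terms, the Strict Teacher is a target network: a complete frozen copy of the online Q-network that is only synced every so often. Here is a minimal sketch with a toy linear Q-function (illustrative only, not the paper's code; the sizes, learning rate, and sync schedule are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear Q-function over 4-dim states and 2 actions.
def q_values(params, state):
    return state @ params  # shape: (2,)

online = rng.normal(size=(4, 2))
target = online.copy()          # the "Teacher": a full frozen copy

gamma, lr, sync_every = 0.99, 0.05, 100

for step in range(300):
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    a, r = rng.integers(2), rng.normal()

    # The TD target comes from the frozen copy, not the moving online params.
    td_target = r + gamma * q_values(target, s_next).max()
    td_error = td_target - q_values(online, s)[a]
    online[:, a] += lr * td_error * s   # gradient step on the online params only

    if (step + 1) % sync_every == 0:
        target = online.copy()          # the Teacher "stops and updates"
```

Note the cost the analogy points at: `target` duplicates every parameter of `online`, so memory doubles.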

The "Wild" Way: The "Self-Taught" (Target-Free)

Now, imagine you take the Teacher away.

  • The robot tries to guess the score.
  • It immediately uses its own current guess to teach itself.
  • The Problem: This is like a student trying to learn math by only using the answers they just wrote down. If they make a small mistake, they use that mistake to learn, which makes the next mistake bigger. It's like a rumor spreading in a hallway; by the time it gets to the end, it's completely wrong. The robot gets confused and unstable.
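In standard terms this is target-free bootstrapping: the TD target is computed from the very same parameters that are being updated, so errors can feed back into themselves. A toy sketch (again with a hypothetical linear Q-function and made-up hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
online = rng.normal(size=(4, 2))   # the robot's one and only set of "notes"
gamma, lr = 0.99, 0.01

for step in range(200):
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    a, r = rng.integers(2), rng.normal()

    # The bootstrap target uses the SAME moving parameters being trained:
    td_target = r + gamma * (s_next @ online).max()
    td_error = td_target - (s @ online)[a]
    online[:, a] += lr * td_error * s
```

The only difference from the target-based update is that `online` appears on both sides of the TD target, which is exactly the "rumor in a hallway" feedback loop.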

The New Solution: The "Frozen Head" (iS-QL)

The authors of this paper, published at ICLR 2026, came up with a clever middle ground. They call it iterated Shared Q-Learning (iS-QL).

Here is the analogy:
Imagine the robot is a chef learning to cook a complex dish.

  1. The Body (Shared Features): The chef has a full kitchen with a stove, knives, and ingredients. This is the "online network." It's constantly moving, chopping, and heating things up. This part is updated every second.
  2. The Hat (The Frozen Head): Instead of hiring a whole second chef (the Teacher) with a full kitchen, the robot just puts on a special hat that represents the "last step" of the recipe.
    • The robot uses its current, active kitchen (the body) to cook the dish.
    • But when it needs to check if the dish is good, it looks at the Hat. The Hat is frozen; it doesn't change while the robot is cooking. It holds a stable version of the "last step."
    • The robot compares its current cooking to the Hat's version.

Why is this genius?

  • Lightweight: The robot doesn't need a whole second kitchen (memory). It just needs one Hat. This saves a massive amount of computer memory.
  • Stable: Because the Hat doesn't change while the robot is cooking, the robot doesn't get confused by its own moving parts. It stays stable.
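Stripped of the analogy, the Hat is a frozen copy of only the last linear layer: the feature trunk (the "body") is shared between the online estimate and the target, so the only extra memory is one small weight matrix. A toy sketch under that reading (a fixed random feature map stands in for the trained trunk for brevity; all names and sizes are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "body": a feature map standing in for the network trunk.
W_body = rng.normal(size=(8, 4))
def features(state):
    return np.tanh(state @ W_body)       # shape: (4,)

head_online = rng.normal(size=(4, 2))    # trainable last linear layer
head_frozen = head_online.copy()         # the "Hat": only this is copied

gamma, lr, sync_every = 0.99, 0.05, 50

for step in range(200):
    s, s_next = rng.normal(size=8), rng.normal(size=8)
    a, r = rng.integers(2), rng.normal()

    phi, phi_next = features(s), features(s_next)
    # Bootstrap target: the SAME shared features, but the frozen head.
    td_target = r + gamma * (phi_next @ head_frozen).max()
    td_error = td_target - (phi @ head_online)[a]
    head_online[:, a] += lr * td_error * phi

    if (step + 1) % sync_every == 0:
        head_frozen = head_online.copy()  # refresh the Hat
```

Here the extra memory is the 4×2 head rather than a full 8×4 + 4×2 copy of the whole network, which is where the savings come from.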

The "Superpower": Learning in Parallel

The paper takes this a step further. Imagine the robot doesn't just wear one Hat, but a stack of Hats (let's say 9 hats).

  • Hat #1 represents the recipe step from 1 second ago.
  • Hat #2 represents the step from 2 seconds ago.
  • ...
  • Hat #9 represents the step from 9 seconds ago.

The robot learns all these steps at the same time. It's like watching a movie and learning the plot, the character development, and the ending all in one go, rather than waiting for the movie to finish to understand the beginning.
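One way to read the stack of Hats: keep a chain of K small heads on top of the shared trunk, where head k is trained toward a Bellman target built from head k-1, so several Bellman iterations are learned simultaneously in one pass. The sketch below is a guess at that structure (the window-rolling schedule, sizes, and hyperparameters are assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

W_body = rng.normal(size=(8, 4))         # shared trunk (fixed here for brevity)
def features(state):
    return np.tanh(state @ W_body)

K = 3                                    # number of "Hats" in the stack
heads = [rng.normal(size=(4, 2)) for _ in range(K + 1)]
# heads[0] is the frozen base; heads[k] is trained toward a target built
# from heads[k - 1], so K Bellman iterations are learned in parallel.

gamma, lr = 0.99, 0.05

for step in range(200):
    s, s_next = rng.normal(size=8), rng.normal(size=8)
    a, r = rng.integers(2), rng.normal()
    phi, phi_next = features(s), features(s_next)

    for k in range(1, K + 1):
        td_target = r + gamma * (phi_next @ heads[k - 1]).max()
        td_error = td_target - (phi @ heads[k])[a]
        heads[k][:, a] += lr * td_error * phi

    if (step + 1) % 100 == 0:
        # Assumed window roll: the last head becomes the new frozen base.
        heads[0] = heads[-1].copy()
```

Each head is just one small linear layer, so even a stack of them stays far cheaper than duplicating the whole network.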

The Result

The researchers tested this on many different games (from simple Atari games to complex robot walking tasks and even language puzzles like Wordle).

  • Before: Without a Teacher, the robot was slow and made mistakes. With a Teacher, it was fast but needed too much memory.
  • Now: With the "Stack of Hats" (iS-QL), the robot is fast (learning speed is high) and light (it uses half the memory of the old Teacher method).

In short: They found a way to give the robot a "stable memory" without needing to build a whole second brain. It's like giving a runner a pair of running shoes that are light enough to fly but sturdy enough to protect their feet, allowing them to run faster than ever before.
