Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning

This paper introduces iterated Shared Q-Learning (iS-QL), a resource-efficient method that bridges the performance gap between target-free and target-based reinforcement learning. The idea is to share most network parameters between the online and target estimates, keeping only a copy of the last linear layer as the target, which achieves the stability of a target network without its memory overhead.

Théo Vincent, Yogesh Tripathi, Tim Faust, Abdullah Akgül, Yaniv Oren, Melih Kandemir, Jan Peters, Carlo D'Eramo

Published 2026-03-02

Imagine you are teaching a robot to play a video game, like Super Mario or Pong. The robot learns by trying things, making mistakes, and getting points (rewards). To get really good, it needs to predict: "If I jump now, how many points will I get later?"

In the world of Artificial Intelligence, there are two main ways to teach this robot to make those predictions:

The Old Way: The "Strict Teacher" (Target-Based)

Imagine the robot has a Strict Teacher standing next to it.

  • The robot tries to guess the score.
  • The Teacher says, "No, that's wrong. Here is the correct score based on what I know."
  • The robot learns from the Teacher's answer.
  • The Problem: Every time the robot learns something new, the Teacher has to stop, think, and update their own knowledge before they can teach again. This takes time and memory. Also, the robot has to carry two sets of notes: one for itself and one for the Teacher. This is heavy and slow.
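In standard reinforcement-learning terms, the Strict Teacher is a target network: a complete frozen copy of the online Q-network that is only synced every so often. Here is a minimal sketch with a toy linear Q-function (illustrative only, not the paper's code; the sizes, learning rate, and sync schedule are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear Q-function over 4-dim states and 2 actions.
def q_values(params, state):
    return state @ params  # shape: (2,)

online = rng.normal(size=(4, 2))
target = online.copy()          # the "Teacher": a full frozen copy

gamma, lr, sync_every = 0.99, 0.05, 100

for step in range(300):
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    a, r = rng.integers(2), rng.normal()

    # The TD target comes from the frozen copy, not the moving online params.
    td_target = r + gamma * q_values(target, s_next).max()
    td_error = td_target - q_values(online, s)[a]
    online[:, a] += lr * td_error * s   # gradient step on the online params only

    if (step + 1) % sync_every == 0:
        target = online.copy()          # the Teacher "stops and updates"
```

Note the cost the analogy points at: `target` duplicates every parameter of `online`, so memory doubles.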

The "Wild" Way: The "Self-Taught" (Target-Free)

Now, imagine you take the Teacher away.

  • The robot tries to guess the score.
  • It immediately uses its own current guess to teach itself.
  • The Problem: This is like a student trying to learn math by only using the answers they just wrote down. If they make a small mistake, they use that mistake to learn, which makes the next mistake bigger. It's like a rumor spreading in a hallway; by the time it gets to the end, it's completely wrong. The robot gets confused and unstable.
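In standard terms this is target-free bootstrapping: the TD target is computed from the very same parameters that are being updated, so errors can feed back into themselves. A toy sketch (again with a hypothetical linear Q-function and made-up hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
online = rng.normal(size=(4, 2))   # the robot's one and only set of "notes"
gamma, lr = 0.99, 0.01

for step in range(200):
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    a, r = rng.integers(2), rng.normal()

    # The bootstrap target uses the SAME moving parameters being trained:
    td_target = r + gamma * (s_next @ online).max()
    td_error = td_target - (s @ online)[a]
    online[:, a] += lr * td_error * s
```

The only difference from the target-based update is that `online` appears on both sides of the TD target, which is exactly the "rumor in a hallway" feedback loop.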

The New Solution: The "Frozen Head" (iS-QL)

The authors of this paper, published at ICLR 2026, came up with a clever middle ground. They call it iterated Shared Q-Learning (iS-QL).

Here is the analogy:
Imagine the robot is a chef learning to cook a complex dish.

  1. The Body (Shared Features): The chef has a full kitchen with a stove, knives, and ingredients. This is the "online network." It's constantly moving, chopping, and heating things up. This part is updated every second.
  2. The Hat (The Frozen Head): Instead of hiring a whole second chef (the Teacher) with a full kitchen, the robot just puts on a special hat that represents the "last step" of the recipe.
    • The robot uses its current, active kitchen (the body) to cook the dish.
    • But when it needs to check if the dish is good, it looks at the Hat. The Hat is frozen; it doesn't change while the robot is cooking. It holds a stable version of the "last step."
    • The robot compares its current cooking to the Hat's version.

Why is this genius?

  • Lightweight: The robot doesn't need a whole second kitchen (memory). It just needs one Hat. This saves a massive amount of computer memory.
  • Stable: Because the Hat doesn't change while the robot is cooking, the robot doesn't get confused by its own moving parts. It stays stable.
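Stripped of the analogy, the Hat is a frozen copy of only the last linear layer: the feature trunk (the "body") is shared between the online estimate and the target, so the only extra memory is one small weight matrix. A toy sketch under that reading (a fixed random feature map stands in for the trained trunk for brevity; all names and sizes are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "body": a feature map standing in for the network trunk.
W_body = rng.normal(size=(8, 4))
def features(state):
    return np.tanh(state @ W_body)       # shape: (4,)

head_online = rng.normal(size=(4, 2))    # trainable last linear layer
head_frozen = head_online.copy()         # the "Hat": only this is copied

gamma, lr, sync_every = 0.99, 0.05, 50

for step in range(200):
    s, s_next = rng.normal(size=8), rng.normal(size=8)
    a, r = rng.integers(2), rng.normal()

    phi, phi_next = features(s), features(s_next)
    # Bootstrap target: the SAME shared features, but the frozen head.
    td_target = r + gamma * (phi_next @ head_frozen).max()
    td_error = td_target - (phi @ head_online)[a]
    head_online[:, a] += lr * td_error * phi

    if (step + 1) % sync_every == 0:
        head_frozen = head_online.copy()  # refresh the Hat
```

Here the extra memory is the 4×2 head rather than a full 8×4 + 4×2 copy of the whole network, which is where the savings come from.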

The "Superpower": Learning in Parallel

The paper takes this a step further. Imagine the robot doesn't just wear one Hat, but a stack of Hats (let's say 9 hats).

  • Hat #1 represents the recipe step from 1 second ago.
  • Hat #2 represents the step from 2 seconds ago.
  • ...
  • Hat #9 represents the step from 9 seconds ago.

The robot learns all these steps at the same time. It's like watching a movie and learning the plot, the character development, and the ending all in one go, rather than waiting for the movie to finish to understand the beginning.
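One way to read the stack of Hats: keep a chain of K small heads on top of the shared trunk, where head k is trained toward a Bellman target built from head k-1, so several Bellman iterations are learned simultaneously in one pass. The sketch below is a guess at that structure (the window-rolling schedule, sizes, and hyperparameters are assumptions, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

W_body = rng.normal(size=(8, 4))         # shared trunk (fixed here for brevity)
def features(state):
    return np.tanh(state @ W_body)

K = 3                                    # number of "Hats" in the stack
heads = [rng.normal(size=(4, 2)) for _ in range(K + 1)]
# heads[0] is the frozen base; heads[k] is trained toward a target built
# from heads[k - 1], so K Bellman iterations are learned in parallel.

gamma, lr = 0.99, 0.05

for step in range(200):
    s, s_next = rng.normal(size=8), rng.normal(size=8)
    a, r = rng.integers(2), rng.normal()
    phi, phi_next = features(s), features(s_next)

    for k in range(1, K + 1):
        td_target = r + gamma * (phi_next @ heads[k - 1]).max()
        td_error = td_target - (phi @ heads[k])[a]
        heads[k][:, a] += lr * td_error * phi

    if (step + 1) % 100 == 0:
        # Assumed window roll: the last head becomes the new frozen base.
        heads[0] = heads[-1].copy()
```

Each head is just one small linear layer, so even a stack of them stays far cheaper than duplicating the whole network.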

The Result

The researchers tested this on many different games (from simple Atari games to complex robot walking tasks and even language puzzles like Wordle).

  • Before: Without a Teacher, the robot was slow and made mistakes. With a Teacher, it was fast but needed too much memory.
  • Now: With the "Stack of Hats" (iS-QL), the robot is fast (learning speed is high) and light (it uses half the memory of the old Teacher method).

In short: They found a way to give the robot a "stable memory" without needing to build a whole second brain. It's like giving a runner a pair of running shoes that are light enough to fly but sturdy enough to protect their feet, allowing them to run faster than ever before.
