Towards Batch-to-Streaming Deep Reinforcement Learning for Continuous Control

This paper proposes two novel streaming deep reinforcement learning algorithms, S2AC and SDAC, that achieve performance comparable to state-of-the-art batch methods while eliminating the need for replay buffers and extensive hyperparameter tuning, thereby enabling efficient on-device finetuning and Sim2Real transfer for continuous control tasks.

Riccardo De Monte, Matteo Cederle, Gian Antonio Susto

Published 2026-03-10

Imagine you are teaching a robot dog how to walk.

The Old Way: The "Library Study" Method

For a long time, the best way to teach robots was the Batch Learning method. Think of this like a student studying for a big exam in a quiet library.

  • How it works: The robot tries something, fails, remembers it, tries again, fails again. It collects thousands of these "failures and successes" in a giant notebook (called a Replay Buffer).
  • The Problem: Once the notebook is full, the robot sits down, opens the book, and studies all the notes at once to figure out the pattern. It's very smart and learns efficiently, but it's slow and requires a massive amount of memory (like a heavy laptop).
  • The Limitation: You can't put this heavy laptop inside a tiny, battery-powered robot that needs to learn while it's actually walking around in the real world. The robot would run out of battery or memory before it finished studying.
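The "giant notebook" above can be sketched in a few lines of Python. This is a generic replay-buffer sketch, not the paper's code; the class and method names (`ReplayBuffer`, `push`, `sample`) and all numbers are illustrative.

```python
import random
from collections import deque

# Minimal sketch of the "giant notebook": a generic replay buffer.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)  # oldest notes fall out when full

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Batch methods study a random slice of past experience at once;
        # this stabilizes learning but costs memory.
        return random.sample(list(self.memory), batch_size)

buffer = ReplayBuffer()
for step in range(1000):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(32)  # "open the notebook" and study 32 old experiences
```

Note how every transition is stored before any studying happens: this is exactly the memory cost that rules out tiny on-device hardware.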

The New Idea: The "Streaming" Method

Recently, researchers developed Streaming Learning. This is like a student who learns by walking down the street.

  • How it works: The robot tries a step, sees if it falls, and immediately adjusts its balance right then and there. It doesn't write anything down in a notebook. It just learns from the very next step.
  • The Benefit: It's incredibly lightweight. It fits on a tiny chip and uses very little power.
  • The Problem: Because it doesn't look back at its past mistakes, it's often "dumber" than the library student. It gets confused easily, needs a lot of trial and error, and is very sensitive to how you teach it (like coffee that only tastes right at one exact brewing temperature).
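The "learn from the very next step" loop can be sketched with a one-step temporal-difference update. This is a generic TD(0)-style sketch with made-up numbers, not the paper's algorithm:

```python
# One-step streaming update: no notebook, each experience is used once
# and then thrown away. All constants here are illustrative.
alpha = 0.1    # step size: streaming methods are notoriously sensitive to this
gamma = 0.99   # discount factor: how much the robot cares about the future

def stream_update(value, reward, next_value):
    # Nudge the estimate immediately, using only the single latest transition.
    td_error = reward + gamma * next_value - value
    return value + alpha * td_error

value = 0.0
for _ in range(200):
    # The robot takes one step, observes a reward of 1.0, adapts on the spot.
    value = stream_update(value, reward=1.0, next_value=value)
```

Notice there is no memory beyond the current estimate itself, which is why this fits on a tiny chip, and also why a badly chosen `alpha` can derail it with no stored history to fall back on.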

The Big Gap: The "Sim2Real" Problem

Here is the tricky part of the story:

  1. We usually train robots in a Video Game (Simulation) first. We use the "Library Method" (Batch) there because we have powerful computers. The robot learns to walk perfectly in the game.
  2. Then, we put the robot in the Real World. The real world is messy. The floor is slippery, the wind blows, and the robot's joints are slightly different than in the game.
  3. We want the robot to Fine-Tune its skills in the real world. But we can't use the "Library Method" because the robot is too small to carry the notebook. We need the "Streaming Method."

The Conflict: The "Library" teacher (Batch) and the "Street" teacher (Streaming) speak different languages. If you take a robot trained in the library and suddenly switch it to the street method, it often panics and forgets everything. It's like switching a student from studying with a textbook to learning by watching a fast-paced TikTok video without any context.

The Solution: "Batch-to-Streaming"

This paper introduces two new algorithms, S2AC and SDAC, which act as a universal translator.

  1. They are "Streaming" but "Smart": These new methods learn on the fly (like the street student) but are designed to understand the lessons taught by the "Library" methods (SAC and TD3). They don't need a heavy notebook, but they learn almost as well as the ones that do.
  2. They are "Plug-and-Play": Unlike previous streaming methods that required you to tweak dozens of settings (like tuning a radio to find a signal), these new methods work out of the box. You don't need to be a math genius to make them work.
  3. The Secret Sauce (The "Optimizer"): The authors discovered that the "Library" methods use a specific type of math engine (called an optimizer) that builds up heavy "muscle memory": technically, the network's weights grow very large (large weight norms), which makes it hard to adapt later. They found a way to use a lighter, more flexible engine during the initial training. This keeps the robot's "muscles" loose and ready to adapt when it switches to the real world.
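The "loose muscles" idea can be illustrated with a toy pretraining loop. One generic way to keep weight norms small is a norm-penalizing term such as weight decay; whether this matches the authors' exact mechanism is an assumption here, and the loss and numbers are made up:

```python
# Toy illustration of "heavy vs. loose muscles": the same pretraining loop
# with and without a norm-controlling term (plain weight decay here). This
# is a generic stand-in for a "lighter engine", not the authors' optimizer.
def pretrain(weight_decay, steps=500, lr=0.1):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 10.0)                # gradient of the toy loss (w - 10)^2
        w -= lr * (grad + weight_decay * w)  # decay gently pulls w back toward 0
    return w

heavy = pretrain(weight_decay=0.0)  # settles at w = 10: big "muscle memory"
light = pretrain(weight_decay=1.0)  # settles at a smaller w: easier to adjust
```

Both runs solve the same toy task, but the norm-controlled one ends with smaller weights, mirroring the paper's point that how you pretrain determines how painful the switch to streaming will be.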

The Analogy: The Pilot and the Co-Pilot

Think of the robot's brain as a pilot.

  • Batch Learning is the Senior Pilot who has flown 10,000 hours in a simulator. They are perfect but need a huge cockpit with lots of screens.
  • Streaming Learning is the Co-Pilot who is learning to fly in a tiny, open-cockpit plane. They are agile but inexperienced.
  • The Problem: If you try to swap the Senior Pilot for the Co-Pilot mid-flight, the plane crashes because they don't know the same procedures.
  • The Paper's Contribution: They trained the Co-Pilot using the Senior Pilot's exact flight manual, but taught them to fly without the big screens. Now, the Co-Pilot can take over the controls in the tiny plane immediately, knowing exactly what the Senior Pilot would have done, without needing to relearn everything from scratch.

Why Does This Matter?

This is a huge step toward real-world robotics.

  • Tiny Robots: It allows us to put smart AI on small, battery-powered robots (like search-and-rescue drones or medical bots) that can learn and adapt while they are working, without needing a supercomputer in the cloud.
  • Safety: It means we can train a robot in a safe video game, then let it finish its training on the real factory floor or in a disaster zone, adapting to real-world surprises instantly.

In short, the authors built a bridge that lets us take the "smart" training from powerful computers and seamlessly transfer it to the "lightweight" brains of robots that live in the real world.