Enhancing sample efficiency in reinforcement-learning-based flow control: replacing the critic with an adaptive reduced-order model

Imagine you are trying to teach a robot how to fly a plane through a violent storm. The goal is to keep the plane steady while using as little fuel as possible.

In fluid dynamics (the study of how air and water move), engineers face problems of exactly this kind: reducing the drag on a car, for example, or stopping a bridge from shaking in the wind.

For a long time, there were two main ways to teach these robots, and both had big problems:

The "Physics Textbook" Way (Model-Based): You write down every single law of physics in a giant equation. The robot solves these equations to figure out what to do.
- The Problem: The equations are so huge and complex that solving them takes forever. It's like trying to calculate the trajectory of every single raindrop in a storm to steer the plane. It's too slow and expensive.
The "Trial and Error" Way (Model-Free Deep Reinforcement Learning): You let the robot fly the plane, crash it, fix it, fly it again, and crash it again. Over thousands of tries, it learns what works.
- The Problem: This is incredibly wasteful. The robot might need to crash 10,000 times before it learns to fly smoothly. In the real world, you can't afford to crash a plane (or a ship) that many times. Researchers call this "low sample efficiency."

The New Solution: The "Smart Sketch" Coach

This paper introduces a middle ground. Instead of using the giant physics textbook or letting the robot crash blindly, the authors created a "Smart Sketch" Coach.

Here is how it works, using a simple analogy:

1. The Problem with the Old "Coach" (The Critic)

In the old "Trial and Error" method, the robot has a "Coach" (called a Critic in AI terms) that watches the flight and says, "Good job!" or "Bad job!"

The Flaw: This Coach is a "Black Box." It's a giant neural network that learns to score each flight but has no built-in knowledge of the physics behind the score. It's like a coach who judges on gut feeling, and to develop a reliable gut feeling, the coach needs to watch thousands of crashes.

2. The New "Smart Sketch" (The Adaptive ROM)

The authors replaced the Black Box Coach with a Smart Sketch.

The Sketch: Imagine you are trying to draw a complex, swirling storm. Instead of drawing every single water molecule (which takes forever), you draw a few key swirls and lines that capture the essence of the storm. This is called a Reduced-Order Model (ROM). It's a simplified, fast version of the real physics.
The "Adaptive" Part: Here is the magic. As the robot flies, it collects new data, and the "Smart Sketch" updates itself after every flight. If the storm gets windier, the sketch changes to reflect that, so it keeps learning from the robot's latest experience.
The "Hybrid" Brain: The sketch has two parts:
- The Linear Part: A simple, fast calculator that handles the basic, predictable movements (like a car driving straight).
- The Neural Part (NODE): A small, flexible AI brain that learns the messy, unpredictable "jitters" and swirls that the simple calculator misses.

3. How the Training Works (The Loop)

Instead of crashing the plane thousands of times, the process looks like this:

Fly: The robot flies a short distance in the real simulation (or wind tunnel) and collects data.
Update the Sketch: The "Smart Sketch" looks at that new data and updates its drawing to be more accurate.
Practice in the Sketch: The robot practices its maneuvers inside the fast, simplified sketch. Because the sketch is so cheap to run, the robot can rehearse many flights in the time it takes to fly one real flight.
Optimize: The robot finds the perfect way to fly using the sketch.
Repeat: It goes back to the real world, tries the new trick, collects more data, and updates the sketch again.

Why This is a Big Deal

The authors tested this on two classic problems:

Smoothing out air over a flat plate (Blasius boundary layer): This is like trying to keep the air flowing smoothly over a car hood.
- Result: The new method found an effective controller after just one round of data collection. A comparable model-free AI method needed 50 or more rounds.
Stopping a square cylinder from shaking (Square cylinder wake): This is like trying to stop a square building from wobbling in the wind.
- Result: The new method reduced drag (by calming the swirling wake that causes the shaking) more than the old AI methods, while using far less data. It needed only 4 sensors, whereas comparable results in the literature relied on 42 to 151 sensors.

The Takeaway

Think of it like learning to ride a bike.

Old AI: You fall off 1,000 times until your muscles remember how to balance.
Old Physics: You spend 10 years studying the physics of balance, but you never actually get on the bike because the math is too hard.
This New Method: You get on the bike, fall once, and a smart coach (the Sketch) draws a map of why you fell and shows you how to balance next time. You learn in a handful of tries what used to take thousands.

This paper shows that by combining simple physics with smart, learning AI, we can control complex fluid flows with far fewer expensive trials than model-free AI needs.

1. Problem Statement

Deep Reinforcement Learning (DRL) has shown promise in active flow control but suffers from poor sample efficiency. Traditional model-free DRL algorithms (e.g., PPO, SAC, TD3) rely on a "critic" network (a deep neural network) to approximate the value function. This critic acts as a black box, lacking physical constraints and requiring massive amounts of interaction data (often thousands of episodes) to converge. This data hunger makes DRL computationally prohibitive for high-fidelity Computational Fluid Dynamics (CFD) simulations where each evaluation is expensive.

The core challenge is to design a control framework that:

Retains the ability to handle nonlinear flow dynamics.
Drastically reduces the number of CFD interactions (samples) required for training.
Bridges the gap between model-based control (which is data-efficient but often linear) and model-free DRL (which is flexible but data-hungry).

2. Methodology: Adaptive ROM-Based RL Framework

The authors propose a novel Model-Based Reinforcement Learning (MBRL) framework where the traditional DRL critic is replaced by an Adaptive Reduced-Order Model (ROM). Instead of learning a value function, the agent learns a dynamic model of the flow to perform gradient-based policy optimization via differentiable simulation.

A. The Adaptive ROM Architecture (NODE-OpInf-ROM)

The ROM is a hybrid model designed to capture both linear and nonlinear flow dynamics:

Linear Component (Operator Inference - OpInf): A linear dynamical system ( $\dot{\mathbf{q}}_r = \mathbf{A}_r \mathbf{q}_r + \mathbf{B}_r a(t)$ ) is identified from data using Operator Inference. This captures the dominant linear physics of the flow.
Nonlinear Correction (Neural ODE - NODE): A Neural Ordinary Differential Equation ( $\mathbf{F}_\omega(\mathbf{q}_r, a)$ ) is trained to learn the residual nonlinear dynamics that the linear model misses.
Adaptive Update Loop:
- The linear operators ( $\mathbf{A}_r, \mathbf{B}_r$ ) are identified from an initial dataset and frozen.
- The nonlinear NODE parameters ( $\omega$ ) are continuously updated as new data is collected from the CFD environment.
- This allows the model to adapt to the flow regime changes induced by the controller without retraining the entire model from scratch.
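The linear identification step can be illustrated with a small sketch. It assumes the hybrid ROM has the additive form $\dot{\mathbf{q}}_r = \mathbf{A}_r \mathbf{q}_r + \mathbf{B}_r a + \mathbf{F}_\omega(\mathbf{q}_r, a)$, which is how the "residual" description above reads. The synthetic two-state system, the actuation signal, and the helper names (`fit_linear_opinf`, `rk4_step`) are hypothetical choices made for illustration, not the authors' code. The operators are fitted by regularized least squares, and the leftover residual is what a NODE term would then be trained to reproduce.

```python
import numpy as np

def fit_linear_opinf(Q, a, dt, reg=1e-8):
    """Regularized least-squares fit of dq/dt ~ A_r q + B_r a.

    Q : (r, m) reduced states sampled every dt
    a : (m,)   scalar actuation at the same instants
    """
    dQ = np.gradient(Q, dt, axis=1)          # finite-difference time derivative
    D = np.vstack([Q, a[None, :]])           # stacked data [q; a], shape (r+1, m)
    G = D @ D.T + reg * np.eye(D.shape[0])   # regularized normal equations
    O = np.linalg.solve(G, D @ dQ.T).T       # O = [A_r, B_r], shape (r, r+1)
    return O[:, :-1], O[:, -1:]

def rk4_step(f, t, q, dt):
    """One classical Runge-Kutta 4 step for dq/dt = f(t, q)."""
    k1 = f(t, q)
    k2 = f(t + 0.5 * dt, q + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, q + 0.5 * dt * k2)
    k4 = f(t + dt, q + dt * k3)
    return q + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Hypothetical "true" system standing in for CFD data (assumption).
A_true = np.array([[-0.1, 1.0], [-1.0, -0.1]])
B_true = np.array([[0.0], [1.0]])

def actuation(t):
    return np.sin(0.7 * t) + 0.5 * np.sin(2.3 * t)

def truth(t, q):
    return A_true @ q + B_true[:, 0] * actuation(t)

dt, m = 0.01, 2000
t = dt * np.arange(m)
Q = np.empty((2, m))
Q[:, 0] = [1.0, 0.0]
for k in range(m - 1):
    Q[:, k + 1] = rk4_step(truth, t[k], Q[:, k], dt)

# Identify the linear operators once; in the framework they are then frozen.
A_r, B_r = fit_linear_opinf(Q, actuation(t), dt)

# What the linear model cannot explain: the regression target for the NODE term.
residual = np.gradient(Q, dt, axis=1) - A_r @ Q - B_r @ actuation(t)[None, :]
print(np.round(A_r, 3), np.round(B_r, 3), np.abs(residual).max())
```

Because the synthetic data here is purely linear, the residual is close to zero. On real flow data the residual would carry the nonlinear dynamics, and only the NODE parameters $\omega$ would be refitted as new episodes arrive.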

B. Controller Optimization via Differentiable Simulation

Differentiable Solver: The ROM is solved using a differentiable time-integration scheme (Runge-Kutta 4).
Gradient-Based Optimization: The controller parameters ( $\theta$ ) are optimized by backpropagating gradients through the ROM solver to minimize a cost function (e.g., drag or disturbance energy).
Iterative Process:
1. Deploy current policy in CFD to collect data.
2. Update the NODE component of the ROM with new data.
3. Optimize the controller on the updated ROM using gradient descent (Adam optimizer).
4. Repeat until convergence.
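Step 3 of the loop above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the ROM is a hypothetical two-state linear model with no NODE term, the controller is a single proportional gain `k`, and the cost weights are arbitrary. The gradient is obtained by integrating the sensitivity $\partial \mathbf{q}/\partial k$ through the same RK4 steps (forward-mode differentiation written by hand), swapped in here for the reverse-mode automatic differentiation a framework would provide. The Adam update follows the standard formulas.

```python
import numpy as np

# Hypothetical unstable 2-state ROM (assumption, for illustration only).
A = np.array([[0.1, 1.0], [-1.0, 0.1]])
B = np.array([0.0, 1.0])      # actuator input direction
C = np.array([0.0, 1.0])      # sensor reads the second state
Q0 = np.array([1.0, 0.0])     # initial reduced state

def rhs(z, k):
    """Augmented field for the state q and its sensitivity s = dq/dk."""
    q, s = z[:2], z[2:]
    y = C @ q
    dq = A @ q - B * (k * y)                  # control law a = -k * y
    ds = A @ s - B * (k * (C @ s)) - B * y    # derivative of dq w.r.t. k
    return np.concatenate([dq, ds])

def cost_and_grad(k, dt=0.05, n_steps=400, w=0.1):
    """Cost J = sum dt*(|q|^2 + w*a^2) and its exact derivative dJ/dk."""
    z = np.concatenate([Q0, np.zeros(2)])
    J, dJ = 0.0, 0.0
    for _ in range(n_steps):
        q, s = z[:2], z[2:]
        y, dy = C @ q, C @ s
        a, da = -k * y, -y - k * dy
        J += dt * (q @ q + w * a * a)
        dJ += dt * (2.0 * (q @ s) + 2.0 * w * a * da)
        # Classical RK4 step on the augmented system.
        k1 = rhs(z, k)
        k2 = rhs(z + 0.5 * dt * k1, k)
        k3 = rhs(z + 0.5 * dt * k2, k)
        k4 = rhs(z + dt * k3, k)
        z = z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return J, dJ

def optimize_gain(k=0.0, lr=0.05, iters=100, b1=0.9, b2=0.999, eps=1e-8):
    """Adam updates on the scalar gain, driven by the solver gradient."""
    m = v = 0.0
    for t in range(1, iters + 1):
        _, g = cost_and_grad(k)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        k -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return k

k_opt = optimize_gain()
J_init, _ = cost_and_grad(0.0)
J_opt, _ = cost_and_grad(k_opt)
print(f"gain {k_opt:.3f}: cost {J_init:.1f} -> {J_opt:.1f}")
```

In the full framework this inner optimization would run on the refreshed ROM after each CFD episode, and the resulting policy would be redeployed in CFD to collect the next batch of data.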

C. Low-Dimensional Representations

The framework supports two types of state representations:

POD-ROM: Uses Proper Orthogonal Decomposition coefficients (requires full-field data).
SS-ROM (Sparse Sensor ROM): Uses measurements from a few sparse sensors (more practical for experiments).
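The two representations can be contrasted with a short sketch. The travelling-wave snapshot matrix and the sensor locations below are hypothetical stand-ins for CFD data. POD modes are computed with a singular value decomposition of the mean-subtracted snapshots, which is the standard construction, while the sparse-sensor state simply reads the field at a few chosen points.

```python
import numpy as np

def pod_basis(X, r):
    """Mean field, first r POD modes, and singular values of snapshots X (n, m)."""
    mean = X.mean(axis=1, keepdims=True)
    U, S, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :r], S

def pod_coefficients(X, mean, modes):
    """Project snapshots onto the POD modes: needs the full field X."""
    return modes.T @ (X - mean)

def sensor_state(X, sensor_idx):
    """Sparse-sensor state: the field sampled at a few grid points only."""
    return X[sensor_idx, :]

# Hypothetical snapshot data: two travelling waves on a periodic grid.
x = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
t = np.linspace(0.0, 4.0 * np.pi, 100)
phase = x[:, None] - t[None, :]
X = np.sin(phase) + 0.3 * np.sin(2.0 * phase)

mean, modes, S = pod_basis(X, r=4)
q_pod = pod_coefficients(X, mean, modes)         # shape (4, 100)
X_rec = mean + modes @ q_pod                     # rank-4 reconstruction
energy = np.cumsum(S ** 2) / np.sum(S ** 2)      # cumulative energy per mode

sensors = np.array([10, 60, 110, 160])           # 4 hypothetical sensor indices
q_ss = sensor_state(X, sensors)                  # shape (4, 100)
print(q_pod.shape, q_ss.shape, np.abs(X - X_rec).max())
```

Both states have the same dimension here, but only the sparse-sensor version can be measured without access to the whole flow field, which is why it is the more practical option for experiments.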

3. Key Contributions

Critic Replacement: The primary innovation is replacing the black-box DRL critic with a physics-informed, adaptive ROM. This shifts the learning burden from "learning a value function from scratch" to "learning a dynamic model," which is significantly more sample-efficient.
Hybrid Modeling Strategy: The combination of OpInf (for linear dynamics) and NODE (for nonlinear residuals) provides a robust structure that leverages physical insights while maintaining data-driven flexibility.
Adaptive Refinement: The framework introduces an iterative loop where the ROM is refined online, allowing the controller to improve as the model accuracy increases, rather than relying on a static surrogate.
Differentiable Simulation: The integration of automatic differentiation with ROMs enables efficient, gradient-based controller optimization, avoiding the need for gradient-free black-box optimization methods.

4. Results and Validation

The framework was validated on two canonical flow control problems:

Case 1: Blasius Boundary Layer (Convectively Unstable Flow)

Goal: Suppress Tollmien-Schlichting (TS) waves.
Setup: Linear flow dynamics (small perturbations).
Performance:
- Training reduced to a single episode of system identification followed by controller optimization.
- The resulting controllers (Proportional, 1st-order, 2nd-order) significantly outperformed traditional Linear-Quadratic-Gaussian (LQG) designs based on the Eigensystem Realization Algorithm (ERA).
- H2 Norm Reduction: The proposed method achieved a 22.5% reduction in the H2 norm compared to ERA-based controllers for proportional control, and successfully stabilized higher-order controllers that were unstable with ERA.
- Comparison: Achieved performance comparable to DRL methods (Xu & Zhang, 2023) but with minimal data (1 episode vs. 50+ episodes).
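For readers unfamiliar with the $H_2$ metric used above: for a stable linear system $\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} + \mathbf{B}d$, $y = \mathbf{C}\mathbf{x}$, the $H_2$ norm equals $\sqrt{\mathrm{tr}(\mathbf{C}\mathbf{P}\mathbf{C}^T)}$, where $\mathbf{P}$ solves the Lyapunov equation $\mathbf{A}\mathbf{P} + \mathbf{P}\mathbf{A}^T + \mathbf{B}\mathbf{B}^T = 0$. The sketch below evaluates it for a hypothetical lightly damped two-state system with and without proportional feedback. The matrices and the gain are illustrative assumptions and do not reproduce the paper's figures.

```python
import numpy as np

def h2_norm(A, B, C):
    """H2 norm of the stable LTI system (A, B, C) via a Lyapunov equation."""
    A, B, C = np.atleast_2d(A), np.atleast_2d(B), np.atleast_2d(C)
    if np.linalg.eigvals(A).real.max() >= 0.0:
        raise ValueError("H2 norm is only finite for a stable system")
    n = A.shape[0]
    I = np.eye(n)
    # Vectorized form of A P + P A^T = -B B^T.
    K = np.kron(I, A) + np.kron(A, I)
    P = np.linalg.solve(K, -(B @ B.T).reshape(-1)).reshape(n, n)
    return float(np.sqrt(np.trace(C @ P @ C.T)))

# Hypothetical lightly damped oscillator (assumption, not the Blasius model).
A_ol = np.array([[-0.05, 1.0], [-1.0, -0.05]])
B_d = np.array([[1.0], [0.0]])     # disturbance input
B_u = np.array([[0.0], [1.0]])     # actuator input
C_y = np.array([[0.0, 1.0]])       # measured output

gain = 1.0                          # illustrative proportional gain
A_cl = A_ol - gain * (B_u @ C_y)    # closed loop with a = -gain * y

h2_open = h2_norm(A_ol, B_d, C_y)
h2_closed = h2_norm(A_cl, B_d, C_y)
reduction_pct = 100.0 * (1.0 - h2_closed / h2_open)
print(f"H2 open {h2_open:.3f}, closed {h2_closed:.3f}, reduction {reduction_pct:.1f}%")
```

A percentage reduction such as the one reported above is computed in the same way, as the relative drop in this norm between two closed-loop designs.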

Case 2: Flow Past a Square Cylinder (Globally Unstable Flow)

Goal: Drag reduction via vortex shedding suppression.
Setup: Nonlinear flow dynamics ($Re=100$).
Performance:
- Using only 4 sparse sensors for feedback, the method achieved a 7.2% drag reduction.
- Sample Efficiency: The optimal policy was identified within 3 to 4 episodes (approx. 43 vortex shedding periods).
- Comparison:
  - Outperformed model-free DRL baselines (TD3 and SAC), which failed to converge or required significantly more data and sensors (e.g., 42–151 sensors in literature) to achieve similar results.
  - Surpassed a POD-Galerkin ROM-based controller (Lasagna et al., 2016) which only achieved 3.6% drag reduction.
- Training Modes: A "Mixed" training strategy (starting with Open-Loop training, then switching to Closed-Loop) proved most effective for stability and convergence.

5. Significance and Implications

Bridging the Gap: This work successfully bridges the gap between model-based control (efficient but limited to linear/simple models) and model-free DRL (powerful but data-inefficient).
Practical Viability: By demonstrating success with sparse sensors and few episodes, the method moves active flow control closer to real-world engineering applications where data collection is expensive and sensors are limited.
Scalability: The framework addresses the "curse of dimensionality" in flow control by operating in a low-dimensional latent space while preserving the ability to learn complex nonlinearities.
Future Directions: The authors note that while the current work focuses on 2D laminar flows, the framework is a foundation for tackling 3D turbulent flows, potentially by incorporating stochastic ROMs and ensemble methods to handle chaos and noise.

In summary, this paper uses adaptive, physics-informed reduced-order models as the core engine for reinforcement learning in flow control, and reports control performance that matches or exceeds model-free deep reinforcement learning baselines with a fraction of the data they typically require.