Imagine a tiny, invisible school of 16 robotic fish trying to swim upstream in a human artery. But there's a catch: the blood isn't flowing steadily like a river. Instead, it pulses with every heartbeat, rushing forward fast, then slowing down, then briefly flowing backward, and repeating this cycle over and over.

This paper describes how the researchers trained these tiny robots, in computer simulation, to swim against this chaotic, pulsing current without getting swept away, wasting energy, or jerking around uncontrollably. They did this using a "smart teacher" system called Multi-Objective Multi-Agent Reinforcement Learning.

Here is the breakdown of their journey, explained through simple analogies:

1. The Problem: The "Scallop" Trap

At the microscopic size of these robots, water feels thick and sticky, like honey. If a robot tries to swim by opening and closing its "shell" (like a scallop), it goes nowhere: in a fluid this thick, closing the shell simply undoes whatever distance opening it gained. This is known as the "Scallop Theorem."

To move, they need to wiggle or spin in a specific, non-reciprocal way, meaning a motion that does not simply retrace itself in reverse. But when the river (blood) itself is surging forward and backward, it's incredibly hard to figure out the right move. If they just push hard upstream, the backward flow might slam them into the wall. If they try to hide, the forward rush might sweep them far away from the finish line.

2. The Solution: A Three-Headed Coach

The researchers didn't just tell the robots, "Go upstream!" They gave them a coach with three different goals (objectives) that often fight against each other:

Goal A (Progress): "Get to the finish line!"
Goal B (Energy): "Don't waste your battery!"
Goal C (Smoothness): "Don't jerk around; move gracefully."

Usually, trying to do all three at once confuses the robots. If they push hard to make progress, they waste energy and move jerkily. If they move smoothly, they might not make enough progress.

3. The Secret Sauce: "Gradient Surgery" (PCGrad)

This is the paper's most critical discovery. The researchers found that without a special tool called PCGrad (Projected Conflicting Gradient), the robots' brains would get confused.

Think of it like a car with three drivers fighting over the steering wheel:

Driver A yells, "Turn left!" (Progress)
Driver B yells, "Turn right!" (Energy)
Driver C yells, "Don't turn at all!" (Smoothness)

Without the surgery, the car would spin in circles or stall. The "surgery" is a mathematical trick that takes the conflicting instructions, cuts out the parts that fight each other, and keeps only the parts that work together. It's like a referee who says, "Driver A, you can turn left, but only as long as it doesn't ruin Driver B's fuel plan."

The paper's ablation test shows that without this surgery, learning breaks down. The energy-efficiency and smoothness scores collapse to near zero and progress becomes unstable, even though the robots are still trying to swim.

4. What the Robots Learned (The "Aha!" Moments)

The robots weren't told how to swim; they just learned by trial and error. Surprisingly, they invented three clever strategies that the researchers didn't program:

The "Traffic Jam" Trick (Phase 1): When the blood rushes forward at high speed (like a tsunami), the robots don't fight it. Instead, half of them stick to the bottom wall, and the other half stack on top of them. They form a two-layer "dam" across the tube. This slows the water down right next to them, preventing the current from blasting them away. They let the water push them gently downstream, but in a controlled way, rather than getting swept away.
The "Ratchet" Move (Phase 2): When the blood flow briefly reverses, the robots break their formation, spread out, and use the reversal to their advantage. With the current no longer pushing against them, they settle near the lower wall and advance upstream, effectively "ratcheting" themselves closer to the goal. It's like a climber who slides down a bit to get a better grip, then climbs higher.
The "Solo Sprint" (Phase 3): Once they are close to the finish line, they stop acting as a team. They scatter and swim individually to the end. The team formation was only needed to survive the dangerous middle part of the river.

5. The Result

The robots learned to:

Swim upstream successfully (Progress score: 6.5–7.0).
Save energy (Efficiency score: 0.63–0.65).
Move smoothly (Smoothness score: 0.97–0.99).

In contrast, robots that tried to just "push hard" (the brute-force method) scored more than 8 points lower on progress and had negative energy efficiency throughout training.

Summary

This paper shows that by using a smart learning system with a "conflict-resolution" tool (PCGrad), a simulated swarm of tiny robots can learn to navigate the pulsing blood flow of an artery. They learned to act like a team to slow down the water, then act like individuals to climb upstream, all while saving energy. The key takeaway is that, in this setting, the robots could not learn to balance several competing goals without a special method to stop those goals from fighting each other.

Technical Summary: Micro-Swarm Locomotion Optimization in Dynamic Flow using Multi-Objective Multi-Agent Reinforcement Learning

Problem Statement

Coordinating micro-robotic swarms in physiologically realistic, time-dependent fluid environments remains a significant challenge for biomedical and environmental applications. At microscopic scales, viscous forces dominate inertial effects, rendering reciprocal actuation ineffective (Purcell's "Scallop Theorem"). Furthermore, in oscillatory flows such as pulsatile arterial blood or pump-induced pipeline cycles, micro-swimmers face cyclic shear gradients, flow reversals, and transient boundary layers that can trap them in recirculation zones or force them against walls.

Existing control paradigms often rely on global actuation with model-predictive control (MPC) or decentralized bio-inspired heuristics. However, these approaches struggle with the computational cost of high-fidelity fluid simulations, the non-stationarity of oscillatory flows, and the difficulty of balancing competing objectives (e.g., upstream progression vs. energy conservation) without explicit inter-agent communication. According to the authors, no prior work has integrated multi-objective multi-agent reinforcement learning (MO-MARL) with high-fidelity, time-dependent Computational Fluid Dynamics (CFD) to address swarm locomotion in such dynamic regimes.

Methodology

The authors propose a hybrid CFD-MO-MARL framework that directly couples a high-fidelity incompressible Navier-Stokes solver with decentralized multi-agent reinforcement learning.

Physical Setup and Simulation

Domain: A 2 mm wide, 100 mm long 2D channel filled with blood-mimicking fluid ($\rho = 1060$ kg/m³, $\mu = 3 \times 10^{-3}$ Pa·s).
Flow Profile: A triphasic arterial waveform (1 Hz cycle) featuring a systolic peak of 400 mm/s, an early diastolic reversal (-15 mm/s), and a late diastolic forward flow (8 mm/s).
Swarm: 16 magnetically actuated micro-robots (modeled as spheres with $r=250$ µm) arranged in a grid. They are subject to hydrodynamic forces, drag, internal propulsive forces (bounded by physical magnetic actuation limits), and contact forces.
Solver: The simulation uses the PhiFlow framework with a semi-Lagrangian advection scheme and projection-based pressure correction on a uniform Cartesian grid ($\Delta x = 0.1$ mm).
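
To make the triphasic flow profile concrete, the inlet velocity can be pictured as a periodic piecewise-linear signal. The sketch below is illustrative only: the 1 Hz period and the three velocity levels (400, -15, and 8 mm/s) come from the setup above, while the knot times, the linear interpolation, and the function name `inlet_velocity_mm_s` are assumptions, not the authors' waveform.

```python
import numpy as np

PERIOD_S = 1.0  # 1 Hz cycle, as stated in the setup

# Knot times within one cycle are ASSUMED for illustration; only the three
# velocity levels (400, -15, 8 mm/s) are taken from the description.
KNOT_T = np.array([0.00, 0.15, 0.35, 0.45, 0.60, 1.00])  # seconds
KNOT_U = np.array([8.0, 400.0, 0.0, -15.0, 8.0, 8.0])    # mm/s


def inlet_velocity_mm_s(t):
    """Inlet velocity at time t (seconds), repeated every PERIOD_S."""
    phase = np.mod(t, PERIOD_S)              # fold time into one cycle
    return np.interp(phase, KNOT_T, KNOT_U)  # piecewise-linear waveform


if __name__ == "__main__":
    for t in (0.15, 0.45, 0.80):
        print(f"t = {t:.2f} s -> u = {float(inlet_velocity_mm_s(t)):.1f} mm/s")
```

Because the signal changes sign once per cycle, any policy trained on it sees a non-stationary environment even when its own actions are held fixed.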

Reinforcement Learning Framework

The control problem is formulated as a Multi-Agent Multi-Objective Markov Decision Process (MA-MOMDP) using a Centralized Training, Decentralized Execution (CTDE) paradigm with Proximal Policy Optimization (PPO).

State Space: Each agent observes local Cartesian coordinates, velocity components, and four pressure samples around its circumference. The critic utilizes the joint state of all agents.
Action Space: Each agent outputs a continuous 2D propulsive force vector.
Multi-Objective Reward: The system optimizes three concurrent objectives:
1. Progress: Upstream displacement against the flow.
2. Energy Efficiency: The ratio of instantaneous work done to maximum possible work.
3. Smoothness: Temporal consistency of actuation (cosine similarity between consecutive actions).
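
The three objectives above can be sketched as a per-agent reward vector. This is a minimal sketch under stated assumptions, not the paper's exact formulas: it assumes the mean flow points along +x (so upstream is -x), takes "work" as the instantaneous power F·v, and normalizes by the largest power available at the current speed. All function names and the `f_max` parameter are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero for stalled agents or zero actions


def progress_reward(x_prev, x_next):
    """Upstream displacement per step, ASSUMING the mean flow points along +x."""
    return x_prev - x_next


def energy_efficiency(force, velocity, f_max):
    """Instantaneous power F.v divided by the largest power available at this speed."""
    power = np.sum(force * velocity, axis=-1)
    max_power = f_max * np.linalg.norm(velocity, axis=-1)
    return power / (max_power + EPS)


def smoothness(action_prev, action_next):
    """Cosine similarity between consecutive actions, in [-1, 1]."""
    dot = np.sum(action_prev * action_next, axis=-1)
    norms = np.linalg.norm(action_prev, axis=-1) * np.linalg.norm(action_next, axis=-1)
    return dot / (norms + EPS)


def reward_vector(x_prev, x_next, force, velocity, action_prev, f_max):
    """Stack the three objectives per agent; result has shape (n_agents, 3)."""
    return np.stack(
        [
            progress_reward(x_prev, x_next),
            energy_efficiency(force, velocity, f_max),
            smoothness(action_prev, force),  # the action is the propulsive force
        ],
        axis=-1,
    )
```

Under this particular definition the efficiency term turns negative whenever an agent pushes against its own direction of motion. Keeping the three terms as a vector rather than a weighted sum is what allows a separate gradient to be computed per objective.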
Gradient Conflict Resolution: To address the structural conflict between objectives, the authors employ Projected Conflicting Gradient (PCGrad). When the gradients of two objectives conflict (their inner product is negative), PCGrad projects each one onto the plane orthogonal to the other, removing the opposing component before the gradients are combined. This prevents the dominant progress objective from destructively interfering with the energy and smoothness objectives.
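
The projection step can be illustrated with a small NumPy sketch of the standard PCGrad rule on flattened gradient vectors. It is a generic illustration, not the authors' training code: the function name `pcgrad`, the random ordering via `rng`, and the small stabilizing constant are assumptions, and how the paper wires the result into PPO is not specified here.

```python
import numpy as np


def pcgrad(grads, rng=None):
    """Combine per-objective gradients after removing mutually conflicting parts.

    grads: sequence of 1-D arrays, one flattened gradient per objective.
    Returns the sum of the projected gradients.
    """
    rng = np.random.default_rng() if rng is None else rng
    grads = [np.asarray(g, dtype=float) for g in grads]
    projected = []
    for i, g_i in enumerate(grads):
        g = g_i.copy()
        others = [j for j in range(len(grads)) if j != i]
        for j in rng.permutation(others):  # visit the other objectives in random order
            g_j = grads[j]
            dot = g @ g_j
            if dot < 0.0:  # conflict: the two gradients point in opposing directions
                # Remove the component of g that lies along g_j.
                g -= dot / (g_j @ g_j + 1e-12) * g_j
        projected.append(g)
    return np.sum(projected, axis=0)


if __name__ == "__main__":
    g_progress = np.array([1.0, 0.0])
    g_energy = np.array([-1.0, 1.0])  # conflicts with g_progress (dot = -1)
    print(pcgrad([g_progress, g_energy]))
```

For the two vectors in the usage example, plain summation gives [0, 1], discarding the progress direction entirely, whereas the projected sum is approximately [0.5, 1.5], which still has a non-negative inner product with both original gradients.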

Key Contributions

CFD-MO-MARL Integration: The paper presents the first framework coupling high-fidelity, time-dependent Navier-Stokes solvers with decentralized multi-objective multi-agent RL for micro-swarm control.
Necessity of Gradient Surgery: The study demonstrates that gradient conflict resolution (PCGrad) is a structural requirement, not an optional refinement, in this domain. Without it, energy efficiency and smoothness rewards collapse to near zero, and progress exhibits persistent instability.
Emergent Behavioral Strategies: The framework discovers complex, non-intuitive collective behaviors without explicit encoding in the reward function, including:
- Hydrodynamic Throttling: A two-layer formation that suppresses peak channel velocities during forward flow.
- Cycle-Synchronized Ratchet: A mechanism exploiting flow reversals for upstream repositioning.
- Individualized Final Approach: A transition to independent navigation as agents near the success boundary.

Results

Performance: The converged policy achieves a progress reward of 6.5–7.0, an energy efficiency of 0.63–0.65, and smoothness of 0.97–0.99. This represents an improvement of over 8 reward units in progress compared to brute-force baselines, which yield negative energy efficiency throughout training.
Ablation Study: Removing PCGrad results in the immediate collapse of energy and smoothness rewards within 10,000 steps and persistent large-amplitude oscillations in progress reward. This confirms that naive gradient summation fails to reconcile competing objectives in high-fidelity fluid environments.
Emergent Behaviors:
- Phase 1 (Forward Flow): The swarm forms a two-layer obstruction, reducing local fluid velocity from ~700 mm/s to ~400 mm/s, allowing passive downstream drift within a safe corridor.
- Phase 2 (Reverse Flow): The swarm disperses and re-anchors near the lower wall to advance upstream, acting as a ratchet.
- Phase 3 (Approach): As agents near the target, collective coordination dissolves into individualized navigation.

Significance and Claims

The paper claims to establish a scalable and physically grounded paradigm for micro-swarm control. By capturing time-dependent fluid-agent interactions directly within multi-objective RL loops, the approach offers a method for learning control strategies that respect physical constraints (incompressibility, momentum conservation) while discovering non-intuitive solutions.

The authors assert that this work bridges a critical gap in translating micro-robotic swarms to dynamic, physiological, and industrial environments. The results suggest that time-dependent fluid interactions can be managed without surrogate modeling, offering a template for control domains governed by PDE dynamics. The findings are positioned as applicable to biomedical navigation (e.g., targeted drug delivery in pulsatile vessels), environmental monitoring, and industrial microfluidics.

The study concludes that gradient conflict resolution is essential for stable learning in physically grounded MO-MARL systems where objectives carry heterogeneous gradient magnitudes, and that the discovered emergent behaviors represent a genuine policy discovery driven by the physical consistency of the coupled CFD environment.
