Reinforcement learning for closed-loop optimisation of spatiotemporal stimulation in patterned neuronal networks

This paper presents a low-cost, open-source, closed-loop reinforcement learning system for efficient, goal-directed optimization of spatiotemporal stimulation patterns in topologically constrained in vitro neuronal networks. The authors characterize the networks' state-dependent responses and demonstrate that learning agents can identify non-trivial stimulation strategies to evoke specific target activity motifs.

Original authors: Maurer, B., Vasiliauskaite, V., Hengsteler, J., Cathomen, G., Ruff, T., Schmid, C., Vörös, J., Ihle, S. J.

Published 2026-04-16

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a tiny, living city made of neurons (brain cells) growing on a special chip. This city has a specific layout, like a roundabout with four exits. Your goal is to send electrical "messages" (stimuli) to this city to make the traffic (spikes) flow in a perfect, clockwise circle around the roundabout.

The problem? This city is chaotic. It's not a simple machine where you push a button and get a predictable result. The neurons react differently depending on what happened a split second ago, and there are millions of possible ways to send those messages. Trying to find the perfect message by guessing randomly would take forever—like trying to find a specific grain of sand on a beach by picking one up every second.

Here is how the researchers solved this using a "Smart Coach" (Reinforcement Learning):

1. The Setup: A Living Circuit Board

The scientists grew brain cells on a grid of electrodes (microelectrode arrays). They used tiny, invisible walls (microfluidic channels) to force the cells to grow in a specific shape—a loop. This makes the "city" easier to understand, but it's still a living, breathing system that changes over time.

2. The Challenge: The "Black Box"

If you zap one part of the city, the neurons might fire in a circle. But if you zap the same spot 10 seconds later, they might not. Why? Because the neurons have a "memory" of what just happened.

  • The Analogy: Imagine trying to teach a dog to sit. If you say "Sit" when the dog is already tired, it might ignore you. If you say it when the dog is excited, it might jump. The dog's reaction depends on its current state. The researchers needed a way to figure out not just what to say, but when to say it based on how the dog (the network) was feeling.

3. The Solution: The Reinforcement Learning Agent

Instead of a human guessing, they built a computer program—an AI Agent—to act as a coach.

  • The Loop: The AI sends a signal (a "stimulus") to the chip.
  • The Reaction: The chip sends back a video of the neurons firing (the "response").
  • The Score: The AI checks the video. Did the neurons fire in a perfect clockwise circle? If yes, the AI gets a high score (reward). If not, it gets a low score.
  • The Learning: The AI tries different combinations of signals and uses the scores to improve its choices. Over time, it learns: "Ah, when I zap electrode A and then wait 2 milliseconds before zapping electrode B, the neurons dance in a circle!"
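The stimulate-score-learn loop above can be sketched as a minimal value-learning agent. Everything here is an illustrative assumption, not the authors' implementation: the electrode names, the epsilon-greedy exploration, and the reward function that counts how many evoked spikes match the clockwise target order.

```python
import random
from collections import defaultdict

ELECTRODES = ["A", "B", "C", "D"]   # hypothetical stimulation sites
EPSILON, ALPHA = 0.1, 0.5           # exploration rate, learning rate

# Expected reward for each candidate stimulus (action), learned from experience.
q_values = defaultdict(float)

def reward(response, target=("A", "B", "C", "D")):
    """Score how well an evoked spike sequence matches the clockwise target.
    `response` is a hypothetical ordered list of electrodes that fired."""
    hits = sum(r == t for r, t in zip(response, target))
    return hits / len(target)

def choose_action():
    """Epsilon-greedy: usually pick the best-known stimulus, sometimes explore."""
    if random.random() < EPSILON or not q_values:
        return random.choice(ELECTRODES)
    return max(q_values, key=q_values.get)

def learn(action, r):
    """Nudge the value estimate for this stimulus toward the observed reward."""
    q_values[action] += ALPHA * (r - q_values[action])
```

In practice the action space is far richer (multi-electrode, multi-timing patterns), but the core cycle of stimulate, score against the target motif, and update is the same.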

4. The Speed: Milliseconds Matter

Previous experiments were slow, like sending a letter and waiting a week for a reply. This new system is a real-time conversation.

  • The AI sends a signal, waits for the neurons to react (about 20 milliseconds), and decides the next move instantly.
  • It's like a game of ping-pong played at lightning speed, where the AI learns the perfect rhythm to keep the ball (the signal) moving in a circle.
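One cycle of that real-time conversation can be sketched as follows. The ~20 ms response window comes from the text above; the `stimulate`, `record`, and `decide` callables are placeholders for the hardware and agent interfaces, which this explainer does not specify.

```python
import time

RESPONSE_WINDOW_S = 0.020  # ~20 ms to wait for the evoked response (from the text)

def closed_loop_step(stimulate, record, decide):
    """One closed-loop cycle: stimulate, record the evoked response within the
    window, then let the agent choose the next stimulus immediately."""
    stimulate()
    deadline = time.monotonic() + RESPONSE_WINDOW_S
    response = record(until=deadline)   # placeholder recording interface
    return decide(response)             # placeholder agent policy
```

The point of the design is that each full cycle completes in tens of milliseconds, so thousands of learning iterations fit into a single experimental session.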

5. The Surprising Discovery: It's Not What You Expect

The researchers thought the AI would eventually learn to zap the electrodes in a perfect clockwise order (A, then B, then C, then D) to match the circle.

  • The Reality: The AI found a much weirder, more complex solution. It discovered that sometimes it needed to zap the electrodes in a chaotic order, or skip some entirely, to get the neurons to cooperate.
  • The Metaphor: Imagine trying to get a group of people to walk in a circle. You might think you need to tell them "Left, Left, Left." But the AI found that sometimes you need to yell "Right, Stop, Jump, Left" to get the group to move in a circle because of how they are reacting to each other. The AI found the "secret handshake" that works, even if it looks random to us.

6. The "State" Factor

The researchers also discovered that the neurons' reaction depends heavily on what happened just before.

  • The Analogy: If you ask a friend a question, their answer depends on whether they were just laughing or just crying. The AI learned to use this "mood" (state) to its advantage. It learned that if the network was in "Mood X," it should try "Signal Y," but if the network was in "Mood Z," it should try "Signal W."
  • While the AI got better at using these moods, the simplest strategy (just finding the one best signal) still worked almost as well as the complex mood-sensing strategy.
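A state-dependent policy like the one described can be sketched by indexing the learned values on the network's current "mood" as well as the stimulus. The state labels and signal names below are hypothetical placeholders, not quantities from the paper.

```python
from collections import defaultdict

STATES = ["quiet", "bursting"]        # hypothetical network "moods"
ACTIONS = ["signal_W", "signal_Y"]    # hypothetical stimulation patterns

# Q[(state, action)] -> expected reward; the state-blind strategy would
# instead keep a single value per action, ignoring the mood entirely.
q = defaultdict(float)

def best_action(state):
    """Pick the stimulus with the highest learned value for this mood."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

def update(state, action, r, alpha=0.5):
    """Move the (state, action) value estimate toward the observed reward."""
    q[(state, action)] += alpha * (r - q[(state, action)])
```

The comparison the authors report maps onto this sketch directly: condition the table on the state, or collapse it to one row, and see how much reward the extra context actually buys.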

Why Does This Matter?

This isn't just about making neurons dance in circles. It's a new tool for understanding the brain.

  • For Science: It gives us a way to systematically map how brain circuits work without needing to know every single connection beforehand.
  • For Medicine: It could help design better brain implants for people with epilepsy or Parkinson's. Instead of a doctor guessing the right electrical pattern to stop a seizure, an AI could learn the perfect pattern for that specific patient's brain in real-time.

In short: The researchers built a fast, smart robot coach that learned to talk to a living brain chip. It figured out the secret code to make the brain cells move in a circle, proving that we can use AI to understand and control the complex, messy world of biology.
