This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine a world where our most powerful weapons against bacteria—antibiotics—are slowly losing their edge. This is Antimicrobial Resistance (AMR). It's like a game of "Rock, Paper, Scissors" where the bacteria are learning to beat our rocks every time we use them too often.
The big question for doctors is: How do we use these drugs wisely? We need to cure the sick person in front of us right now, but if we use too many drugs, we create "superbugs" that will hurt everyone in the future. It's a balancing act between today's patient and tomorrow's community.
This paper by Joyce Lee and Seth Blumberg is like a high-tech flight simulator for doctors. Instead of testing new rules on real patients (which is risky), they built a computer game called abx_amr_simulator to see how different strategies play out over time.
Here is the breakdown of their adventure, using some everyday analogies:
1. The Game: The "Leaky Balloon"
The core of their simulation is a concept they call the "Leaky Balloon."
- The Balloon: Represents the resistance level of a specific antibiotic.
- Pumping Air: Every time a doctor prescribes that antibiotic, they pump air into the balloon. The balloon gets bigger (resistance goes up).
- The Leak: If the doctor stops prescribing it for a while, the air slowly leaks out, and the balloon shrinks (resistance goes down).
- The Goal: The doctor (the "Agent") wants to keep the balloon from popping (resistance becoming 100%) while still helping patients.
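The balloon dynamic above can be sketched in a few lines of Python. This is a toy version under assumed dynamics (a fixed "pump" increment per prescription and a proportional "leak" otherwise), not the paper's exact equations:

```python
# Toy "leaky balloon" resistance dynamics (an illustrative sketch, not the
# paper's actual model): prescribing pumps resistance up; abstaining lets
# it slowly decay back down.
def update_resistance(level, prescribed, pump=0.05, leak=0.02):
    """Return the next resistance level, clipped to [0, 1]."""
    if prescribed:
        level += pump          # pumping air into the balloon
    else:
        level -= leak * level  # the balloon slowly leaks
    return min(max(level, 0.0), 1.0)

# Prescribing every step inflates the balloon...
r = 0.10
for _ in range(10):
    r = update_resistance(r, prescribed=True)
print(round(r, 2))  # resistance has risen

# ...while a drug holiday lets it shrink back.
for _ in range(10):
    r = update_resistance(r, prescribed=False)
print(round(r, 2))  # resistance has partially recovered
```

The `pump` and `leak` values here are arbitrary; the point is only the asymmetry the paper exploits, that resistance goes up when you prescribe and drifts back down when you don't.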
2. The Players: The "Smart Pilots" (AI) vs. The "Old Maps" (Fixed Rules)
The researchers tested two types of "pilots" to see who could fly this plane best:
- The Old Maps (Fixed Rules): These are like doctors who follow a strict, unchanging rulebook.
- Rule A: "Always pick the drug with the lowest current resistance."
- Rule B: "Always pick the drug that works best for this specific patient."
- Problem: They don't learn. They don't adapt if the weather changes.
- The Smart Pilots (Reinforcement Learning): These are AI agents that learn by trial and error. They get points for curing patients and lose points if the balloon gets too big. They try to figure out the perfect long-term strategy.
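The two "Old Maps" rulebooks are simple enough to write down directly. A minimal sketch, assuming per-drug `resistance` and per-patient `efficacy` scores (illustrative names, not the paper's state variables):

```python
# The two fixed rulebooks as one-line policies over a set of drugs.
# `resistance[d]` = community resistance to drug d (assumed state),
# `efficacy[d]`   = how well drug d suits this particular patient.

def rule_a(resistance):
    """Rule A: always pick the drug with the lowest current resistance."""
    return min(resistance, key=resistance.get)

def rule_b(efficacy):
    """Rule B: always pick the drug that works best for this patient."""
    return max(efficacy, key=efficacy.get)

resistance = {"drug_1": 0.30, "drug_2": 0.10}
efficacy = {"drug_1": 0.90, "drug_2": 0.60}
print(rule_a(resistance))  # -> drug_2 (least resisted)
print(rule_b(efficacy))    # -> drug_1 (best for this patient)
```

Note that neither rule has any notion of the future: each one maps today's numbers to a choice, which is exactly why they can't adapt "if the weather changes."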
The researchers tested four different "flight conditions" (Experiment Sets) to see how the pilots handled different levels of difficulty.
3. The Four Flight Conditions
Condition 1: The Clear Day (Perfect Information)
- The Scenario: The pilot can see everything perfectly. They know exactly how sick every patient is and exactly how big the resistance balloons are right now.
- The Result: The "Smart Pilots" did okay, but they needed a special trick to win. A simple pilot (a "flat" AI that decides one moment at a time) got confused by the long-term consequences. However, a Hierarchical Pilot (an AI that thinks in "chapters" rather than just "moments") did great. It learned to plan ahead, realizing that saving a drug today helps tomorrow.
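"Thinking in chapters" can be sketched as a two-level controller: a high-level decision that holds for a block of steps, and low-level actions carried out inside that block. The mode names and the 10-step chapter length below are illustrative, not taken from the paper:

```python
# A sketch of hierarchical control: pick a "chapter" strategy once, then
# let it govern many primitive steps, instead of re-deciding every moment.

def choose_mode(resistance, high=0.7):
    """High-level decision, made once per chapter (names are illustrative)."""
    return "conserve" if resistance > high else "treat"

def run_chapter(mode, chapter_len=10):
    """Low-level decisions: primitive treat/hold actions within the chapter."""
    return [mode == "treat"] * chapter_len

actions = run_chapter(choose_mode(resistance=0.8))
print(actions.count(True))  # a "conserve" chapter prescribes nothing
```

The flat agent, by contrast, would re-evaluate at every single step and can get lost in the long chain of delayed consequences; committing to a chapter is what lets the balloon actually deflate.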
Condition 2: The Foggy Window (Delayed & Noisy Data)
- The Scenario: In the real world, doctors don't know the exact resistance levels instantly. They get reports that are old, blurry, or slightly wrong.
- The Twist: The researchers gave the AI a "memory" (like a human remembering past reports) to help it guess what's happening in the fog.
- The Surprise: The memory hurt the AI!
- Why? The "memoryless" AI learned a clever trick: "I only treat when I get a fresh report, then I stop until the next one." This gave the balloon time to leak out.
- The "memory" AI kept treating patients even when the data was old, which kept pumping air into the balloon. Sometimes, forgetting (or ignoring stale data) is better than remembering!
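The memoryless agent's trick amounts to gating treatment on report freshness. A minimal sketch, assuming a report arrives every few steps (the 5-step cadence and function name are illustrative):

```python
# Sketch of the "act only on fresh data" strategy the memoryless agent
# discovered: treat when a new resistance report has just arrived, then
# hold off until the next one, letting the balloon leak in between.

def should_treat(steps_since_report, freshness=0):
    """Treat only when the latest report is at most `freshness` steps old."""
    return steps_since_report <= freshness

# With a report every 5 steps, the agent treats in short bursts:
schedule = [should_treat(t % 5) for t in range(10)]
print(schedule)  # True only at report arrivals
```

Between bursts the agent does nothing, which is precisely the "drug holiday" that lets resistance decay; the memory-equipped agent kept acting on stale estimates and never gave the balloon that break.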
Condition 3: The Mixed Crowd (Patient Differences)
- The Scenario: Not all patients are the same. Some are very sick (High Risk), and some are barely sick (Low Risk).
- The Result: When the AI could tell the difference between a "High Risk" and "Low Risk" patient, it became a hero.
- It treated the sick ones aggressively.
- It didn't treat the healthy ones (saving the drugs for when they are really needed).
- The Cool Finding: The AI actually did better when it exaggerated the differences! If it treated the sick patients as even sicker and the healthy ones as even healthier, it became more sparing with the drugs overall. A slightly paranoid gatekeeper, it turns out, keeps the whole community safer.
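The exaggeration effect can be sketched with a simple risk-stratified policy. Everything here (the cutoff, the midpoint-stretching formula, the patient risks) is an illustrative assumption, not the paper's actual model:

```python
# Sketch of risk-stratified prescribing with an "exaggeration" knob.
# Perceived risk is stretched away from the midpoint; exaggerate > 1
# widens the gap between high- and low-risk patients.

def perceived_risk(risk, exaggerate=1.0):
    """Push risks away from 0.5, clipped to [0, 1]."""
    p = 0.5 + exaggerate * (risk - 0.5)
    return min(max(p, 0.0), 1.0)

def treat(risk, cutoff=0.4, exaggerate=1.0):
    """Prescribe only when perceived risk clears the cutoff."""
    return perceived_risk(risk, exaggerate) >= cutoff

patients = [0.42, 0.48, 0.70, 0.90]  # two borderline, two clearly sick

baseline = sum(treat(r) for r in patients)                  # honest estimates
cautious = sum(treat(r, exaggerate=2.0) for r in patients)  # exaggerated gap
print(baseline, cautious)  # exaggeration spares a borderline patient
```

The clearly sick still get treated either way; what the exaggeration changes is the borderline cases, which get pushed below the treatment cutoff, and that's where the drug savings come from.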
Condition 4: The Storm (Everything at Once)
- The Scenario: This was the hardest level. The AI had to deal with:
- 10 patients at once (a busy ER).
- Noisy, delayed data.
- Different types of patients.
- The Result: The Hierarchical Smart Pilots crushed the competition.
- They didn't just beat the "Old Maps"; they beat them by a huge margin.
- They cured more patients and kept the resistance balloons tiny.
- They learned to be conservative. They realized that in a storm, you don't waste fuel. They saved the antibiotics for the truly critical moments, creating a stable, low-resistance environment.
4. The Big Takeaways (The "Aha!" Moments)
- Thinking in Chapters Matters: Simple AI that just looks at the "now" fails. You need an AI that understands the "story" of the treatment (Hierarchical AI). It's the difference between a driver who only looks at the bumper in front of them vs. one who looks at the whole map.
- Sometimes, Less Info is More: In the foggy conditions, having a "memory" of old data made the AI worse. It was better to wait for a fresh signal and then act decisively.
- Risk Stratification is Key: If you can tell who is truly sick and who isn't, you can save the drugs. Even if your risk assessment isn't perfect, being slightly too cautious about who needs treatment actually helps the whole community.
- AI Can Learn Stewardship Without Being Told: The AI was only told to "cure the patient." It wasn't told "don't create superbugs." Yet, it figured out on its own that saving the drugs was the only way to keep curing patients in the long run.
5. The Caveats (The "But...")
The authors are honest: This is a simulator, not a real hospital.
- They simplified the bacteria (ignoring specific species).
- They assumed the world doesn't change drastically over time.
- They had one "central brain" making all decisions, unlike a real hospital with many different doctors.
Conclusion
This paper is a proof-of-concept. It shows that Artificial Intelligence can learn to be a better antibiotic steward than rigid rules, especially when the data is messy and the patients are different.
It suggests that in the future, we might use AI not just to predict which drug works, but to manage the entire ecosystem of antibiotic use, ensuring these life-saving drugs don't run out for the next generation. It's like teaching a new driver not just how to get to the store, but how to keep the car running for the next 100 years.