COOL-MC: Verifying and Explaining RL Policies for Platelet Inventory Management

This paper demonstrates the application of the COOL-MC tool to formally verify and explain a reinforcement learning policy for platelet inventory management, confirming its high safety performance and revealing its reliance on inventory age distribution through probabilistic model checking and counterfactual analysis.

Dennis Gross

Published 2026-03-04

Imagine you are the manager of a very special, high-stakes grocery store. But instead of selling apples or bread, you sell platelets—a type of blood cell that helps people clot and stop bleeding.

Here's the catch: These "groceries" are incredibly fragile. They expire in just five days.

The Impossible Balancing Act

Your job is to order just the right amount every day.

  • Order too much? The extra platelets rot before anyone can use them. This is a waste of a rare, life-saving resource and costs the hospital money.
  • Order too little? A patient arrives who needs a transfusion, but you have none. This is a life-or-death emergency.

In the past, humans tried to guess the right amount using math formulas. But demand is unpredictable (like a sudden flu outbreak), and the math gets too complicated. So, scientists turned to Artificial Intelligence (AI), specifically something called Reinforcement Learning (RL).

Think of the AI as a super-intelligent apprentice. You let it run the store for thousands of days in a computer simulation. It makes mistakes, gets "fined" for waste or shortages, and eventually learns a perfect strategy.

The "Black Box" Problem

Here's the trouble: The AI learns its strategy inside a neural network, which is like a giant, tangled ball of yarn. It knows what to do, but no one knows why.

  • If the AI orders 14 units on a Tuesday, is it because it's Tuesday? Because the weather is rainy? Or because it has 3 units of "fresh" stock and 2 units of "old" stock?
  • In a hospital, you can't just say, "Trust the robot." You need to know why it made that decision before you let it run a blood bank.

Enter COOL-MC: The AI Detective

This paper applies a tool called COOL-MC. Think of it as a detective and a translator rolled into one. It takes the AI's "black box" brain and turns it into a clear, transparent map that humans can read.

Here is how COOL-MC solves the mystery, using simple analogies:

1. The "Reachable Map" (Simplifying the Maze)

The AI's world is huge, with millions of possible scenarios. Checking every single one is like trying to read every page of every book in a library to find one sentence.

  • COOL-MC's Trick: It only builds a map of the places the AI actually visits. It ignores the empty, unused rooms. This makes the map small enough to analyze quickly.
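In code, the trick looks roughly like a breadth-first search that only follows the states the policy can actually produce. Everything below (the order-up-to rule, the tiny demand range, the state encoding) is an illustrative toy, not the paper's actual model:

```python
from collections import deque

def reachable_states(initial, policy, step):
    """Breadth-first search over only the states the policy can reach.

    initial: starting state; policy: maps state -> action;
    step: maps (state, action) -> list of possible successor states.
    """
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        s = frontier.popleft()
        for nxt in step(s, policy(s)):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Toy inventory: state = units in stock (0..10), "order up to 5" policy,
# daily demand of 0, 1, or 2 units. All assumptions for illustration.
policy = lambda stock: max(0, 5 - stock)
step = lambda stock, order: [max(0, stock + order - d) for d in (0, 1, 2)]

states = reachable_states(3, policy, step)
```

Starting from 3 units in stock, this toy policy only ever visits 3 of the 11 possible stock levels, which is exactly why the map stays small enough to check.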

2. The "Safety Inspector" (Checking the Rules)

Once the map is built, COOL-MC asks strict questions, like a safety inspector:

  • "What is the exact chance that we run out of blood in the next 200 days?"
  • "What is the chance we have too much blood that will rot?"
  • The Result: The AI's strategy was found to be very safe. It has only a 2.9% chance of running out and a 1.1% chance of having too much. It passed the test!
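The inspector's questions are written as formal probability queries, which a probabilistic model checker (such as Storm, which COOL-MC builds on) answers exactly on the reachable map. The sketch below only approximates such a query by simulation on a toy inventory chain; the demand distribution and the one-unit-per-day delivery cap are illustrative assumptions, not the paper's model:

```python
import random

# A PCTL-style safety query a model checker would answer exactly:
#   P=? [ F<=200 "stockout" ]   -- chance of running out within 200 days
# Here we only *estimate* it with Monte Carlo simulation on a toy chain.

def stockout_probability(days=200, runs=10_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        stock = 3
        failed = False
        for _ in range(days):
            stock += 1 if stock < 3 else 0   # at most one unit delivered/day
            demand = rng.randint(0, 2)       # assumed demand: 0, 1, or 2
            if demand > stock:
                failed = True
                break
            stock -= demand
        hits += failed
    return hits / runs

p = stockout_probability()
```

The key difference: simulation gives an estimate with sampling error, while model checking gives the exact probability, which is what makes numbers like 2.9% trustworthy enough for a blood bank.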

3. The "Feature Pruning" (The Blindfold Test)

To understand why the AI makes decisions, the researchers played a game of "What if?"

  • They put a blindfold on the AI by hiding one piece of information at a time (like hiding the "Day of the Week" or the "Age of the blood").
  • The Discovery: When they hid the age of the blood (how old the platelets are), the AI's performance crashed. It started running out of stock or wasting blood.
  • The Lesson: The AI learned that freshness is everything. It barely cares what day of the week it is; it cares deeply about whether the blood is 1 day old or 4 days old.
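The blindfold test can be sketched in a few lines: freeze one observation feature to a neutral value and see how much the decision changes. The feature names and the simple rule standing in for the neural policy are illustrative assumptions:

```python
def policy(obs):
    # Toy stand-in for the learned policy: restock toward 10 fresh units,
    # and order extra when too much of the stock is old.
    return max(0, 10 - obs["fresh"]) + (2 if obs["old"] >= 2 else 0)

def blindfold(obs, feature, neutral=0):
    """Re-query the policy with one feature hidden (set to a neutral value)."""
    masked = dict(obs, **{feature: neutral})
    return policy(masked)

obs = {"fresh": 3, "old": 2, "weekday": 2}
baseline = policy(obs)               # normal decision: order 9
without_age = blindfold(obs, "old")  # hiding blood age changes the order
without_day = blindfold(obs, "weekday")  # hiding the weekday changes nothing
```

In the paper the comparison is done on whole safety metrics, not single decisions, but the idea is the same: the feature whose removal hurts the most is the one the policy truly relies on.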

4. The "Time Traveler" (Counterfactuals)

Finally, they asked: "What if we forced the AI to order less?"

  • They took a specific order size (14 units) that the AI liked and forced it to order only 6 units instead, in all the situations where it usually ordered 14.
  • The Surprise: The safety numbers barely changed!
  • The Lesson: This means the AI was only ordering 14 units when it had a huge "safety buffer" of blood already. It wasn't being greedy; it was being cautious. If it had ordered less, it would have been just as safe.
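The counterfactual itself is just a wrapper around the policy: intercept one action and substitute another, then re-run the safety check on the modified policy. The order sizes and the tiny threshold policy below are illustrative assumptions, not COOL-MC's API:

```python
def counterfactual(policy, from_action, to_action):
    """Return a policy that acts like `policy`, except wherever it would
    choose `from_action`, it is forced to choose `to_action` instead."""
    def remapped(state):
        a = policy(state)
        return to_action if a == from_action else a
    return remapped

# Toy policy over total stock: it only places the big order of 14 units
# when the buffer is already large (>= 12 units on hand).
base = lambda stock: 14 if stock >= 12 else 6
forced = counterfactual(base, from_action=14, to_action=6)
```

Because the toy policy only orders 14 when the buffer is large, forcing those orders down to 6 still leaves plenty of stock, mirroring why the paper's safety numbers barely moved.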

The Big Takeaway

This paper proves that we can use AI to manage life-saving supplies, but we can't just trust the "magic box." We need tools like COOL-MC to:

  1. Verify that the AI won't kill anyone (Safety).
  2. Explain exactly what the AI is thinking (Transparency).

It turns a mysterious, scary robot into a transparent, auditable partner that doctors and blood bank managers can actually trust with human lives.
