Machine Learning Guided Cooling System Optimization for Data Centers

This paper presents a three-stage, physics-guided machine learning framework applied to the Frontier exascale supercomputer that identifies significant cooling inefficiencies and demonstrates how safe, counterfactual setpoint adjustments can recover up to 96% of excess energy consumption while maintaining thermal limits.

Shrenik Jadhav, Zheng Liu

Published 2026-03-10

Imagine a massive, high-tech library called Frontier. This isn't a library for books, but for supercomputers that solve the world's hardest problems. To keep these giant brains from overheating, the library has a sophisticated air-conditioning and water-cooling system.

The problem? Even though this library is already very efficient, its cooling system is a bit like an overly cautious driver. It keeps the engine revving and the AC on full, just to be safe, even when the car is idling at a red light. This wastes a lot of electricity (and money) every year.

This paper is about teaching that cooling system to be a little smarter, using a "digital twin" (a virtual copy) powered by Machine Learning.

Here is the story of how they did it, broken down into three simple steps:

The Cast of Characters

  • The Supercomputer (Frontier): The heavy lifter. It eats electricity to do math and spits out heat.
  • The Cooling System: The hero that removes the heat. It uses pumps, fans, and water.
  • The "Digital Twin" (The AI): A smart computer program that learns exactly how the cooling system should behave based on physics.

Step 1: Building the "Perfect Student" (The Surrogate Model)

First, the researchers taught an AI model using one year of data from the Frontier supercomputer. They didn't just let the AI guess; they gave it rules (physics).

  • The Analogy: Imagine teaching a student to predict how much gas a car uses. You tell the student: "If you drive faster, you use more gas. If you carry a heavy load, you use more gas." You don't let the student guess that driving faster uses less gas.
  • What they did: They trained the AI to predict how much electricity the cooling system should be using based on how hot the computers are and how fast the water is flowing.
  • The Result: The AI became a "Perfect Student." It predicted the energy usage almost perfectly. If the real system used 100 units of energy, the AI said, "You should have used 98 units."
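The idea of "giving the student rules" can be sketched in code. This is a minimal, illustrative version (not the paper's actual model or data): a tiny regression trained with an extra penalty that forbids the unphysical answer "more heat or more flow means less cooling power", i.e. the learned response must stay monotonically increasing in both inputs.

```python
import numpy as np

# Hypothetical sketch of a physics-guided surrogate: predict cooling power
# from normalized heat load and water flow, with a penalty that pushes any
# negative (unphysical) weight back toward positive territory.

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))          # columns: [heat load, water flow]
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.standard_normal(500)

w = np.zeros(2)
b = 0.0
lr, lam = 0.1, 10.0                            # learning rate, physics-penalty weight
for _ in range(2000):
    err = X @ w + b - y
    grad_w = X.T @ err / len(y)
    # Physics rule: penalize negative weights, so more heat/flow can
    # never predict *less* cooling power.
    grad_w += lam * np.minimum(w, 0.0)
    w -= lr * grad_w
    b -= lr * err.mean()

assert (w > 0).all()                           # monotonic in both inputs
```

A real surrogate would be a richer model with more inputs, but the principle is the same: the physics constraint is baked into the training objective, not checked afterward.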

Step 2: Finding the "Wasted Gas" (Excess Monitoring)

Now that they had the "Perfect Student," they compared it to the real system.

  • The Analogy: You look at your car's trip computer. It says, "Based on your speed and weight, you should have used 10 gallons of gas." But your actual tank shows you used 12 gallons. That extra 2 gallons is wasted.
  • What they found: By looking at the data, they found that the Frontier cooling system was wasting about 85 MWh of energy over the year.
  • When did it happen? It wasn't random. The waste happened mostly in the winter and late summer, and mostly during the early morning hours when the computers were doing less work, but the cooling system was still running at full speed (like leaving the AC on in an empty house).
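The monitoring step itself is simple arithmetic once the surrogate exists. Here is a toy version with made-up hourly numbers (not the paper's data): the surrogate gives the "should have used" baseline, and anything the meter records above that baseline is flagged as excess.

```python
import numpy as np

# Illustrative excess-energy monitoring over one day.
hours = 24
predicted_kwh = np.full(hours, 100.0)          # surrogate: what the plant *should* draw
actual_kwh = predicted_kwh.copy()
actual_kwh[2:6] += 20.0                        # early-morning overcooling, like Frontier's

# Only count energy *above* the physics-based baseline as waste.
excess = np.clip(actual_kwh - predicted_kwh, 0.0, None)
print(f"excess energy today: {excess.sum():.0f} kWh")   # -> excess energy today: 80 kWh
```

Summed over a year of real telemetry, this is the calculation that surfaced the roughly 85 MWh of waste, and plotting `excess` against time of day and season is what revealed the winter and early-morning pattern.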

Step 3: The "What If" Game (Counterfactual Optimization)

This is the most exciting part. The researchers asked: "What if we had nudged the thermostat just a tiny bit?"

  • The Analogy: Imagine you are driving and you see a red light. You know you could have slowed down 5 seconds earlier to coast to a stop instead of braking hard. You didn't do it, but you know you could have saved fuel.
  • The Safety Guardrails: They couldn't just tell the system to turn off the AC. They set strict rules (guardrails):
    • "Don't let the computers get too hot."
    • "Don't turn off the pumps completely."
    • "Don't change the settings too fast, or the system might get confused."
  • The Experiment: They ran a simulation where they asked the AI: "If we raised the water temperature by just 0.2 degrees or slowed the water flow by 5%, how much energy would we have saved?"
  • The Result:
    • Theoretical Maximum: If they could change everything perfectly, they could have saved about 82 MWh (recovering 96% of the waste).
    • Realistic "Safe" Savings: When they added strict safety checks to make sure the changes were actually safe and not just computer glitches, they found they could still save about 13 to 15 MWh per year.

Why This Matters

Think of the cooling system as a tightrope walker.

  • Before: The walker was being super safe, staying in the middle of the rope, moving very slowly, and using a lot of energy to stay balanced.
  • After: The AI showed them that they could take tiny, safe steps closer to the edge (raising the temperature slightly, slowing the flow) without falling.

The Takeaway:
Even in a facility that is already considered "world-class" efficient, there is still a hidden layer of waste. By using Machine Learning to act as a smart, physics-aware coach, they found a way to save energy without risking the safety of the supercomputer. It's not about a giant overhaul; it's about making thousands of tiny, smart adjustments that add up to real savings.

In short: They taught the cooling system to stop over-reacting, saving money and energy while keeping the supercomputer cool and happy.