Safe Decentralized Operation of EV Virtual Power Plant with Limited Network Visibility via Multi-Agent Reinforcement Learning

Imagine a bustling city where everyone is driving electric cars (EVs) and installing solar panels on their roofs. This is great for the environment, but it creates a chaotic traffic jam for the electricity grid. If too many people plug in their cars at the same time, or if the solar panels suddenly stop producing power when clouds roll in, the voltage in the neighborhood can spike or crash. This is like a water pipe system: if everyone turns on their hoses at once, the pressure drops, and the water stops flowing properly.

To fix this, we need a Virtual Power Plant (VPP). Think of a VPP as a smart, invisible conductor leading an orchestra of thousands of small energy sources (solar panels, batteries, and EV chargers) to play in harmony.

However, there's a big problem: The Conductor is Blindfolded.

In the real world, the VPP doesn't know the exact status of every single wire and lightbulb in the neighborhood. Privacy laws and security rules mean the grid operator only gives the VPP a vague, "foggy" picture of what's happening nearby. It's like trying to conduct an orchestra while hearing only the front row, with the back row out of sight and out of earshot. If the conductor guesses wrong, the music (the grid) could crash, causing blackouts or damaging equipment.

The Solution: The "Super-Intelligent" Conductor

The authors of this paper, Chenghao Huang and his team, built a new kind of conductor using Artificial Intelligence (AI). They call it TL-MAPPO. Let's break down what makes it special using a few analogies:

1. The "Time-Traveling" Memory (The Transformer)

Most AI agents are like people with short-term memory; they only react to what's happening right now. If the price of electricity is high now, they stop charging. But they don't realize that the price might drop in 10 minutes.

The authors added a Transformer layer to their AI. Think of this as giving the conductor a super-memory that can replay the recent past. It doesn't just look at the current moment; it looks at a recent window of data (prices, loads, charging demand) to understand the pattern. It can learn regularities such as, "Ah, every evening, everyone comes home and plugs in, so I should start preparing the grid before that happens." This helps the AI make smarter, long-term decisions.

2. The "Safety Net" (Lagrangian Regularization)

In the past, AI agents were told: "Try to save money, but don't break the rules." But AI is tricky; it often finds a way to save money by barely breaking the rules, which is dangerous for the grid.

The authors added a Lagrangian system, which acts like a strict referee with a red card.

If the AI tries to save money but risks a voltage crash, the referee immediately slaps a heavy "fine" (a mathematical penalty) on the AI's score.
The AI learns quickly: "I can't cut corners here. I must prioritize safety to win the game."
Crucially, this referee is smart. It doesn't just say "No"; it adjusts the penalty dynamically, teaching the AI exactly how much safety is needed to keep the grid stable without being too wasteful.

3. The "Team of Local Captains" (Multi-Agent Learning)

Instead of one giant brain trying to control every single charger (which is too slow and complex), the system uses a team of local captains.

Each EV charging station has its own AI "captain."
They are trained together in a simulation (Centralized Training) where they can see everything and learn from each other.
But when it's time to work in the real world (Decentralized Execution), they go back to their own stations. They only look at their local "foggy" view (limited data) and make their own decisions based on what they learned.
It's like a sports team practicing together, but during the game, each player has to react to their own position on the field without waiting for the coach to shout instructions.

The Results: A Smoother Ride

The team tested this new system on a realistic model of a power grid (the IEEE 33-bus system). Here is what happened compared to older AI methods:

Fewer Voltage Problems: The new system reduced voltage violations (the "crashes") by about 45%. It kept the "water pressure" in the pipes much more stable.
Cheaper Bills: It saved about 10% on operational costs. By using its "super-memory," it learned to charge cars when electricity was cheap and to hold back when it was expensive.
Happier Drivers: It ensured that EVs got charged on time, so drivers didn't leave with empty batteries.

The Bottom Line

This paper presents a way to manage the chaotic energy needs of electric cars and solar panels without needing a perfect, all-seeing view of the power grid. By combining smart memory (Transformers) with a strict safety referee (Lagrangian) and a team of local captains (Multi-Agent AI), they created a system that keeps the lights on, protects the grid, and saves money, even when the "fog" of limited information is thick.

It's the difference between a conductor who panics when they can't see the back row, and a conductor who has a super-memory and a strict safety net, allowing the whole orchestra to play a perfect symphony even in the dark.

1. Problem Statement

The paper addresses the challenge of coordinating Electric Vehicle Charging Stations (EVCSs) within a Virtual Power Plant (VPP) framework under realistic information constraints.

Context: As distributed energy resources (DERs) like rooftop PV and EVs grow, VPPs are needed to coordinate them for grid stability. EVCSs are critical assets but can cause significant voltage deviations due to their concentrated, high-throughput charging demands.
Core Challenge: In practice, VPPs do not have full visibility of the Power Distribution Network (PDN) due to privacy, regulatory, and cybersecurity constraints. They only receive partial, aggregated information (e.g., local neighborhood voltages and loads) from the Distribution System Operator (DSO).
Gap: Existing Multi-Agent Reinforcement Learning (MARL) approaches often assume full grid-state visibility or lack robust safety guarantees during learning. This leads to unsafe decisions (voltage violations) or inefficient operations when deployed in real-world, partially observable environments.

2. Methodology: TL-MAPPO

The authors propose TL-MAPPO (Transformer-assisted Lagrangian Multi-Agent Proximal Policy Optimization), a safety-enhanced framework that follows the centralized training, decentralized execution paradigm. The methodology consists of three integrated components:

A. Problem Formulation (PO-CMDP)

The coordination problem is modeled as a Partially Observable Constrained Markov Decision Process (PO-CMDP).

Observations: Each EVCS agent $k$ observes only its local 1-hop neighborhood (voltage magnitudes and aggregated loads of adjacent buses), local PV generation, electricity prices, and EV state-of-charge (SoC) trajectories.
Objective: Minimize total operational costs (energy trading, battery degradation, and charging dissatisfaction) while strictly adhering to voltage safety limits ( $V_{min} \leq v_i \leq V_{max}$ ).
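
The local observation and the voltage constraint described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `LocalObservation`, `one_hop_neighbors`, `build_observation`, and `voltage_violation` are hypothetical, the grid is assumed to be given as an undirected edge list, and the limits of 0.95 and 1.05 p.u. are taken from the safe range quoted in the experimental results.

```python
from dataclasses import dataclass

# Per-unit voltage limits; values taken from the safe range quoted in the results.
V_MIN, V_MAX = 0.95, 1.05


@dataclass
class LocalObservation:
    """Hypothetical container for what one EVCS agent is allowed to see."""
    neighbor_voltages: list   # p.u. voltage magnitudes of adjacent buses
    neighbor_load_kw: float   # aggregated load of adjacent buses
    pv_kw: float              # local PV generation
    price: float              # current electricity price
    ev_soc: list              # state of charge of connected EVs


def one_hop_neighbors(bus, edges):
    """Return the buses directly connected to `bus` in an undirected edge list."""
    return sorted({b if a == bus else a for a, b in edges if bus in (a, b)})


def build_observation(bus, edges, voltages, loads_kw, pv_kw, price, ev_soc):
    """Assemble the partial view of the agent located at `bus`."""
    nbrs = one_hop_neighbors(bus, edges)
    return LocalObservation(
        neighbor_voltages=[voltages[n] for n in nbrs],
        neighbor_load_kw=sum(loads_kw[n] for n in nbrs),
        pv_kw=pv_kw,
        price=price,
        ev_soc=list(ev_soc),
    )


def voltage_violation(voltages):
    """Total distance outside [V_MIN, V_MAX]; zero when every voltage is within limits."""
    return sum(max(0.0, V_MIN - v) + max(0.0, v - V_MAX) for v in voltages)
```

An agent at bus 2 of a small feeder with edges `[(1, 2), (2, 3), (3, 4)]` would see only buses 1 and 3, never bus 4, which is the kind of limited visibility the formulation is built around.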

B. Transformer-Based Observation Processing

To compensate for limited visibility and capture long-term dependencies:

A Transformer encoder is deployed on each EVCS agent.
It processes a temporal window of observations (prices, loads, charging demand) to extract compact, high-level temporal representations.
This allows agents to understand temporal correlations (e.g., price spikes or load patterns) that a standard RNN or MLP might miss, improving decision quality despite partial state information.
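
To make the idea of attending over a temporal window concrete, here is a deliberately stripped-down sketch of single-head scaled dot-product self-attention followed by mean pooling. It is an assumption-laden illustration rather than the paper's encoder: the query, key, and value projections are replaced by the identity, and there are no positional encodings, multiple heads, or feed-forward layers. The function names `self_attention` and `encode_window` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(window):
    """Single-head scaled dot-product self-attention with identity projections.

    `window` is a list of T feature vectors, one per time step
    (e.g. [price, load, charging demand]). A real Transformer encoder
    would apply learned query/key/value projections instead.
    """
    d = len(window[0])
    out = []
    for q in window:
        # Similarity of this time step to every step in the window.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in window]
        weights = softmax(scores)
        # Weighted average of the window, feature by feature.
        out.append([sum(w * v[j] for w, v in zip(weights, window)) for j in range(d)])
    return out


def encode_window(window):
    """Mean-pool the attended time steps into one fixed-size embedding."""
    attended = self_attention(window)
    d = len(window[0])
    return [sum(step[j] for step in attended) / len(attended) for j in range(d)]


# Example: three time steps of (price, load, charging demand), values made up.
embedding = encode_window([[0.2, 0.5, 0.1], [0.3, 0.6, 0.4], [0.9, 0.4, 0.8]])
```

Whatever the window length, the output has the same size as a single feature vector, so it can be fed to the actor as a compact summary of recent history.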

C. Lagrangian MAPPO (Lag-MAPPO)

The core control algorithm uses a Centralized Training, Decentralized Execution (CTDE) architecture with Lagrangian regularization:

Architecture: Multiple decentralized actors (policies) are trained using two centralized critics: one for the reward (economic cost) and one for the safety cost (voltage violations + demand dissatisfaction).
Safety Enforcement: Instead of simple reward shaping, the framework uses Lagrangian regularization. A multiplier $\lambda$ is dynamically updated via projected dual ascent to penalize constraint violations.
Optimization: The actor optimizes a clipped PPO objective in which the advantage is adjusted by the Lagrangian term ( $\hat{A}_{Lag} = \hat{A}_R - \lambda \hat{A}_C$ ), where $\hat{A}_R$ and $\hat{A}_C$ are the advantage estimates from the reward and safety-cost critics. A larger $\lambda$ therefore shifts policy updates toward reducing constraint violations.
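
The three ingredients listed above (Lagrangian-adjusted advantage, clipped PPO surrogate, and projected dual ascent on the multiplier) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the clipping range `clip_eps`, the dual step size `lr`, and the cost budget `cost_limit` are all illustrative choices that do not appear in the summary.

```python
def lagrangian_advantage(adv_reward, adv_cost, lam):
    """Combine reward and safety-cost advantages: A_Lag = A_R - lambda * A_C."""
    return [ar - lam * ac for ar, ac in zip(adv_reward, adv_cost)]


def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean clipped PPO surrogate (a quantity the actor tries to maximize).

    `ratios` are the probability ratios pi_new(a|o) / pi_old(a|o).
    `clip_eps` is an assumed hyperparameter.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(r * a, r_clipped * a)
    return total / len(ratios)


def dual_ascent_step(lam, mean_episode_cost, cost_limit, lr=0.05):
    """Projected dual ascent on the multiplier.

    lambda grows when the observed safety cost exceeds the (assumed) budget
    `cost_limit`, shrinks otherwise, and is projected back onto lambda >= 0.
    """
    return max(0.0, lam + lr * (mean_episode_cost - cost_limit))


# Example with made-up numbers: one policy update followed by one multiplier update.
lam = 0.5
adv = lagrangian_advantage([1.0, 0.5], [0.5, 2.0], lam)
objective = clipped_surrogate([1.1, 0.9], adv)
lam = dual_ascent_step(lam, mean_episode_cost=3.0, cost_limit=1.0)
```

The projection step is what keeps the multiplier from going negative: when the policy is already safe, the penalty decays to zero instead of rewarding extra violations.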

3. Key Contributions

Realistic Coordination Framework: Formalized a VPP-DSO coordination setting where multiple EVCSs operate under partial PDN visibility, addressing the gap between theoretical MARL and practical deployment constraints.
TL-MAPPO Algorithm: Proposed a novel hybrid framework integrating:
- Lagrangian regularization for principled handling of safety constraints (voltage limits).
- Transformer embeddings to enhance temporal context understanding under limited grid visibility.
Empirical Validation: Demonstrated significant improvements over state-of-the-art baselines (MAPPO, MATD3, MASAC) in a realistic 33-bus PDN simulation.

4. Experimental Results

The framework was tested on an IEEE 33-bus system with 4 EVCSs (10 chargers each) over a 24-hour horizon (288 time steps, i.e., 5-minute resolution).

Voltage Safety: TL-MAPPO reduced voltage violations by approximately 45% compared to the best baseline (MAPPO). It kept voltages within the safe range (0.95–1.05 p.u.) almost all of the time, whereas the baselines frequently caused undervoltage, particularly at downstream buses.
Economic Efficiency: The method reduced operational costs by approximately 10% (from ~140 AUD to ~133.5 AUD per day) while maintaining lower battery cycling overhead.
Service Quality: It achieved the lowest demand dissatisfaction (unmet charging needs), reducing it by up to 35% compared to baselines.
Stability: The learning curves showed faster convergence and narrower confidence intervals, indicating higher training stability and robustness.

5. Significance

This work is significant for the practical deployment of AI in power systems for several reasons:

Bridging the Visibility Gap: It shows empirically that effective, safe coordination is possible even when VPPs lack full grid-state visibility, a common real-world constraint.
Safety-Centric AI: By moving beyond simple reward penalties to Lagrangian regularization, the paper provides a more principled mechanism for enforcing physical grid constraints (voltage limits) in RL agents.
Temporal Awareness: The integration of Transformers demonstrates that capturing long-term temporal dependencies is crucial for managing the stochastic nature of EV charging and renewable generation.
Scalability: With decentralized execution, no central controller has to process real-time data from every node, which should help the approach scale to larger networks and makes it a candidate for future smart grid applications.