Benchmarking Dataset for Presence-Only Passive Reconnaissance in Wireless Smart-Grid Communications

Here is an explanation of the paper, translated from technical jargon into a story you can picture in your mind.

The Big Picture: The "Silent Stalker" in the Smart Grid

Imagine a Smart Grid as a giant, high-tech nervous system for a city. It connects millions of devices: your smart meter at home, the streetlights, the power substations, and the control centers. These devices talk to each other constantly using invisible radio waves (like Wi-Fi) and power lines to make sure electricity flows smoothly.

For years, cybersecurity experts have been worried about loud, active attackers. Think of these as hackers who break into the system, shout lies, steal data, or jam the radio signals to cause chaos. We have good tools to catch them.

But this paper is worried about a silent stalker.

Imagine a spy standing right next to a radio tower. The spy isn't shouting, isn't hacking, and isn't sending any messages. They are just standing there. However, their body blocks some of the radio waves, and their presence changes how the signal bounces around. To the devices talking to each other, the signal suddenly sounds a little "muffled" or "fuzzy," even though no one touched the equipment.

This paper asks: Can we detect a spy just by noticing that the air around the radio waves feels slightly different?

The Problem: We Didn't Have a "Training Gym"

To teach a computer to spot this silent stalker, you need a gym where it can practice. But in the real world, you can't easily place a spy next to a power grid to see what happens. Existing datasets only show "loud" attacks (like hackers sending fake messages). They don't show the subtle "muffled" signals caused by a person just standing nearby.

So, the authors built a virtual training gym.

The Solution: A Virtual Smart Grid Simulator

The authors created a computer program that generates a fake, but incredibly realistic, Smart Grid. Here is how they built it, using simple analogies:

1. The Neighborhood (The Topology)

Imagine a three-layered neighborhood:

The Home (HAN): Your smart meter and Wi-Fi router.
The Block (NAN): The local street controllers and solar panels.
The City (WAN): The big power stations and control centers.

They created a map of 12 devices in this neighborhood. Some talk via Wi-Fi, some via Power Lines (PLC), and some via Fiber Optics (which are like glass pipes that light travels through).

2. The "Ghost" Attack (Passive Reconnaissance)

In this simulation, the "attacker" is a ghost.

The ghost doesn't touch anything.
The ghost doesn't send messages.
The ghost just stands near a wireless device.

When the ghost stands there, two things happen to the invisible radio waves:

Shadowing: The ghost blocks some signal, making it weaker (like a cloud blocking the sun).
Echoes: The ghost's body causes the signal to bounce weirdly, creating a "fuzzy" echo.

The computer then calculates: "If the signal is weaker and fuzzier, will the device make more mistakes? Will it take longer to reply?" The answer is yes. The "muffled" signal causes more dropped messages and slower replies.

3. The "Chain Reaction" (Physical Consistency)

This is the most important part. The authors didn't just randomly change the numbers to look like an attack. They built a physics engine.

Step 1: The ghost stands there $\rightarrow$ Signal gets weaker.
Step 2: Weaker signal $\rightarrow$ The computer calculates a lower "Signal-to-Noise" ratio (like trying to hear a whisper in a noisy room).
Step 3: Lower ratio $\rightarrow$ The computer calculates that more messages will fail (Packet Error).
Step 4: More failures $\rightarrow$ The device has to re-send messages, causing a delay (Latency).

Because the attack happens at the very bottom (the physics of the air), the changes ripple up to the top (the speed and reliability of the network). This makes the data realistic. It's not a fake flag saying "ATTACK HERE"; it's a natural consequence of physics.

Why This Matters: The "Leak-Safe" Rule

The authors were very careful to make sure the data didn't give away the answer too easily.

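No Cheating: They didn't slip a "hint" into the data the computer studies (like a hidden flag saying "This is an attack"). The answer key is kept separate and is only used to grade the computer.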
No Shortcuts: They made sure the computer had to learn the pattern of the signal changes, not just look for a specific number.
Privacy Ready: They designed the data so that different computers can learn from their own local neighborhoods without sharing private data (Federated Learning). This is like neighbors teaching a security guard what a "normal" day looks like without showing each other their private cameras.

The Results: It's Harder Than It Looks

They tested some basic AI models on this new data.

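The Result: The AI struggled. The simplest model caught most of the attacks but also raised many false alarms. Models that watch how the signal changes over time did better, but still made mistakes.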
The Lesson: Detecting a silent stalker is very hard. The changes are tiny and subtle. You can't just look at one second of data; you have to look at the history of the signal and how neighbors are behaving.

The Takeaway

This paper provides a new, realistic training manual for cybersecurity experts. It teaches them how to spot a spy who is just standing there, changing the air around the radio waves, without ever touching a single wire.

In short:

Old way: Catch the hacker who breaks the door down.
New way: Catch the spy who stands in the hallway and changes the temperature just enough to make the thermostat confused.
This paper: Built a virtual house to practice spotting that spy.

Here is a detailed technical summary of the paper "Benchmarking Dataset for Presence-Only Passive Reconnaissance in Wireless Smart-Grid Communications."

1. Problem Statement

The paper addresses a critical gap in smart-grid cybersecurity research: the lack of realistic benchmarks for presence-only passive reconnaissance.

The Threat: Unlike active attacks (e.g., false-data injection, jamming, or replay), a passive adversary is "receive-only." They do not transmit or manipulate packets. Instead, they physically position themselves near wireless links (HAN/NAN/WAN). Their mere presence alters the radio propagation environment through shadowing (attenuation) and multipath scattering, causing subtle, temporally correlated deviations in Channel State Information (CSI), Signal-to-Noise Ratio (SNR), and packet error rates.
The Gap: Existing smart-grid datasets focus on active attacks or high-layer protocol anomalies. They rarely provide propagation-layer observables (CSI/RSSI) with tiered topology context (HAN/NAN/WAN) suitable for evaluating detectors that must distinguish between natural channel fading and subtle, proximity-induced anomalies without "leaking" attack information through injected flags.

2. Methodology

The authors propose a synthetic benchmark dataset generator designed to be physically consistent and "leak-safe."

A. Topology and Network Model

Structure: A 12-node communication graph representing a tiered Smart Grid:
- HAN (Home Area Network): Smart meters and gateways (ZigBee, Wi-Fi).
- NAN (Neighborhood Area Network): Distributed Energy Resources (DERs), feeders, and controllers (LoRa, PLC, LTE).
- WAN (Wide Area Network): SCADA, PMUs, and substations (Fiber, LTE, PLC).
Constraints: The topology enforces IEEE 2030 standards (e.g., no direct HAN-to-WAN links). Fiber links are designated as "attack-ineligible" (always normal), while wireless/PLC links are eligible for perturbation.
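
The two constraints above (no direct HAN-to-WAN links, fiber never attacked) can be illustrated with a small check. This is a minimal sketch only: the node names, the six-node subset, and the helper functions are hypothetical and do not reproduce the paper's actual 12-node layout.

```python
# Hypothetical nodes and tiers, chosen only to illustrate the two constraints.
TIER = {
    "meter_1": "HAN", "gateway_1": "HAN",
    "der_1": "NAN", "feeder_1": "NAN",
    "substation_1": "WAN", "scada": "WAN",
}

# (node_a, node_b, technology); an assumed example edge list.
LINKS = [
    ("meter_1", "gateway_1", "ZigBee"),
    ("gateway_1", "feeder_1", "PLC"),
    ("der_1", "feeder_1", "LoRa"),
    ("feeder_1", "substation_1", "LTE"),
    ("substation_1", "scada", "Fiber"),
]

def violates_tiering(a, b):
    """True if the link would connect the HAN tier directly to the WAN tier."""
    return {TIER[a], TIER[b]} == {"HAN", "WAN"}

def attack_eligible(tech):
    """Fiber links are always normal; wireless and PLC links may be perturbed."""
    return tech != "Fiber"

def validate(links):
    """Reject tier-skipping links and return an eligibility flag per link."""
    bad = [(a, b) for a, b, _ in links if violates_tiering(a, b)]
    if bad:
        raise ValueError(f"direct HAN-to-WAN links are not allowed: {bad}")
    return {(a, b): attack_eligible(tech) for a, b, tech in links}

eligibility = validate(LINKS)
```

In this sketch a smart meter can only reach SCADA through a NAN node, and the fiber link between the substation and SCADA is marked ineligible.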

B. Physical Channel Modeling

The generator uses a deterministic, causal chain to map physical phenomena to observable metrics, ensuring no "shortcuts" or injected labels:

Latent Fading: A complex Gauss-Markov process ($h_i(t)$) models small-scale fading with technology-specific temporal correlation.
Shadowing & Interference: Large-scale fading follows 3GPP TR 38.901 log-normal statistics. Interference includes impulsive noise (modeled for PLC) and background noise.
Measurement Proxy: Latent amplitude is mapped to a measurable amplitude proxy ($C$) with device-specific quantization (e.g., 1 dB for ZigBee) and bounded noise.
Link Metrics Chain: The system computes metrics sequentially:
- $C \rightarrow \text{SNR}$ (Signal-to-Noise Ratio)
- $\text{SNR} \rightarrow \text{PER}$ (Packet Error Rate) via a logistic "waterfall" curve.
- $\text{PER} \rightarrow \text{Latency}$ (including ARQ retransmission logic, jitter, and burst components).
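
The chain $C \rightarrow \text{SNR} \rightarrow \text{PER} \rightarrow \text{Latency}$ can be sketched in a few lines. Everything numeric here is an assumption made for illustration (the noise floor, the logistic midpoint and slope, the base and retry delays, the retry limit), and the proxy $C$ is treated as a received level in dB. The jitter and burst components mentioned above are left out.

```python
import math

def quantize_db(value_db, step_db=1.0):
    """Round a level to the device's reporting step (e.g., 1 dB)."""
    return round(value_db / step_db) * step_db

def snr_db(c_db, noise_floor_db):
    """SNR in dB, treating the proxy C as a received level in dB (assumption)."""
    return c_db - noise_floor_db

def per_from_snr(snr, snr_mid_db=10.0, slope=1.0):
    """Logistic 'waterfall': PER near 1 at low SNR, near 0 at high SNR."""
    return 1.0 / (1.0 + math.exp(slope * (snr - snr_mid_db)))

def expected_latency_ms(per, base_ms=5.0, retry_ms=2.0, max_retries=3):
    """Base delay plus the expected ARQ retransmission delay.

    Retry k happens only if the first k attempts all failed, so with
    independent attempts the expected retry count is sum(per**k).
    """
    expected_retries = sum(per ** k for k in range(1, max_retries + 1))
    return base_ms + retry_ms * expected_retries

def link_metrics(c_db, noise_floor_db):
    """Recompute every indicator from the amplitude proxy, in causal order."""
    c_q = quantize_db(c_db)
    snr = snr_db(c_q, noise_floor_db)
    per = per_from_snr(snr)
    return {"C": c_q, "snr_db": snr, "per": per,
            "latency_ms": expected_latency_ms(per)}

clear = link_metrics(-80.0, -95.0)      # unobstructed link
shadowed = link_metrics(-88.0, -95.0)   # same link with 8 dB extra loss
```

Because PER and latency are derived from the same proxy rather than set independently, a drop in $C$ automatically shows up as a higher error rate and a longer delay.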

C. Attack Generation (Presence-Only)

Attacks are modeled as propagation perturbations applied only during active transmission epochs:

Shadow Loss: Additional attenuation (shadowing) is added to the latent channel, simulating body blockage.
Coherence Degradation: The temporal correlation of the fading process is reduced, and innovation (scattering) is increased, mimicking the effect of a human moving near the link.
Ramp-Up: Perturbations are applied via a piecewise-linear ramp to avoid abrupt, unrealistic jumps.
Activity Gating: Attacks are only labeled and applied when tx_count > 0 to ensure physical realism (no perturbation on silent links).
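
The four mechanisms above can be combined in one short simulation. This is a sketch under stated assumptions, not the authors' generator: the first-order Gauss-Markov update, the correlation values, the 3 dB shadow loss, and the ramp length are all illustrative choices. Lowering the correlation `rho` also raises the innovation term `sqrt(1 - rho**2)`, which is how the sketch captures increased scattering.

```python
import numpy as np

def ramp(t, start, ramp_len):
    """Piecewise-linear ramp: 0 before start, reaching 1 after ramp_len epochs."""
    return float(np.clip((t - start + 1) / ramp_len, 0.0, 1.0))

def simulate_link(n_epochs, tx_count, attack_start=None, rho_normal=0.98,
                  rho_attack=0.90, shadow_db=3.0, ramp_len=10, seed=0):
    """Return per-epoch channel gain (dB) and presence labels for one link."""
    rng = np.random.default_rng(seed)
    h = complex(1.0, 0.0)                 # latent complex fading state
    gain_db = np.empty(n_epochs)
    label = np.zeros(n_epochs, dtype=int)
    for t in range(n_epochs):
        # Activity gating: perturb and label only while the link transmits.
        active = (attack_start is not None and t >= attack_start
                  and tx_count[t] > 0)
        w = ramp(t, attack_start, ramp_len) if active else 0.0
        # Coherence degradation: correlation slides from normal to attack value.
        rho = rho_normal + w * (rho_attack - rho_normal)
        noise = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        h = rho * h + np.sqrt(1.0 - rho ** 2) * noise
        # Shadow loss: extra attenuation, scaled by the same ramp.
        gain_db[t] = 20.0 * np.log10(abs(h) + 1e-12) - w * shadow_db
        label[t] = int(active)
    return gain_db, label

tx = np.array([1] * 20 + [0] * 5 + [1] * 25)   # link is silent in epochs 20-24
gain, label = simulate_link(50, tx, attack_start=10)
```

With this transmission pattern, epochs 20 to 24 carry no label and no perturbation even though the attacker is nominally present, because the link is silent.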

D. Data Construction & Leak Safety

Split Independence: Train, validation, and test sets are generated with independent random seeds and burn-in periods to prevent data leakage.
Causal Features: Features are strictly causal (rolling windows, no future data).
Normalization: Per-node standardization parameters are fit only on the training set and applied to validation/test sets.
Federated Ready: The dataset provides per-node partitions and adjacency matrices to support Federated Learning (FL) and graph-temporal pipelines.
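
Two of these rules, causal features and train-only normalization, are easy to get wrong in practice, so a short sketch may help. The function names, window length, and synthetic arrays below are hypothetical and stand in for one node's data.

```python
import numpy as np

def causal_rolling_mean(x, window):
    """Mean over the current sample and the previous window-1 samples only."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for t in range(len(x)):
        out[t] = x[max(0, t - window + 1): t + 1].mean()
    return out

def fit_standardizer(train):
    """Estimate per-feature mean and std from the training split only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard constant features
    return mu, sigma

def apply_standardizer(x, params):
    """Apply previously fitted parameters; never refit on val/test data."""
    mu, sigma = params
    return (x - mu) / sigma

# Independent seeds per split, mirroring the split-independence rule.
train = np.random.default_rng(1).normal(5.0, 2.0, size=(200, 3))
test = np.random.default_rng(2).normal(5.0, 2.0, size=(50, 3))

node_params = {"node_0": fit_standardizer(train)}       # one entry per node
train_z = apply_standardizer(train, node_params["node_0"])
test_z = apply_standardizer(test, node_params["node_0"])
```

The test split is scaled with the training statistics, so its standardized mean will generally not be exactly zero. That is expected and is the price of keeping test information out of preprocessing.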

3. Key Contributions

Topology-Aware Benchmark: A 12-node HAN/NAN/WAN graph with heterogeneous technologies (ZigBee, Wi-Fi, LoRa, PLC, LTE, Fiber) and strict tier-aware connectivity constraints.
Strictly Passive Perturbation Model: Attacks are modeled solely as propagation changes (shadowing and coherence loss) without packet injection, replay, or modification. Link indicators are recomputed coherently from physical laws.
Leak-Safe Construction: The dataset eliminates "feature shortcuts" (e.g., injected attack flags) and ensures split independence, enabling rigorous evaluation of detection algorithms.
Temporal + Neighbor Context: The release includes adjacency-weighted neighbor aggregates and deviation features, facilitating topology-aware learning.
Federated-Ready Release: Includes per-node train/val/test splits with normalization metadata, specifically designed for centralized, local, and federated graph-temporal detection pipelines.
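
As an illustration of the neighbor-context features, the sketch below computes an adjacency-weighted neighbor mean and each node's deviation from it for a single epoch. The exact feature definitions in the release are not given in this summary, so the formulas, the three-node graph, and the example values are assumptions.

```python
import numpy as np

def neighbor_features(x, adj):
    """Adjacency-weighted neighbor mean and deviation for one epoch.

    x:   (n_nodes,) value of one feature at each node
    adj: (n_nodes, n_nodes) nonnegative link weights, zero diagonal
    """
    x = np.asarray(x, dtype=float)
    adj = np.asarray(adj, dtype=float)
    row_sum = adj.sum(axis=1)
    safe = np.where(row_sum > 0, row_sum, 1.0)      # avoid division by zero
    neigh_mean = adj @ x / safe
    # Isolated nodes have no neighbor context, so their deviation is set to 0.
    deviation = np.where(row_sum > 0, x - neigh_mean, 0.0)
    return neigh_mean, deviation

# Hypothetical 3-node star: node 0 is linked to nodes 1 and 2.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]])
snr = np.array([10.0, 20.0, 30.0])
neigh_mean, deviation = neighbor_features(snr, adj)
```

A node whose SNR drops while its neighbors stay steady produces a large negative deviation, which is the kind of signal a topology-aware detector can use to separate a local disturbance from network-wide fading.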

4. Results

The authors evaluated the dataset with federated baseline detectors to demonstrate the difficulty of the task:

Subtlety of Attacks: Presence-only attacks are low-amplitude and technology-dependent.
Baseline Performance:
- Linear Models (Fed-LR): Achieved high recall (0.88) but low precision (0.39), indicating many false alarms when temporal consistency is ignored.
- Tree Ensembles (Fed-XGB): Improved precision (0.54) but struggled with subtle regimes.
- Recurrent Models (Fed-LSTM/GRNN): Showed better balance (F1 ~0.65–0.72), suggesting that temporal modeling is crucial.
Key Finding: Single-epoch decisions are insufficient for detecting stealthy presence-only attacks. Effective detection requires spatiotemporal pipelines that leverage both time-series consistency and neighbor context (graph structure).
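
To put the linear baseline on the same scale as the recurrent models, its F1 score can be derived from the precision and recall quoted above. The value computed here is simple arithmetic on those two numbers, not a figure reported in this summary, and it assumes both were measured at the same operating point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

fed_lr_f1 = f1_score(precision=0.39, recall=0.88)
print(round(fed_lr_f1, 2))  # 0.54
```

An F1 of about 0.54 sits below the 0.65–0.72 range quoted for the recurrent models, which is consistent with the point that temporal modeling helps.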

5. Significance

Standardization: This work provides the first standardized benchmark for presence-only passive reconnaissance in smart grids, filling a void where previous datasets focused on active attacks.
Realism: By anchoring parameters in 3GPP standards, device datasheets (e.g., CC2420, SX1276), and measurement literature, the synthetic data is quantitatively defensible and physically consistent.
Security Implications: It highlights that adversaries can infer network activity or locate devices simply by observing channel perturbations, necessitating detection methods that go beyond static thresholds.
Research Enabler: The "leak-safe" design allows researchers to rigorously test Federated Learning and Graph Neural Networks (GNNs) for anomaly detection without the risk of overfitting to artificial artifacts.

In summary, this paper introduces a rigorous, physics-based synthetic dataset that enables the development and evaluation of advanced detection systems for stealthy, passive adversaries in smart-grid communications.