Imagine you've spent millions of dollars and years of hard work training a brilliant AI assistant. You keep its "brain" (the mathematical weights) locked away on a secure server and assume that as long as people can't see the code or touch the computer, your secret is safe.
This paper, titled "Kraken," says: "Not so fast."
The researchers discovered that even if you can't touch the computer, the AI's brain is still "screaming" its secrets through invisible waves. They demonstrated that by listening to these waves from a distance—even through a glass window—they could steal the AI's brain.
Here is the breakdown of their discovery using simple analogies:
1. The Problem: The AI's "Whisper"
When an AI (like a Large Language Model) thinks, it performs billions of math calculations. Every time it does math, the computer chip (specifically the GPU) uses a tiny bit of energy and emits a tiny burst of electromagnetic radiation (like a radio wave).
- The Analogy: Imagine a secret agent doing math in a room. Every time they add two numbers, they tap their foot. If you stand outside the room, you can't see them, but if you have a super-sensitive microphone, you can hear the rhythm of their foot taps. By listening to the taps, you can figure out exactly what numbers they are adding.
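The idea above can be sketched in a few lines of Python. This is a toy simulation, not the paper's method: it uses a textbook "Hamming weight" leakage model (the signal tracks how many 1-bits are in the data being processed), and every number in it is invented for illustration. The point is just that if the "foot taps" correlate with the data, an attacker who records enough noisy taps can test every possible secret and pick the one that fits.

```python
import random

random.seed(0)

def hamming_weight(x):
    """Count of 1-bits: a standard proxy for how much a chip 'taps its foot'."""
    return bin(x).count("1")

def pearson(xs, ys):
    """Plain Pearson correlation, no external libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical setup: one secret 8-bit weight and many known inputs.
secret_weight = 0xA7
inputs = [random.randrange(256) for _ in range(2000)]

# Simulated EM traces: the "taps" track the Hamming weight of the data
# being processed, buried in measurement noise. (A toy XOR leakage
# model -- not the actual GPU leakage the paper characterizes.)
traces = [hamming_weight(secret_weight ^ x) + random.gauss(0, 2.0)
          for x in inputs]

# The attacker tries every possible weight and keeps the guess whose
# predicted "tap rhythm" best matches the recorded traces.
best_guess = max(
    range(256),
    key=lambda g: pearson([hamming_weight(g ^ x) for x in inputs], traces),
)
```

Run it and `best_guess` comes back equal to the secret: statistics beat the noise, even though no single recording reveals anything on its own.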
2. The Old Way vs. The New "Kraken" Way
Previous hackers tried to steal AI models by getting very close to the computer (like putting a microphone right against the wall). They also mostly targeted the "general purpose" parts of the chip.
The Kraken team did two revolutionary things:
- They listened to the "Special Forces": Modern AI chips have special, super-fast engines called Tensor Cores. These are the heavy lifters used by big AI models. The researchers figured out how to listen specifically to these engines, whose emissions are much louder and more distinctive than those of the older, general-purpose parts.
- They listened from the "Far Field": This is the big shocker. They proved you don't need to be right next to the machine. You can stand 100 centimeters (about 3 feet) away, even with a glass window in between, and still hear the secrets.
3. How They Did It: The "Warp" and the "Squad"
GPUs work by processing data in groups.
- The Analogy: Imagine a construction crew (the GPU). They don't work one by one; they work in squads of 32 people called Warps.
- The Old Mistake: Previous hackers tried to listen to just one worker in the squad. It was like trying to hear one person whisper in a crowded stadium: very hard, and it required millions of recordings to make sense of the signal.
- The Kraken Solution: The researchers realized that all 32 workers in a squad are doing similar math at the exact same time. Instead of listening to one person, they listened to the entire squad's combined noise.
- Result: It's like listening to the whole choir instead of one singer. The signal is much louder and clearer, making the theft much faster and easier.
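The squad-versus-soloist point is really about signal-to-noise ratio. The hedged sketch below simulates it with made-up numbers: 32 workers emitting the same secret-dependent signal in lockstep add up coherently, while the antenna's noise floor stays the same, so the squad's SNR is roughly 32² times the soloist's.

```python
import random

random.seed(1)

N_TRACES = 500
WARP_SIZE = 32    # a warp is 32 threads on NVIDIA GPUs
SIGNAL = 1.0      # per-worker leakage amplitude (arbitrary units)
NOISE_STD = 4.0   # measurement noise, identical in both scenarios

# Each trace carries a secret-dependent bit, encoded as +1 or -1.
bits = [random.choice([-1, 1]) for _ in range(N_TRACES)]

# One worker: a tiny signal drowned in noise.
single = [b * SIGNAL + random.gauss(0, NOISE_STD) for b in bits]

# The whole warp: 32 workers emit the same data-dependent signal at the
# same instant, so their contributions add while the noise does not grow.
squad = [b * SIGNAL * WARP_SIZE + random.gauss(0, NOISE_STD) for b in bits]

def snr(samples, bits):
    """Estimated ratio of secret-correlated energy to residual noise energy."""
    sig = sum(s * b for s, b in zip(samples, bits)) / len(samples)
    resid = [s - sig * b for s, b in zip(samples, bits)]
    noise_var = sum(r * r for r in resid) / len(resid)
    return sig * sig / noise_var

print(snr(single, bits), snr(squad, bits))
```

With these toy numbers the squad's SNR comes out hundreds of times higher, which is why targeting the whole warp needs far fewer recordings than targeting one thread.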
4. The "Higher-Order" Trick
Sometimes, the same secret number (a weight) is used in many different calculations throughout the AI's thinking process.
- The Analogy: Imagine a spy using the same password to open three different doors. If you only watch the first door, you might miss the pattern. But if you watch all three doors and combine the clues, the pattern becomes obvious.
- The Kraken Solution: They combined the "noise" from multiple different moments in time. By stacking these clues together (a technique called Higher-Order Attacks), they could crack the code much faster than before.
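A simplified version of "watching all three doors" can be simulated too. Caveat: the paper's higher-order attacks combine leakage points in more sophisticated ways; this toy sketch (same invented Hamming-weight model as before, all parameters hypothetical) only shows the core intuition that pooling several noisy moments where the same weight appears gives a much cleaner statistic than any single moment.

```python
import random

random.seed(2)

def hamming_weight(x):
    return bin(x).count("1")

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs) ** 0.5
    vy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (vx * vy)

secret = 0x3C            # hypothetical weight byte
K = 5                    # the same weight is touched at K moments ("doors")
inputs = [random.randrange(256) for _ in range(1000)]

# Per run, K noisy leakage points that all depend on the same secret,
# each with its own independent noise.
runs = [[hamming_weight(secret ^ x) + random.gauss(0, 3.0) for _ in range(K)]
        for x in inputs]

def correct_guess_score(point_ids):
    """Correlation of the TRUE weight when pooling the chosen moments."""
    obs = [sum(r[i] for i in point_ids) / len(point_ids) for r in runs]
    return pearson([hamming_weight(secret ^ x) for x in inputs], obs)

def recover(point_ids):
    """Best-guess weight from the pooled leakage of the chosen moments."""
    obs = [sum(r[i] for i in point_ids) / len(point_ids) for r in runs]
    return max(range(256),
               key=lambda g: pearson([hamming_weight(g ^ x) for x in inputs],
                                     obs))

one_door = correct_guess_score([0])        # one moment: weak evidence
all_doors = correct_guess_score(range(K))  # pooled moments: much stronger
```

Pooling all five "doors" roughly halves the effective noise here, so `all_doors` correlates far better than `one_door`, and `recover(range(K))` finds the secret with far fewer runs than a single-point attack would need.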
5. The Glass Wall Experiment
To prove how dangerous this is, they set up a real-world test:
- They put a powerful AI (Llama 3.2) on a high-end graphics card (RTX 4090).
- They placed a radio antenna 1 meter away, separated by a glass pane.
- The Result: The glass didn't stop the signal. The antenna picked up the electromagnetic "whispers" clearly enough to start reconstructing the AI's brain.
6. Why Should You Care?
- Intellectual Property is at Risk: Companies spend millions training these models. If a thief can stand outside a building and steal the model through the window, that investment is gone.
- It's Not Just "Theory": This isn't just math on a whiteboard. They actually did it. They showed that even with modern, fast chips, the physics of electromagnetism means that "air-gapped" (physically isolated) systems aren't as safe as we thought.
- The Fix: The paper suggests that to stop this, companies might need to put their AI servers inside metal boxes (Faraday cages) to block these radio waves, or use other physical shielding, because software tricks alone might not be enough.
Summary
The Kraken paper is a wake-up call. It tells us that in the age of AI, physical security matters more than ever. Just because you can't touch the computer doesn't mean it's safe; the computer is constantly broadcasting its secrets, and with the right equipment, a thief can listen in from across the room.