The Big Picture: The Traffic Jam at the Edge of the City
Imagine a busy city intersection where self-driving cars are zooming by. To keep them safe, they need to talk to a Roadside Unit (RSU)—a smart traffic box sitting on a pole. This box has to do two things at once:
- Talk to the cars: It has to decode thousands of complex radio messages instantly to tell cars when to stop or go.
- Think for the city: It also has to run traffic lights, analyze camera feeds, and coordinate with other cars.
The problem? The "talking" part (decoding radio messages) is so heavy that it can swallow the computer's entire capacity, leaving nothing for the "thinking" part. If the computer gets overwhelmed, the self-driving cars might crash.
This paper asks: Can we give this traffic box a super-charged assistant (a GPU) to handle the heavy talking, so the main brain (the CPU) stays fresh for the thinking?
The answer is a resounding yes, but with a catch: the assistant only pays off when there is a lot of work to do.
The Cast of Characters
- The CPU (The General Manager): This is the standard computer brain. It's great at doing many different small tasks (like managing traffic lights, running apps, and organizing files). But when asked to decode thousands of radio messages at once, it gets tired and slow.
- The GPU (The Assembly Line): This is a specialized chip (like in gaming computers or AI servers). It's not great at doing one tiny thing, but it is a monster at doing the same thing thousands of times simultaneously. Think of it as a factory with 10,000 robots working in perfect sync.
- The LDPC Decoder (The Translator): This is the specific job of translating the garbled radio noise into clear instructions. It's the most exhausting part of the job.
- The "Slot Budget" (The Time Limit): In 5G, messages must be decoded within a tiny fraction of a second (about 0.5 milliseconds). If you miss this deadline, the message is lost, and the car might not brake in time.
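The slot-budget idea boils down to a simple pass/fail check. Here is a minimal sketch; the 0.5 ms figure comes from the text above, while the decode times passed in are made-up illustrations:

```python
# Illustrative "slot budget" check. The 0.5 ms budget matches the text;
# the decode times below are invented examples, not measurements.
SLOT_BUDGET_MS = 0.5

def fits_in_slot(decode_time_ms: float) -> bool:
    """A decoded message is only useful if it beats the slot deadline."""
    return decode_time_ms <= SLOT_BUDGET_MS

print(fits_in_slot(0.125))  # a fast decoder using 25% of the budget -> True
print(fits_in_slot(0.75))   # a decoder overrunning the budget       -> False
```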
The Experiment: The "Six Times to Spare" Test
The researchers built a simulation to see how fast the CPU and GPU could decode these messages. They tested two types of computers:
- The "Workstation" (The Big Truck): A powerful desktop computer with a massive, separate graphics card. It's fast but eats a lot of electricity and doesn't fit on a street pole.
- The "Edge Node" (The Compact Van): A new, tiny, all-in-one computer (NVIDIA DGX Spark) designed to fit on a street pole. It has the CPU and GPU built into the same chip, sharing memory like roommates sharing a fridge.
The Three Zones of Performance
The researchers found that the GPU doesn't win in every situation. They discovered three distinct zones:
1. The "Empty Street" Zone (Small Batches)
- Scenario: Only 1 or 2 cars are talking to the box.
- Result: The CPU wins.
- Analogy: If you only have one package to deliver, it's faster to just walk it to the door yourself (CPU) than to fire up a massive delivery truck, drive it to the warehouse, load it, and drive it back (GPU). The GPU spends too much time "warming up."
2. The "Ramp-Up" Zone (Medium Batches)
- Scenario: A few dozen cars are talking.
- Result: The GPU starts to catch up.
- Analogy: As more packages arrive, the truck starts to make sense. The more packages you have, the more efficient the truck becomes compared to the person walking.
3. The "Dense Traffic" Zone (Large Batches)
- Scenario: A rush hour! Hundreds of cars are screaming for instructions at the exact same time.
- Result: The GPU dominates.
- The "Six Times" Discovery: In this heavy traffic, the GPU on the compact "Edge Node" was 6 times faster than the CPU.
- The CPU tried to decode the messages and blew through its entire time budget (it was late!).
- The GPU did the exact same job in only 25% of the time budget.
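The three zones fall out of a simple latency model: the GPU pays a fixed "warm-up" cost per batch but almost nothing per message, while the CPU pays nothing up front but more per message. The constants below are invented for illustration (they are not the paper's measurements), but the crossover behavior is the point:

```python
# Toy latency model for the three zones. All constants are made-up
# illustrations, not measurements from the paper.
GPU_LAUNCH_OVERHEAD_MS = 0.10   # fixed cost to "fire up the truck"
GPU_PER_MSG_MS = 0.001          # tiny per-message cost once running
CPU_PER_MSG_MS = 0.010          # no fixed overhead, but serial work adds up

def cpu_time(batch: int) -> float:
    return CPU_PER_MSG_MS * batch

def gpu_time(batch: int) -> float:
    return GPU_LAUNCH_OVERHEAD_MS + GPU_PER_MSG_MS * batch

# Empty street, ramp-up, dense traffic:
for batch in (1, 16, 500):
    winner = "CPU" if cpu_time(batch) < gpu_time(batch) else "GPU"
    print(f"batch={batch:4d}: CPU {cpu_time(batch):.3f} ms, "
          f"GPU {gpu_time(batch):.3f} ms -> {winner} wins")
```

With these numbers the crossover sits around a dozen messages: below it the fixed overhead dominates (the CPU wins), above it the GPU's per-message advantage takes over.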
Why "Six Times to Spare" Matters
The title "Six Times to Spare" refers to the extra time the system gains.
Imagine the traffic box has a strict 1-second deadline to finish all its work.
- Without the GPU: The CPU spends 1.5 seconds decoding messages. It's late. The system fails.
- With the GPU: The GPU finishes decoding in 0.25 seconds.
- The Result: You now have 0.75 seconds of "spare time" (or "headroom").
This spare time is gold. It allows the Roadside Unit to:
- Handle sudden spikes in traffic (like a parade or an accident).
- Run complex AI to see around corners (cooperative perception).
- Manage the traffic lights without crashing.
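The headroom arithmetic above can be written out directly, using the same illustrative numbers (a 1-second deadline, 1.5 s for the CPU, 0.25 s for the GPU):

```python
# Headroom arithmetic from the example above. The 1-second deadline and
# the decode times are the text's illustrative numbers, not real timings.
BUDGET_S = 1.0

def headroom(decode_time_s: float) -> float:
    """Spare time left for 'thinking' tasks; negative means a missed deadline."""
    return BUDGET_S - decode_time_s

print(headroom(1.5))   # CPU alone: -0.5 -> deadline missed
print(headroom(0.25))  # with GPU:  0.75 s of spare time
```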
The "Secret Sauce": Coherent Memory
The paper also highlights a clever design trick in the new "Edge Node" (DGX Spark).
- Old Way (Workstation): The CPU and GPU are like two people in different rooms. To share data, they have to shout through a door (the PCIe bus). This takes time and energy.
- New Way (Edge Node): The CPU and GPU are in the same room, sharing a single whiteboard (Shared Memory). They can grab data instantly without shouting.
This design means the compact Edge Node doesn't just get faster; it gets more efficient because it doesn't waste energy moving data back and forth.
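The cost of "shouting through the door" can be sketched as a toy model: on a discrete workstation, every batch pays for a copy to the GPU and a copy back, while on a unified-memory design those copies vanish. The constants here are invented for illustration only:

```python
# Toy comparison of the two data paths. Constants are hypothetical
# illustrations, not DGX Spark or PCIe measurements.
PCIE_COPY_MS_PER_MB = 0.03   # assumed cost to move 1 MB over the bus
DECODE_MS_PER_MB = 0.05      # assumed on-GPU decode cost per MB

def discrete_latency(mb: float) -> float:
    # Workstation path: copy data in, decode, copy results back.
    return 2 * PCIE_COPY_MS_PER_MB * mb + DECODE_MS_PER_MB * mb

def unified_latency(mb: float) -> float:
    # Edge-node path: CPU and GPU share one memory, so no copies.
    return DECODE_MS_PER_MB * mb

mb = 8.0
print(f"discrete: {discrete_latency(mb):.2f} ms, "
      f"unified: {unified_latency(mb):.2f} ms")
```

In this toy model the unified design wins at every batch size, and the gap grows with the amount of data moved, which is the intuition behind the efficiency claim.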
The Bottom Line
This paper makes the case that for self-driving cars to be safe, the computers on the street poles need a GPU assistant.
When traffic is light, the main computer can handle it. But when the city gets busy (which is when safety matters most), the GPU steps in and does the heavy lifting 6 times faster than the main computer could alone. This frees up the main computer to do the smart thinking, ensuring that self-driving cars can communicate instantly and safely, even during the busiest rush hours.
In short: The GPU doesn't just make things faster; it buys the system enough "spare time" to keep everyone safe.