Here is an explanation of the paper, translated into everyday language with some creative analogies.
The Big Problem: The "Endless Backpack"
Imagine you are trying to write a story, but every time you add a new sentence, you have to carry your entire previous story with you in a backpack to remember what you wrote.
- The Old Way (Standard AI): As the story gets longer, the backpack gets heavier and heavier. Eventually, you can't carry it anymore. In computer terms, this is the "KV Cache." Every time the AI generates a new word, it has to go to the main memory (a giant warehouse) to grab the whole backpack, do some math, and put it back.
- The New Way (Gated DeltaNet): Scientists invented a smarter way. Instead of carrying the whole story, the AI just keeps a tiny, fixed-size notebook (about 2 Megabytes) that summarizes the story. No matter how long the story gets, the notebook stays the same size. This is much lighter!
The Catch: Even though the notebook is small, the current computers (GPUs) are so fast at thinking that they spend almost all their time just running back and forth to the warehouse to fetch the notebook. They are "memory-bound." It's like having a Formula 1 race car stuck in traffic because the driver has to walk to the gas station for every drop of fuel.
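For the technically curious, the backpack-versus-notebook difference can be sketched in a few lines of Python. The model sizes below are hypothetical (chosen so the fixed state comes out to the 2 MB mentioned above), not taken from the paper; the point is only the shape of the two curves.

```python
def kv_cache_bytes(seq_len, n_layers=16, n_heads=4, head_dim=128, bytes_per_value=2):
    """The 'backpack' (standard KV cache): keys and values for every past
    token, in every layer and head, at 2 bytes each (fp16).
    Grows linearly with the length of the story."""
    return seq_len * n_layers * n_heads * head_dim * 2 * bytes_per_value

def recurrent_state_bytes(n_layers=16, n_heads=4, head_dim=128, bytes_per_value=2):
    """The 'notebook' (recurrent state): one fixed-size matrix per head,
    no matter how many tokens have been generated."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value
```

With these made-up sizes the notebook is a constant 2 MiB, while the cache overtakes it after only a few dozen tokens and keeps growing from there.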
The Solution: The "On-Board Library"
The researchers built a special chip (an FPGA) that solves this traffic problem.
The Analogy:
Imagine the AI is a chef in a kitchen.
- The GPU Chef: The chef is incredibly fast at chopping vegetables (doing math), but the ingredients are in a storage room across the street. The chef spends 90% of their time running to the storage room and back, and only 10% actually chopping.
- The FPGA Chef: This chef built a personal pantry right next to the cutting board. They put the entire 2MB "notebook" (the recurrent state) inside this pantry. Now, the chef never has to leave the kitchen. They can grab ingredients instantly.
Because the chef never stops to run to the storage room, the kitchen becomes incredibly efficient. The work changes from being "limited by how fast you can run" to "limited by how fast you can chop."
How They Did It (The Magic Tricks)
The paper describes three main "magic tricks" they used to make this work:
The "One-Trip" Rule:
Normally, to update the notebook, the chef has to read a page, jot down an intermediate result, then read it back and write the final version. That's several trips per page.
The researchers rearranged the math (algebra) so the chef only needs to read the page once and write it once. It's like doing your homework while you read the textbook, instead of reading the whole book, closing it, and then trying to remember what to write.

The "Twin-Head" Team:
The AI processes information in groups. Usually, it handles one group at a time. The researchers realized that two groups often share the same "questions" and "keys." So, they built a system where two chefs work side-by-side using the same set of instructions but writing in their own separate notebooks. This doubles the speed without needing double the space.

The Assembly Line:
Instead of waiting for the whole notebook to be updated before starting the next word, they built an assembly line. While the chef is writing the current word, the next chef is preparing the ingredients for the next word, and the third chef is packaging the finished word to send out. Everything happens at the same time.
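Assuming the standard gated delta-rule recurrence (a reasonable reading of "Gated DeltaNet," though the paper's exact kernel may differ), the "one-trip" trick can be sketched in NumPy: expand the algebra so the state matrix is streamed through once per token, with each row fetched once and written back once, instead of being revisited for an intermediate result.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One decode step. S is the fixed-size 'notebook' (d_v x d_k).

    The naive update  S_t = alpha * S @ (I - beta * outer(k, k)) + beta * outer(v, k)
    builds an intermediate matrix and touches S more than once. Expanding the
    algebra folds everything into a single rank-1 update:
        S_t = alpha * S + outer(beta*v - alpha*beta*(S @ k), k)
    so each state row can be read, updated, and written in one pass.
    """
    Sk = S @ k                          # the one full read of the state
    u = beta * v - alpha * beta * Sk    # tiny (d_v,) correction vector
    S_new = alpha * S + np.outer(u, k)  # the one full write of the state
    return S_new, S_new @ q             # new state and this token's output
```

In software the two forms give identical answers; the payoff is on hardware, where halving the trips through a memory-bound state roughly halves the time per token.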
The Results: Speed and Savings
They tested this new "kitchen" (the FPGA accelerator) against a top-of-the-line GPU (the NVIDIA H100).
- Speed: The FPGA was 4.5 times faster at generating each word.
- Energy: This is the biggest win. The GPU uses a lot of electricity (like a 350-watt heater) just to run the kitchen. The FPGA chip uses less than 10 watts (about the same as a bright lightbulb).
- Efficiency: Because it's so fast and uses so little power, the FPGA is 60 times more energy-efficient per word generated.
Why This Matters
As Artificial Intelligence gets smarter and more complex, the cost of running it is becoming a huge problem. This paper shows that by changing the hardware (the chip) to match the software (the new "notebook" algorithm), we can make AI:
- Faster: So you don't have to wait for answers.
- Cheaper: Because it uses way less electricity.
- Greener: Drastically reducing the carbon footprint of running AI models.
In short: They took a fast car stuck in traffic, built a private highway right next to the engine, and suddenly, the car is flying.