Data-Rate-Aware High-Speed CNN Inference on FPGAs

This paper presents a data-rate-aware CNN accelerator architecture for FPGAs. It combines multi-pixel processing with design-space exploration to keep hardware utilization and resource efficiency high across varying data rates, enabling the efficient implementation of complex CNNs on a single device.

Tobias Habermann, Martin Kumm

Published Wed, 11 Ma

Imagine you are running a high-speed assembly line in a factory that builds custom cars (these cars are the "images" being analyzed by a computer). Your goal is to inspect every car as fast as possible using a team of specialized workers (the "hardware" on a chip called an FPGA).

This paper is about fixing a major bottleneck in how these factories are built.

The Problem: The "Bottleneck" in the Assembly Line

In the past, engineers built these factories in two main ways:

  1. The "Super-Factory" approach: They hired a massive army of workers to inspect every single part of the car at the exact same time. This is incredibly fast, but if the car design changes (like a smaller engine or fewer wheels), you end up with 90% of your workers standing around doing nothing. It's a waste of money and space.
  2. The "One-by-One" approach: They hired a small team that inspects cars one by one. This is efficient with space, but it's too slow for a high-speed factory.

The specific problem this paper tackles is Pooling and Striding. In computer vision, these are steps where the image gets "zoomed out" or simplified. Imagine you have a high-resolution photo of a crowd, and you shrink it down to half its size. Suddenly, you have half as many details to process.

If your factory was built to handle the "high-resolution" crowd, and then the image shrinks, your workers are left staring at empty space. They are underutilized. Previous solutions tried to fix this by changing the workers' schedules, but they were limited to processing just one car part at a time.
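The numbers behind this bottleneck are easy to see. A minimal sketch (with a hypothetical 224×224 input, typical for MobileNet-class models) shows how a single 2×2 pooling layer with stride 2 leaves a fixed-throughput pipeline mostly idle:

```python
# Why pooling starves a fixed-throughput pipeline: each 2x2 pooling
# layer with stride 2 quarters the number of pixels reaching the next
# stage. Numbers below are illustrative, not from the paper.

def pixels_per_image(width: int, height: int) -> int:
    return width * height

def after_pool(width: int, height: int, stride: int = 2):
    # 2x2 pooling with stride 2 halves each spatial dimension
    return width // stride, height // stride

w, h = 224, 224                      # hypothetical input resolution
full_rate = pixels_per_image(w, h)   # pixels the first stage must handle

w2, h2 = after_pool(w, h)
reduced_rate = pixels_per_image(w2, h2)

# A stage sized for the full input rate is 75% idle after pooling:
utilization = reduced_rate / full_rate
print(f"{full_rate} -> {reduced_rate} pixels, utilization {utilization:.0%}")
```

Running this prints `50176 -> 12544 pixels, utilization 25%`: three quarters of the "workers" downstream of the pooling layer would be staring at empty space.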

The Solution: The "Multi-Pixel" Smart Factory

The authors propose a new design that is Data-Rate Aware. Think of this as a factory that can instantly reconfigure its assembly line based on how many parts are actually arriving.

Here is the core innovation, explained through a metaphor:

The "Double-Shift" Worker
Imagine a worker who usually inspects one car door per second.

  • Old Way: If the factory slows down, the worker just sits there waiting for the next door.
  • New Way: The authors designed a worker who can inspect two doors at once (or even more, depending on the need).

But it's not just about hiring more workers. It's about smart scheduling.

  • When the image is huge (lots of data), the factory runs at full speed, processing many pixels (parts) simultaneously.
  • When the image shrinks (less data), the factory doesn't just idle; it reconfigures itself to process fewer pixels per second, but keeps the workers busy by adjusting how they share the workload.

How They Did It (The "Magic" Tricks)

The paper introduces a few clever tricks to make this work:

  1. The "Compressor Tree" (The Efficient Stack):
    Imagine you have 100 people adding numbers. Instead of having them all shout their answers to one person (which causes a traffic jam), they form a pyramid. Two people add their numbers, pass the result to the next level, and so on. This paper uses a mathematical trick to build these "pyramids" of calculations so that the factory uses fewer resources (like electricity and space) while staying fast.
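The pyramid metaphor maps directly onto a balanced adder tree. Real compressor trees work at the bit level with carry-save arithmetic on the FPGA fabric; the sketch below only illustrates the tree shape and why its depth grows logarithmically instead of linearly:

```python
# A minimal sketch of the "pyramid": reduce many operands in a
# balanced tree of pairwise adders rather than one long chain.
# (Real compressor trees operate on bits with carry-save adders.)

from math import ceil, log2

def tree_sum(values):
    """Sum values by repeatedly adding neighbouring pairs.
    Tree depth is ~log2(n), versus n-1 steps for a sequential chain."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # carry an odd leftover element upward
            pairs.append(level[-1])
        level = pairs
        depth += 1
    return level[0], depth

total, depth = tree_sum(range(100))   # the "100 people adding numbers"
print(total, depth)                   # 4950 in only 7 levels
```

One hundred sequential additions would take 99 steps; the pyramid needs only ceil(log2(100)) = 7 levels, which is what keeps the factory fast.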

  2. The "Time-Traveling" Delay (The Conveyor Belt):
    To process two pixels at once, the factory needs to make sure the right parts arrive at the right time. The authors figured out how to "delay" the arrival of certain parts on the conveyor belt so that when a worker grabs two parts, they are perfectly aligned. It's like a dance where everyone steps in perfect rhythm, even if they are holding different props.
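In hardware this alignment is just a short register chain on one of the lanes. A hypothetical software model of the idea (the function name and delay value are illustrative, not from the paper):

```python
# A sketch of the alignment trick: to hand a worker two pixels per
# cycle, one lane of the stream is delayed by a small register chain
# so matching pixels arrive in the same cycle.

from collections import deque

def paired_stream(pixels, delay=1):
    """Yield (current, delayed) pairs: a chain of `delay` registers
    holds earlier pixels until their partner arrives."""
    line = deque([None] * delay)      # the "conveyor belt" of registers
    for p in pixels:
        line.append(p)
        yield p, line.popleft()       # current pixel + delayed pixel

stream = list(paired_stream([10, 20, 30, 40]))
print(stream)   # [(10, None), (20, 10), (30, 20), (40, 30)]
```

After the first cycle fills the delay registers, every cycle delivers a perfectly aligned pair, so the "double-shift" worker never waits.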

  3. The "Skip-Step" Strategy:
    Sometimes, the factory needs to skip steps (like when an image is downsized). The authors realized that if you know you are going to skip a step, you don't need to build a worker for that specific step at all. You can just remove that part of the assembly line entirely, saving huge amounts of space.
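The skip-step idea is visible even in a tiny 1-D convolution: with stride 2, half the output positions are never needed, so the loop (and, in hardware, the corresponding logic) for them simply does not exist. This is an illustrative software analogue, not the paper's HDL:

```python
# A minimal sketch of the "skip-step" strategy: with stride 2, the
# in-between output positions are never computed, so no hardware
# "worker" needs to be built for them at all.

def conv1d_strided(signal, kernel, stride=2):
    k = len(kernel)
    out = []
    # Step by `stride`: skipped positions are never visited.
    for i in range(0, len(signal) - k + 1, stride):
        out.append(sum(signal[i + j] * kernel[j] for j in range(k)))
    return out

print(conv1d_strided([1, 2, 3, 4, 5, 6], [1, 1, 1]))  # [6, 12]
```

A stride-1 version would compute four outputs here; the strided one computes two, and the hardware for the other two can be removed from the design entirely.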

The Results: Speed vs. Efficiency

The team tested this on a famous AI model called MobileNet (used in things like self-driving cars and phone cameras).

  • The Speed Demon: When they pushed the factory to its limit, they achieved 16,000 frames per second. That is like watching a movie in fast-forward so fast that you see 16,000 scenes in a single second. This is more than 3 times faster than the best previous designs.
  • The Efficiency Expert: When they slowed the factory down to save resources, they found that they could run the same model using 22% fewer workers (compute resources on the chip) and 15% less storage space (on-chip memory) than before.

The Bottom Line

Think of this paper as the blueprint for a chameleon factory.

  • When the workload is heavy, it expands to become a massive, high-speed super-factory.
  • When the workload is light, it shrinks down, reconfiguring its workers to stay busy without wasting space or energy.

This allows engineers to put incredibly complex AI brains onto a single, small chip (an FPGA) that can run efficiently whether the AI is looking at a tiny, blurry image or a massive, high-definition video. It's the difference between building a factory that only works on Tuesdays and one that adapts to work perfectly every single day.