A Hybrid Residue Floating Numerical Architecture with Formal Error Bounds for High Throughput FPGA Computation

Imagine you are running a massive, high-speed factory that processes numbers. This factory is built on a special kind of machine called an FPGA (Field-Programmable Gate Array), which is like a Lego set for computers that can be rearranged to do specific jobs incredibly fast.

For decades, the factory has used a standard method called Floating-Point Arithmetic (like the math your calculator uses). It's very flexible and handles both huge numbers and tiny decimals well. But it's also clunky and slow. Every time two numbers meet, they have to stop, line up their decimal points, check if they are too big, and shuffle bits around. It's like two people trying to have a conversation, but they have to stop every sentence to check their watches, adjust their glasses, and make sure they are speaking at the same volume. It works, but it wastes a lot of time and energy.

The Problem: The "Traffic Jam"

The author of this paper, Mostafa Darvishi, noticed that this "stop-and-check" process creates a traffic jam in the factory. The machines are so busy organizing the numbers that they can't actually do the math as fast as they could.

The Solution: The "HRFNA" Factory

The paper introduces a new system called HRFNA (Hybrid Residue–Floating Numerical Architecture). Think of HRFNA as a completely redesigned factory floor that uses a clever trick to avoid the traffic jam.

Here is how it works, using a simple analogy:

1. The "Residue" System (The Parallel Assembly Lines)

Imagine you have a huge number, say 123. In the old system, you have to write it out as 1-2-3 and carry over digits if it gets too big.
In HRFNA, instead of writing the whole number, we break it down into three different "views" using three different clocks (moduli).

Clock A says: "It's 3."
Clock B says: "It's 1."
Clock C says: "It's 5."

The magic is that you can do math on these three views simultaneously and independently.

If you want to multiply two numbers, you just multiply the "3s," the "1s," and the "5s" at the exact same time.
No carrying over! There is no waiting for one line to finish before the next one starts. It's like having 100 workers painting a wall at the same time, rather than one worker painting the whole thing. This is the "Carry-Free" part.
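
For readers who want to see the trick in code, here is a minimal sketch of the "three clocks" idea. The moduli 7, 11, and 13 are picked purely for illustration (so the remainders differ from the clock readings above), and the function names are hypothetical rather than taken from the paper.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen for illustration only.
MODULI = (7, 11, 13)  # representable range: 7 * 11 * 13 = 1001 values

def to_residues(n, moduli=MODULI):
    """Encode a non-negative integer as one remainder per modulus."""
    return tuple(n % m for m in moduli)

def multiply(a, b, moduli=MODULI):
    """Multiply channel by channel; no channel depends on another."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))

def from_residues(r, moduli=MODULI):
    """Recover the integer with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r_i, m_i in zip(r, moduli):
        M_i = M // m_i
        # pow(x, -1, m) is the modular inverse (Python 3.8+).
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M

a = to_residues(123)  # (4, 2, 6)
b = to_residues(8)    # (1, 8, 8)
print(from_residues(multiply(a, b)))  # 984, which is 123 * 8
```

The result is only correct while the true product stays below the product of the moduli (1001 here), which is exactly why the scheme needs the "Volume Knob" described next.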

2. The "Floating" Part (The Volume Knob)

The problem with the "Residue" system is that it's great at math, but it's bad at knowing how big the number actually is. It's like having three people describing a car, but none of them know if it's a toy car or a truck.

HRFNA adds a single "Volume Knob" (an exponent) to the whole group.

The three views do the math fast and furious.
The Volume Knob just sits there, watching.
If the numbers get too huge (like if the factory starts producing giant trucks instead of toy cars), everything is shrunk back to a manageable size in one step, and the Volume Knob is adjusted once to record how much shrinking happened.

3. The "Normalization" (The Rare Cleanup)

In the old system, you had to check and adjust the volume knob after every single math problem.
In HRFNA, the Volume Knob only gets adjusted when the numbers get really big.

Analogy: Imagine a chef chopping vegetables. In the old system, the chef stops after every chop to measure the pile. In HRFNA, the chef chops, chops, chops, and only stops once every hour to measure the pile and maybe move it to a bigger bowl.
This "stop" is called Normalization. Because it happens so rarely, the factory never stops for long.

Why is this a Big Deal?

The paper proves that this new system isn't just a cool trick; it's mathematically sound and safe.

Speed: Because the factory doesn't stop to check the volume knob constantly, it runs up to 2.4 times faster than the old system.
Efficiency: It uses 38–55% fewer of the chip's basic logic blocks, called look-up tables (like fitting a bigger factory into a smaller building).
Accuracy: The author proved that the "Volume Knob" adjustments introduce very tiny errors, and these errors are predictable and bounded. It's not a "guess"; it's a calculated, safe margin of error.
Stability: They tested it on complex tasks like solving physics equations (ODE solvers) and multiplying huge matrices. The system didn't crash or drift off course; it stayed stable for millions of steps.

The Bottom Line

Think of HRFNA as a high-speed train compared to the old stop-and-go bus.

The Bus (Floating-Point) stops at every station to pick up and drop off passengers (normalization), making the trip slow and expensive.
The Train (HRFNA) runs on parallel tracks (Residue) and only stops at major terminals (Normalization) when absolutely necessary. It gets you to the destination much faster, uses less fuel, and arrives with a predictable schedule.

This new architecture is a game-changer for scientific computing, AI, and engineering simulations running on FPGAs, offering the best of both worlds: the speed of simple math and the flexibility of handling huge numbers.

Here is a detailed technical summary of the paper "A Hybrid Residue–Floating Numerical Architecture with Formal Error Bounds for High-Throughput FPGA Computation" by Mostafa Darvishi.

1. Problem Statement

Field-Programmable Gate Arrays (FPGAs) are increasingly used for numerically intensive workloads (scientific computing, signal processing, CAD). However, implementing standard IEEE-754 floating-point arithmetic on FPGAs is inherently inefficient due to:

High Hardware Cost: Wide datapaths, complex normalization logic, and carry propagation significantly increase area (LUTs) and power consumption.
Latency Bottlenecks: Multi-stage carry propagation and exponent alignment limit throughput and scalability in deeply pipelined designs.
Alternative Limitations:
- Fixed-point: Lacks the dynamic range required for iterative algorithms and long accumulation chains.
- Residue Number Systems (RNS): Offer carry-free parallelism but struggle with scaling, comparison, sign detection, and fractional representation without expensive reconstruction (Chinese Remainder Theorem - CRT).
- Logarithmic Number Systems (LNS): Efficient for multiplication but costly for addition/subtraction.
- Existing Hybrids: Often lack rigorous mathematical foundations, formal error bounds, or application-level stability validation.

There is a critical gap for a numerical system that simultaneously offers carry-free parallelism, wide dynamic range, bounded/analyzable error, and hardware efficiency suitable for general-purpose FPGA computation.

2. Methodology: The Hybrid Residue–Floating Numerical Architecture (HRFNA)

The paper proposes HRFNA, a fully specified numerical system that decouples integer arithmetic from dynamic-range management.

A. Mathematical Foundation

Number Space Definition: A hybrid number is defined as a tuple $(\mathbf{r}, f)$, where $\mathbf{r}$ is a residue vector (representing the integer magnitude) and $f$ is a global exponent (representing the scale).
- Value: $\Phi(\mathbf{r}, f) = \text{CRT}(\mathbf{r}) \cdot 2^f$.
Arithmetic Operations:
- Multiplication: Performed entirely in the residue domain ($\mathbf{r}_Z = \mathbf{r}_X \odot \mathbf{r}_Y$) with simple exponent addition ($f_Z = f_X + f_Y$). This is exact and carry-free.
- Addition: Requires exponent synchronization (scaling one operand) before residue addition.
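
The two operations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the moduli are hypothetical, only non-negative values are handled, and the synchronization rule used here (multiplying the larger-exponent operand by a power of two so both share the smaller exponent) is one plausible reading of "scaling one operand".

```python
from math import prod

# Hypothetical pairwise-coprime moduli; the summary does not list the paper's choice.
MODULI = (251, 253, 255, 256)
M = prod(MODULI)  # integers in [0, M) are representable

def encode(n, f=0):
    """Build the hybrid pair (r, f) for the value n * 2**f, with 0 <= n < M."""
    return tuple(n % m for m in MODULI), f

def crt(r):
    """Reconstruct the integer from its residues (Chinese Remainder Theorem)."""
    total = 0
    for r_i, m_i in zip(r, MODULI):
        M_i = M // m_i
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M

def value(x):
    """Phi(r, f) = CRT(r) * 2**f."""
    r, f = x
    return crt(r) * 2.0 ** f

def mul(x, y):
    """Residues multiply channel-wise; exponents add."""
    (rx, fx), (ry, fy) = x, y
    r = tuple((a * b) % m for a, b, m in zip(rx, ry, MODULI))
    return r, fx + fy

def add(x, y):
    """Bring both operands to the smaller exponent, then add channel-wise."""
    (rx, fx), (ry, fy) = x, y
    if fx < fy:  # make x the operand with the larger exponent
        (rx, fx), (ry, fy) = (ry, fy), (rx, fx)
    shift = fx - fy
    # Assumed synchronization rule: multiply the larger-exponent operand by 2**shift.
    rx = tuple((a * pow(2, shift, m)) % m for a, m in zip(rx, MODULI))
    r = tuple((a + b) % m for a, b, m in zip(rx, ry, MODULI))
    return r, fy

x = encode(3, 2)  # 3 * 2**2 = 12
y = encode(5, 0)  # 5 * 2**0 = 5
print(value(mul(x, y)), value(add(x, y)))  # 60.0 17.0
```

Both results are exact only while the underlying integer stays below `M`; keeping it there is the job of the normalization step.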
Normalization Strategy:
- Unlike floating-point, normalization is not performed after every operation.
- It is triggered only when the reconstructed integer magnitude exceeds a predefined threshold $\tau$.
- When triggered, the integer is scaled down by a power of two ($2^s$) and the exponent is incremented by $s$ to compensate. This is a rare, deterministic event.
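
A minimal sketch of threshold-triggered normalization follows. The moduli, the threshold `TAU`, and the step `S` are hypothetical values, and the rescaling truncates (shifts right) because the summary does not state the rounding mode. For simplicity the sketch reconstructs the integer on every call; the hardware described later avoids this by estimating the magnitude first.

```python
from math import prod

MODULI = (251, 253, 255, 256)  # hypothetical pairwise-coprime moduli
M = prod(MODULI)
TAU = 1 << 24                  # hypothetical magnitude threshold
S = 8                          # hypothetical scaling step

def to_residues(n):
    return tuple(n % m for m in MODULI)

def crt(r):
    """Reconstruct the integer from its residues (Chinese Remainder Theorem)."""
    total = 0
    for r_i, m_i in zip(r, MODULI):
        M_i = M // m_i
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M

def normalize(r, f):
    """Return (r, f, fired); rescale only when the integer exceeds TAU."""
    n = crt(r)
    if n <= TAU:
        return r, f, False
    # Drop the low S bits and raise the exponent by S to compensate.
    return to_residues(n >> S), f + S, True

# Repeated multiplication by 3, counting how often normalization fires.
r, f, events = to_residues(1), 0, 0
three = to_residues(3)
for _ in range(1000):
    r = tuple((a * b) % m for a, b, m in zip(r, three, MODULI))
    r, f, fired = normalize(r, f)
    events += fired
print(events, "normalizations in 1000 multiplications")
```

How often the step fires depends on how quickly the workload grows the magnitude and on the choice of `TAU` and `S`; a steadily growing product like this one triggers it far more often than a typical accumulation would.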

B. Error Analysis

Bounded Error: The paper proves that numerical error is introduced only during normalization events.
Bounds: Explicit absolute and relative error bounds are derived. The relative error is bounded by $2^{-s}$, where $s$ is the scaling step. This ensures that error growth is predictable and does not accumulate linearly with every operation, unlike standard floating-point rounding.
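
The bound can be checked empirically with a short script. This is a sketch under assumptions the summary does not confirm: normalization truncates the low $s$ bits, and the threshold satisfies $\tau \ge 2^{2s}$ (the values of `S` and `TAU` below are hypothetical). The paper's own derivation is not reproduced here.

```python
from fractions import Fraction
import random

S = 8          # hypothetical scaling step
TAU = 1 << 24  # hypothetical threshold, chosen so that TAU >= 2**(2*S)

def truncation_error(n, s=S):
    """Absolute and relative error from dropping the low s bits of n."""
    kept = (n >> s) << s  # integer represented after normalization
    abs_err = n - kept    # always in [0, 2**s)
    return abs_err, Fraction(abs_err, n)

random.seed(0)
# Only integers above the threshold are ever normalized.
samples = [random.randrange(TAU + 1, TAU << S) for _ in range(10_000)]
worst_rel = max(truncation_error(n)[1] for n in samples)

assert all(truncation_error(n)[0] < 2 ** S for n in samples)
assert worst_rel < Fraction(1, 2 ** S)
print(float(worst_rel))
```

Under these assumptions the reasoning is short: the absolute error is below $2^s$ and the integer being normalized is above $\tau$, so the relative error is below $2^s / \tau$, which is at most $2^{-s}$ whenever $\tau \ge 2^{2s}$.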

C. Hardware Microarchitecture

The FPGA implementation separates the datapath into three loosely coupled subsystems to maximize throughput:

Residue Arithmetic Pipeline: A bank of parallel modular arithmetic units (one per modulus) performing carry-free addition and multiplication.
Exponent Management Pipeline: A lightweight integer pipeline handling exponent updates and synchronization.
CRT-Based Normalization Engine: An off-path engine activated only when magnitude thresholds are exceeded. It reconstructs the integer, scales it, and re-encodes it into residues.

Magnitude Estimation: To avoid expensive CRT reconstruction for every comparison, HRFNA uses interval-based magnitude estimation (floating-point intervals) to guide normalization decisions. Only the selected candidate is fully reconstructed if normalization is needed.
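
One way such an estimator could work is sketched below. The summary gives no details of the paper's scheme, so everything here is an assumption: the `Interval` class, the outward rounding with `math.nextafter` (Python 3.9+), the restriction to non-negative magnitudes, and the rule that CRT is needed only when the upper bound exceeds a hypothetical threshold `TAU`.

```python
import math
from dataclasses import dataclass

TAU = float(1 << 24)  # hypothetical normalization threshold

@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] enclosing a non-negative integer magnitude."""
    lo: float
    hi: float

    def __mul__(self, other):
        # Round outward so the true product stays inside the interval.
        return Interval(math.nextafter(self.lo * other.lo, 0.0),
                        math.nextafter(self.hi * other.hi, math.inf))

    def __add__(self, other):
        return Interval(math.nextafter(self.lo + other.lo, 0.0),
                        math.nextafter(self.hi + other.hi, math.inf))

def must_reconstruct(est):
    """CRT is needed only if the magnitude could exceed TAU."""
    return est.hi > TAU

# Track the estimate alongside repeated multiplication by 3.
est = Interval(3.0, 3.0)
skipped = 0
while not must_reconstruct(est):
    est = est * Interval(3.0, 3.0)
    skipped += 1
print(skipped, "multiplications before a CRT check is needed")
```

The estimate runs next to the residue pipeline and costs a couple of floating-point operations per step, so the expensive reconstruction is deferred until the interval says the threshold may have been crossed.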

3. Key Contributions

Formal Numerical Model: Defined a hybrid number space with semantic mapping, proving the correctness of arithmetic operations and establishing that HRFNA behaves as a deterministic block-floating-like system with carry-free core arithmetic.
Rigorous Error Bounds: Derived explicit absolute and relative error bounds, confining rounding to infrequent normalization events, thereby guaranteeing bounded numerical error over long computation sequences.
High-Throughput FPGA Microarchitecture: Designed a deeply pipelined architecture sustaining an Initiation Interval (II) of one cycle under steady-state operation. Normalization latency is amortized over thousands of arithmetic operations.
Application-Level Validation: Validated the system on diverse workloads (Dot Products, Matrix Multiplication, Runge-Kutta ODE solvers), demonstrating long-term numerical stability.
Comprehensive Evaluation: Provided a comparative analysis against IEEE-754 FP32, Block Floating-Point (BFP), and prior hybrid systems, identifying a new design point in the trade-off space.

4. Experimental Results

The system was implemented on a Xilinx Zynq UltraScale+ ZCU104 FPGA and compared against IEEE-754 FP32 baselines.

Performance: Achieved up to 2.4× higher throughput than FP32.
Resource Efficiency: Reduced Look-Up Table (LUT) usage by 38–55%.
Energy Efficiency: Improved energy efficiency by up to 1.9×.
Numerical Accuracy:
- Dot Products: Maintained RMS error below $10^{-6}$ across vector lengths up to 64k, avoiding the error drift seen in Block Floating-Point systems.
- Matrix Multiplication: Kept error below $2 \times 10^{-6}$ for $128 \times 128$ matrices.
- ODE Solvers: Demonstrated stable behavior over $10^6$ time steps in a Runge-Kutta solver, with no exponential error growth or divergence.
Normalization Overhead: Normalization events occurred orders of magnitude less frequently than arithmetic operations (e.g., once per several thousand ops), confirming that the CRT reconstruction cost is effectively amortized.
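
The amortization argument can be made concrete with a back-of-envelope model. It is deliberately pessimistic and purely illustrative: it assumes each normalization stalls the pipeline for its full latency, and the operation counts and latencies passed in below are hypothetical, not measurements from the paper.

```python
def effective_cycles_per_op(ops_between_norms, norm_latency_cycles):
    """Average cycles per operation if each normalization fully stalls the pipeline.

    Steady-state arithmetic issues one operation per cycle (initiation
    interval of 1); the normalization latency is spread over the operations
    that run between two normalization events.
    """
    return 1.0 + norm_latency_cycles / ops_between_norms

# Hypothetical scenarios: (operations between normalizations, latency in cycles).
for ops, latency in [(100, 40), (1_000, 40), (4_000, 40)]:
    print(ops, latency, round(effective_cycles_per_op(ops, latency), 4))
```

Even in this worst-case model the overhead shrinks toward zero as normalization becomes rarer; an off-path engine that overlaps with the arithmetic pipeline would do better still.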

5. Significance and Impact

New Design Point: HRFNA occupies a previously unexplored niche that combines the dynamic range and stability of floating-point with the hardware efficiency and parallelism of residue arithmetic.
Predictability: By making normalization explicit and analyzable, HRFNA offers a level of numerical predictability crucial for scientific computing and CAD tools, where strict IEEE compliance is less important than bounded error and high throughput.
Scalability: The architecture is synthesis-friendly and scales well with deep pipelining, making it ideal for high-throughput accelerators in heterogeneous computing platforms.
General Purpose: Unlike many prior hybrid systems restricted to specific domains (e.g., cryptography), HRFNA is validated as a general-purpose numerical system suitable for iterative solvers and linear algebra.

In conclusion, the paper demonstrates that separating arithmetic execution from scale management, grounded in formal error analysis, allows for the creation of a numerical architecture that significantly outperforms traditional floating-point implementations on FPGAs while maintaining rigorous numerical stability.