Energy Efficient Exact and Approximate Systolic Array Architecture for Matrix Multiplication

This paper proposes an energy-efficient 8x8 systolic array architecture built from novel exact and approximate processing elements (PPC and NPPC). The designs achieve power savings of 22% and 32%, respectively, while maintaining high output quality for error-resilient image and vision processing applications such as DCT and edge detection.

Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu

Published 2026-03-24

Imagine you are running a massive, high-speed factory that builds complex 3D puzzles. This factory is the brain of modern Artificial Intelligence (AI). Every day, it has to perform billions of tiny calculations called "multiplications" to figure out how to recognize a cat in a photo, translate a sentence, or drive a car.

This paper is about redesigning the workers inside this factory to make them faster, cheaper, and far more energy-efficient, without ruining the final puzzle.

Here is the breakdown using simple analogies:

1. The Problem: The Exhausted Factory

In current AI computers (like the ones in your phone or Google's servers), the "workers" (called Processing Elements or PEs) are perfectionists. They calculate every single number with 100% exact precision.

  • The Issue: Being a perfectionist takes a lot of energy and space. It's like hiring a team of accountants who count every single grain of sand on a beach to measure the beach's size. It's accurate, but it's exhausting and slow.
  • The Result: These factories (chips) get hot, drain batteries quickly, and are too big to fit in small devices like smartwatches or drones.
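The "factory" described above is a systolic array: a grid of workers (PEs) where each one repeatedly multiplies two incoming numbers and adds the result to a running total. A minimal functional sketch of that behavior (a plain-Python simulation for illustration, not the paper's hardware design):

```python
# An N x N output-stationary systolic array, simulated functionally.
# Each PE (i, j) holds one accumulator; on step k it multiplies the
# value arriving from the left (A[i][k]) by the value arriving from the
# top (B[k][j]) and adds it to its accumulator: one multiply-accumulate.

def systolic_matmul(A, B):
    n = len(A)
    acc = [[0] * n for _ in range(n)]  # one accumulator per PE
    # In real hardware the inputs stream in with a one-cycle skew;
    # functionally, every PE performs n MAC operations.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                acc[i][j] += A[i][k] * B[k][j]  # the PE's MAC
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Every one of those multiply-accumulate steps is where the "perfectionist" energy goes, which is exactly what the paper targets.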

2. The Solution: The "Good Enough" Workers

The authors of this paper proposed a new design for these workers. They introduced two types of workers:

  • The Exact Worker: Still perfect, but built with a smarter, more efficient blueprint.
  • The Approximate Worker: This is the star of the show. This worker is willing to make tiny, almost invisible mistakes to save huge amounts of energy.

The Analogy:
Imagine you are painting a picture of a sunset.

  • The Exact Worker measures the color of every single pixel to ensure the orange is exactly #FFA500.
  • The Approximate Worker looks at the orange and says, "That's close enough to sunset orange," and moves on.
  • The Catch: To the human eye, the painting looks identical. But the Approximate Worker used 68% less energy to finish the job!
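One common way to build a "good enough" multiplier is to simply ignore the least significant bits of the inputs, since they barely affect the result. This toy example illustrates that general idea (it is not the paper's actual PPC/NPPC circuit):

```python
# Exact vs. approximate multiplication: dropping the lowest bits of
# each operand removes hardware but introduces only a tiny relative
# error. A generic illustration of the accuracy/energy trade-off.

def exact_mul(a, b):
    return a * b

def approx_mul(a, b, drop_bits=2):
    mask = ~((1 << drop_bits) - 1)  # zero out the lowest bits
    return (a & mask) * (b & mask)

a, b = 181, 205
exact = exact_mul(a, b)    # 37105
approx = approx_mul(a, b)  # 180 * 204 = 36720
err = abs(exact - approx) / exact
print(f"exact={exact} approx={approx} relative error={err:.2%}")
```

The answer is off by about 1%, yet the hardware that would have computed those low bits simply does not exist, so it burns no power at all.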

3. The Secret Sauce: "Partial Product Cells"

How did they make these workers so efficient? They redesigned the tools the workers use.

  • Old Tools: The old tools were heavy and clunky. They had to do a lot of extra steps to handle negative numbers (like debts in a bank account).
  • New Tools (PPC & NPPC): The authors invented new, lightweight tools.
    • PPC (Positive Tool): Handles the "good" numbers efficiently.
    • NPPC (Negative Tool): Handles the "debt" numbers efficiently using a clever trick (like using a NAND gate, which is a simple logic switch).
  • The Result: These new tools are smaller, faster, and use less electricity. It's like swapping a heavy steam engine for a sleek electric motor.
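In bit-level terms, an ordinary partial product is just an AND of two bits. For signed (two's-complement) numbers, the partial products touching the sign bit have negative weight, and the classic Baugh-Wooley trick replaces their AND with a NAND plus a fixed correction constant, so no separate negation hardware is needed. Below is a generic sketch of that trick (illustrating the NAND idea the article mentions, not the paper's exact PPC/NPPC netlists):

```python
# Signed n-bit multiplication built from AND/NAND partial products,
# Baugh-Wooley style: cells with exactly one sign bit use NAND instead
# of AND, and two constant correction terms are added at the end.

def nand(x, y):
    return 1 - (x & y)

def baugh_wooley_mul(a, b, n=4):
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            sign_cell = (i == n - 1) != (j == n - 1)  # exactly one sign bit
            pp = nand(abits[i], bbits[j]) if sign_cell else abits[i] & bbits[j]
            total += pp << (i + j)
    total += (1 << n) + (1 << (2 * n - 1))  # Baugh-Wooley correction
    return total & ((1 << (2 * n)) - 1)     # wrap to 2n-bit result

def to_signed(x, bits):
    return x - (1 << bits) if x >= (1 << (bits - 1)) else x

# -3 * 5 in 4-bit two's complement:
print(to_signed(baugh_wooley_mul(0b1101, 0b0101), 8))  # -15
```

The payoff is that every cell is a single cheap gate (AND or NAND), which is the spirit of the lightweight "positive" and "negative" tools described above.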

4. The Proof: Does the Picture Still Look Good?

You might ask, "If they make mistakes, won't the AI fail?"
The authors tested this in three real-world scenarios:

  1. Compressing Photos (DCT): Imagine shrinking a photo to fit in an email.
    • Result: The photo looked almost perfect (45.97 dB quality). The "mistakes" were so small you couldn't see them.
  2. Finding Edges (Kernel Method): Imagine a robot trying to draw the outline of a cup.
    • Result: With a moderate amount of "sloppiness," the outline was still very clear (30.45 dB).
  3. Smart Edge Detection (CNN): This is the big test. Imagine a self-driving car trying to see the edge of a road.
    • Result: Amazingly, the system was super accurate (75.98 dB). Why? Because the AI network is smart enough to ignore the tiny errors made by the workers and still figure out the road perfectly.
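Those "dB" scores are PSNR (peak signal-to-noise ratio): a standard measure of how closely the approximate output matches the exact one, where higher is better and differences above roughly 30 dB are hard to see. A minimal sketch of the standard formula (not the paper's evaluation code):

```python
# PSNR compares an approximate image against the exact reference:
# PSNR = 10 * log10(peak^2 / MSE), with peak = 255 for 8-bit pixels.
import math

def psnr(exact, approx, peak=255):
    mse = sum((e - a) ** 2 for e, a in zip(exact, approx)) / len(exact)
    if mse == 0:
        return float("inf")  # images are identical
    return 10 * math.log10(peak ** 2 / mse)

exact  = [52, 120, 200, 33, 90, 250, 17, 180]
approx = [51, 121, 198, 33, 91, 249, 18, 179]
print(f"{psnr(exact, approx):.2f} dB")  # 47.16 dB
```

Here the pixels are off by only a point or two, and the score lands in the mid-40s, much like the DCT result above: numerically imperfect, visually identical.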

5. Why This Matters

This paper is a game-changer for the future of technology because:

  • Battery Life: Your phone or drone could last much longer because the computer isn't wasting energy on "perfect" math that no one notices.
  • Size: These chips can be smaller, meaning we can put powerful AI into tiny devices (like hearing aids or smart glasses).
  • Speed: With less heat and faster arithmetic, AI can run more smoothly.

The Bottom Line

The authors didn't just make the workers faster; they made them smarter about when to be perfect and when to be "good enough." They proved that in the world of AI, you don't need to be perfect to be brilliant. You just need to be efficient.

In short: They built a super-efficient factory that saves massive amounts of energy while still producing pictures and decisions that look and feel exactly the same to us humans.
