Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator

Imagine you are running a massive, 24/7 customer service call center for a super-intelligent AI. This AI (a Large Language Model, or LLM) talks to thousands of people at once. Some people ask short questions ("What's the weather?"), while others ask for a 10,000-word story. Some minutes are quiet; other minutes, the phone lines are jammed.

The problem is that the "brain" of this AI is huge, and the "memory" it needs to remember the conversation (the KV Cache) is constantly changing size and shape.

The Old Way: The Rigid Warehouse

Existing computer chips (like GPUs) and even some new "Near-Memory" chips are like rigid, pre-fabricated warehouse shelves.

The Problem: If a customer asks for a short story, the system still has to reserve a giant shelf for a 10,000-word story, just in case. This wastes huge amounts of space (memory fragmentation).
The Bottleneck: If 100 people call at once, but only 10 shelves are free, the other 90 people have to wait, even if there's plenty of total space in the building. The system is too clumsy to rearrange the shelves on the fly.
The Result: The AI gets slow and inefficient, especially when the workload is unpredictable.

The New Solution: Helios (The Dynamic Lego City)

The paper introduces Helios, a new hardware design that acts like a dynamic, self-organizing Lego city built with a special super-glue called Hybrid Bonding.

Here is how Helios fixes the problems, using simple analogies:

1. The Glue: Hybrid Bonding (The Super-Highway)

Traditional chips are like a city where the library (memory) and the office (processor) are in different buildings connected by a slow, winding road.

Helios uses Hybrid Bonding to glue the library and the office together, floor-by-floor, with millions of tiny, ultra-fast elevators (wires) connecting them directly.
Analogy: Instead of driving to the library, the workers are now standing inside the library. They can grab books (data) instantly without leaving their desks. This solves the "traffic jam" of data moving between memory and the processor.

2. The Shelves: Fine-Grained Block Management (The Puzzle Pieces)

Old systems treat a conversation like a single, giant block of concrete. If you need to add one word, you have to move the whole block.

Helios breaks conversations into tiny Puzzle Pieces (Blocks).
Analogy: Imagine you are building a wall. Old systems say, "You must use a 10-foot brick." If you only need 2 feet, you waste 8 feet. Helios says, "Use 1-foot bricks." You can fit exactly as many bricks as you need, no matter how big or small the request is.
The Magic: These puzzle pieces are scattered across all the workers (Processing Engines) in the city. If one worker has space, they take a piece. If another is busy, a different worker takes it instead. The only space that can go unused is the leftover room in a conversation's last piece.
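To make the brick analogy concrete, here is a minimal Python sketch of block-based bookkeeping. It is an illustration under assumptions, not the paper's design: the block size of 16 tokens, the BlockPool class, and every function name are hypothetical, and the source does not say how Helios sizes or tracks its blocks.

```python
import math

BLOCK_TOKENS = 16  # hypothetical block size, in tokens (not from the paper)


def blocks_needed(num_tokens, block_tokens=BLOCK_TOKENS):
    """Number of fixed-size blocks needed to hold num_tokens."""
    return math.ceil(num_tokens / block_tokens)


def wasted_tokens_reserved(num_tokens, max_tokens):
    """Unused slots when a full max_tokens shelf is reserved up front."""
    return max_tokens - num_tokens


def wasted_tokens_blocked(num_tokens, block_tokens=BLOCK_TOKENS):
    """Unused slots when memory is handed out one small block at a time."""
    return blocks_needed(num_tokens, block_tokens) * block_tokens - num_tokens


class BlockPool:
    """Toy pool of KV-cache blocks spread across processing engines (PEs)."""

    def __init__(self, num_pes, blocks_per_pe):
        # Each free block is identified by (pe_id, slot); interleave PEs so
        # consecutive allocations land on different PEs.
        self.free = [(pe, slot) for slot in range(blocks_per_pe)
                     for pe in range(num_pes)]
        self.table = {}  # request id -> list of (pe_id, slot)

    def append_tokens(self, request_id, total_tokens):
        """Grow a request's block list so it can hold total_tokens tokens."""
        owned = self.table.setdefault(request_id, [])
        missing = blocks_needed(total_tokens) - len(owned)
        if missing > len(self.free):
            raise MemoryError("not enough free blocks")
        for _ in range(max(missing, 0)):
            owned.append(self.free.pop(0))
        return owned

    def release(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free.extend(self.table.pop(request_id, []))
```

With these made-up numbers, a 100-token conversation wastes 9,900 slots if a 10,000-token shelf is reserved for it, but only 12 slots when it is stored in 16-token blocks (7 blocks hold 112 tokens).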

3. The Traffic Cop: Spatially-Aware Allocation (The Smart GPS)

Just having puzzle pieces isn't enough; you need to know where to put them so the workers don't have to run across the city to talk to each other.

Helios has a Smart GPS (Spatially-Aware Allocation).
Analogy: When a new conversation starts, the GPS doesn't just throw the puzzle pieces anywhere. It looks at the map and says, "Put these pieces on the desks of the workers who are sitting closest to each other."
Result: The workers can pass notes to each other instantly because they are neighbors. This minimizes the time spent shouting across the room (data transfer overhead).
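The "Smart GPS" idea can be sketched as a small placement routine. This is a hedged illustration only: it assumes the workers sit on a 2D grid, uses Manhattan distance as a stand-in for communication cost, and applies a simple greedy search. The function names and the cost model are hypothetical; the source does not describe Helios's actual allocation policy.

```python
def manhattan(a, b):
    """Grid distance between two PE coordinates (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def place_blocks(free_blocks, num_blocks):
    """Pick PEs for num_blocks blocks so the chosen PEs sit close together.

    free_blocks maps a PE coordinate (row, col) to its number of free blocks.
    Returns a list of PE coordinates, one entry per placed block.
    """
    if num_blocks == 0:
        return []
    if sum(free_blocks.values()) < num_blocks:
        raise MemoryError("not enough free blocks")
    best = None
    for seed in free_blocks:
        if free_blocks[seed] == 0:
            continue
        # Visit PEs from nearest to farthest from this candidate seed.
        order = sorted(free_blocks, key=lambda pe: (manhattan(seed, pe), pe))
        placement, cost = [], 0
        for pe in order:
            take = min(free_blocks[pe], num_blocks - len(placement))
            placement.extend([pe] * take)
            cost += take * manhattan(seed, pe)
            if len(placement) == num_blocks:
                break
        # Keep the seed whose blocks are, in total, closest to it.
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best[1]
```

For example, with free space {(0, 0): 1, (0, 1): 1, (3, 3): 2, (3, 2): 1} and a request for 3 blocks, this sketch picks the two neighboring workers in the corner, (3, 3) and (3, 2), rather than spreading the pieces across the grid.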

4. The Assembly Line: Distributed Tiled Attention (The Pizza Kitchen)

When the AI calculates the answer, it has to look at every word the user said so far.

Old Way: Everyone stops to look at the whole book at once, then stops to calculate. It's chaotic.
Helios: It uses a Tiled Assembly Line.
Analogy: Imagine a team of chefs making a huge pizza. Instead of one chef trying to top the whole pizza at once, they split the pizza into slices. Each chef spreads sauce on their slice, then passes it to the next chef for cheese, and so on. They work in parallel, overlapping their tasks so no one stands still.
Result: The AI generates words much faster, even for very long conversations.
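One well-known way to compute attention slice by slice is to let each tile produce a partial result and then merge the partials with a running maximum (often called online softmax). The sketch below shows that general technique for a single query in plain Python. It is an assumption that Helios works along these lines; the source does not give its exact dataflow, and all function names here are hypothetical.

```python
import math


def attend_tile(q, keys, values):
    """Partial attention over one tile.

    Returns (max_score, exp_sum, weighted_values) so tiles can be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), acc


def merge(a, b):
    """Combine two partial results, rescaling both to a shared maximum."""
    m = max(a[0], b[0])
    fa, fb = math.exp(a[0] - m), math.exp(b[0] - m)
    total = fa * a[1] + fb * b[1]
    acc = [fa * x + fb * y for x, y in zip(a[2], b[2])]
    return m, total, acc


def tiled_attention(q, keys, values, tile_size):
    """Softmax attention for one query, computed one tile of keys at a time."""
    partial = None
    for start in range(0, len(keys), tile_size):
        tile = attend_tile(q, keys[start:start + tile_size],
                           values[start:start + tile_size])
        partial = tile if partial is None else merge(partial, tile)
    _, total, acc = partial
    return [x / total for x in acc]
```

Because merging is associative, each tile could in principle be handled by a different worker and combined afterward; calling tiled_attention with a small tile_size gives the same answer (up to rounding) as processing all keys in a single tile.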

The Bottom Line: Why Should You Care?

The paper shows that Helios is a game-changer:

Speed: The paper reports that it is 3.25 times faster than the existing GPU and near-memory designs it is compared against.
Efficiency: It is reported to be 3.36 times more energy-efficient on the same job.
Responsiveness: It handles the "rush hour" of AI requests much better, meaning you won't have to wait as long for the AI to reply, even when thousands of people are using it at once.

In short: Helios turns a clumsy, rigid warehouse into a nimble, self-organizing city where data flows with far less delay, very little space is wasted, and the AI can respond faster.
