Imagine you are running a massive, high-speed library where the goal is to answer millions of complex questions from people all at once. The "books" in this library are Large Language Models (LLMs)—giant AI brains that can write, reason, and even understand images. The "librarians" are the GPUs (computer chips) that do the heavy lifting.
This paper is a report card on how well a specific type of librarian, the AMD Instinct MI325X, performs when trying to serve four different, super-complex AI "books" to a crowd of up to 1,000 people simultaneously.
Here is the breakdown in simple terms, using some fun analogies.
1. The Setup: A Super-Strong Library
The researchers set up a server with 8 AMD MI325X chips.
- The Memory (HBM3e): Think of this as the librarian's desk space. Each chip has a massive 256GB desk. With 8 chips, they have a 2 Terabyte desk. That's huge! It means they can keep almost the entire "book" open on the desk at once without having to run back and forth to the basement (the CPU) to fetch pages. This is a game-changer because running back and forth slows everything down.
- The Speed: Each chip can move data onto its desk at roughly 6 terabytes per second (its HBM3e memory bandwidth)—a super-highway for information.
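The desk-space claim is easy to sanity-check with back-of-the-envelope math. A quick sketch (the "FP8, one byte per parameter" assumption is mine, not from the paper):

```python
# Back-of-the-envelope "desk space" math for an 8-chip MI325X server.
hbm_per_gpu_gb = 256            # HBM3e capacity of one MI325X chip
num_gpus = 8
total_hbm_gb = hbm_per_gpu_gb * num_gpus
print(total_hbm_gb, "GB total")          # 2048 GB = 2 TB of desk space

# Can the biggest "book" fit? Assuming FP8 weights (1 byte per parameter),
# a 405-billion-parameter model needs roughly 405 GB for weights alone.
params_billion = 405
weight_gb = params_billion * 1           # 1 GB per billion params at FP8

# Fraction of the desk the open book occupies; the rest stays free for
# the librarian's working notes (the KV cache).
print(f"weights use {weight_gb / total_hbm_gb:.0%} of the desk")
```

Even the densest model in the test leaves most of the desk free, which is why trips to the "basement" (CPU memory) are rarely needed.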
2. The Four "Books" (AI Models)
The team tested four different AI models, which are like four different types of books with very different writing styles:
- Llama-3.1-405B: A Dense book. Every single page is read for every question. It's heavy and requires a lot of brainpower.
- DeepSeek V3.2 & Kimi-K2.5: MoE (Mixture of Experts) books with MLA (Multi-head Latent Attention). Imagine a book where, for every question, only a tiny team of 3 or 4 specific experts is called in to answer, while the rest of its 384 experts take a coffee break. This makes them very efficient, but they use a special, compressed way of organizing their notes (MLA) that is tricky to read.
- Qwen3-VL-235B: A Vision book. It can look at pictures and read text. It's also a "Mixture of Experts" but uses a more standard way of organizing notes (GQA).
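The "tiny team of experts" idea can be sketched in a few lines. This is a toy router, not the real thing (actual MoE models use a learned gating network and softmax-weighted expert outputs; the top-4 figure just follows the analogy above):

```python
import random

def route_to_experts(token_scores, top_k=4):
    """Pick the top_k highest-scoring experts for one token.
    Toy version of MoE routing: rank all experts by their gate score
    and call in only the best few."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return ranked[:top_k]

random.seed(0)
scores = [random.random() for _ in range(384)]   # 384 experts on the shelf
active = route_to_experts(scores)
print(len(active), "of", len(scores), "experts called in for this token")
```

The point of the sketch: the model *stores* all 384 experts (they all occupy desk space), but only runs a handful per token, which is where the compute savings come from.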
3. The Big Discovery: "One Size Does Not Fit All"
The most important lesson from this paper is that you cannot use the same instructions for every book.
- The "MLA" Problem: The DeepSeek and Kimi books use a special "compressed notes" system (MLA). On the AMD chips, this system is picky.
- It requires the librarian to process notes one by one (Block Size 1), which is slow.
- It refuses to use the "basement storage" (KV Cache Offloading) even though the desk is huge.
- Crucially: To read these books fast, the librarian must use a special tool called AITER. Without it, reading speed drops to a crawl. However, the tool doesn't fit one specific book (Kimi) at all: that book has too many "chapters" (attention heads) for it to handle, so it had to be turned off completely.
- The "GQA" Freedom: The Llama and Qwen books use a standard note system. They are flexible. They can use the basement storage if needed and work well with standard tools.
The Analogy: Imagine trying to drive a Ferrari and a Tractor on the same road. If you treat them both exactly the same, the Ferrari might get stuck, and the Tractor might go too slow. You need different driving modes for each.
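The difference between "compressed notes" (MLA) and "standard notes" (GQA) is easiest to see in the per-token memory math. The shapes below are illustrative guesses in the spirit of these model families (a Llama-405B-like GQA layout and a DeepSeek-like MLA latent), not numbers from the paper:

```python
def kv_bytes_per_token(n_layers, values_per_layer, bytes_per_value=2):
    """Bytes of cache one token occupies across all layers (fp16 = 2 bytes)."""
    return n_layers * values_per_layer * bytes_per_value

# GQA ("standard notes"): cache full keys AND values for every KV head.
# Illustrative shape: 126 layers, 8 KV heads, head dimension 128.
gqa = kv_bytes_per_token(n_layers=126, values_per_layer=2 * 8 * 128)

# MLA ("compressed notes"): cache one small latent vector per token instead.
# Illustrative shape: 61 layers, 512-dim latent + 64 positional (RoPE) dims.
mla = kv_bytes_per_token(n_layers=61, values_per_layer=512 + 64)

print(f"GQA: {gqa} bytes/token, MLA: {mla} bytes/token")
print(f"the MLA cache is about {gqa / mla:.1f}x smaller per token")
```

This is exactly the trade the paper describes: MLA's compressed notes save a lot of desk space, but *reading* them fast requires special kernels (AITER), and without those kernels the savings don't translate into speed.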
4. The Results: Who Won the Race?
The Text Race (Answering questions with words only)
- The Surprise: The Llama book (405 billion parameters) and the DeepSeek book (only 37 billion active parameters) finished at almost exactly the same speed.
- Why? Even though DeepSeek is a "Mixture of Experts" (only using a small team), the special "compressed notes" system (MLA) it uses is a bit clunky on AMD chips, and that clunkiness cancels out the efficiency gains.
- The Winner: They tied, but Llama was slightly more consistent.
The Vision Race (Answering questions with pictures)
- The Big Gap: The Qwen book (MoE + Standard Notes) was 6.5 times faster than the Kimi book (MoE + Compressed Notes).
- Why? Qwen could use all 8 chips and standard tools. Kimi was forced to use only 4 chips (because of the "chapter count" issue) and couldn't use the special speed-up tool (AITER).
- Lesson: If you want speed on AMD chips right now, avoid the "compressed notes" (MLA) models unless you have the specific tools to make them work.
5. The Bottleneck: The "Highway" vs. The "Engine"
The researchers found something fascinating about why the speed stops increasing.
- The Engine: The chips are powerful enough to calculate answers very fast.
- The Highway: The speed limit is actually the memory bandwidth (how fast data can move onto the desk).
- The Saturation Point: No matter how many people you add to the queue, once you hit about 500 people (for short questions), the highway gets jammed. Adding more people just makes them wait longer; it doesn't make the library answer faster.
- The Good News: The library never crashed. Even with 1,000 people screaming for answers, every single request was answered successfully. The system just got slower, but it didn't break.
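The saturation behaviour above can be captured with a toy queueing model: below the saturation point, latency stays flat; above it, extra users simply wait in line, so latency grows with concurrency while throughput stays capped. The 500-request ceiling and 1-second service time are illustrative placeholders, not the paper's measurements:

```python
def latency_seconds(concurrency, saturation_point=500, service_time=1.0):
    """Toy model of a saturated serving system.
    Below the saturation point, every request finishes in the base
    service time. Above it, the system is already running flat out,
    so extra requests just queue (Little's law: latency scales with
    the number of requests in flight divided by peak throughput)."""
    if concurrency <= saturation_point:
        return service_time
    return concurrency / saturation_point * service_time

for users in (100, 500, 1000):
    print(users, "users ->", latency_seconds(users), "s per request")
```

Note what the model does *not* do: it never drops a request. That matches the paper's finding that the system degrades gracefully rather than failing.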
6. Practical Takeaways for the Real World
If you are a company trying to run these AI models on AMD chips, here is the cheat sheet:
- Check the Model Type First: Before you start, ask: "Is this model using MLA (compressed notes) or GQA (standard notes)?"
- Turn on the Special Tool (AITER): For most models, you need this AMD tool to get good speed. But be careful: for the "Kimi" model, you must turn it OFF, or the system will crash.
- Don't Overcrowd: If you have short questions, aim for about 500 people at a time. If you have long questions, aim for 100–200. Adding more won't help; it will just create a traffic jam.
- Big Memory is King: Because these AMD chips have such huge desks (256GB each), you rarely need to run back to the basement (CPU) to get data. This keeps things fast and simple.
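The cheat sheet above can be condensed into a small, hypothetical helper. The setting names here are my own shorthand for the paper's recommendations, not real serving-framework flags:

```python
def serving_config(model_name, attention_type):
    """Hypothetical cheat-sheet helper encoding the takeaways above.
    attention_type: 'MLA' (compressed notes) or 'GQA' (standard notes)."""
    cfg = {
        # Only GQA models can spill cache to the "basement" (CPU memory).
        "kv_cache_offload_allowed": attention_type == "GQA",
        # The AMD speed-up tool is on by default...
        "use_aiter": True,
    }
    if attention_type == "MLA":
        cfg["block_size"] = 1  # MLA processes its notes one by one
        if model_name.lower().startswith("kimi"):
            # ...except for Kimi, whose attention-head count AITER
            # cannot handle, so the tool must be switched off.
            cfg["use_aiter"] = False
    return cfg

print(serving_config("DeepSeek-V3.2", "MLA"))
print(serving_config("Kimi-K2.5", "MLA"))
print(serving_config("Llama-3.1-405B", "GQA"))
```

The design point: the branch on attention type comes *first*, because that single property decides everything else—which is precisely the paper's "one size does not fit all" lesson.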
Summary
This paper proves that AMD's new chips are powerful enough to run the world's biggest AI models, including a 1-trillion-parameter giant. However, to get the best performance, you can't just use a "default" setting. You have to be a smart librarian who knows exactly which tool to use for which book. If you get the configuration right, you can get blazing-fast speeds; if you get it wrong, the system will struggle.