Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

This paper presents Rigel, an empirical study that reverse-engineers the Metal 4.1 tensor compute path on Apple's M4 Max. It finds that the fp8 matmul2d operation is memory-bound and emulated rather than hardware-accelerated, uncovers hidden execution details, and uses them to build a hand-fused kernel that outperforms the standard decomposed path by up to 12.9%.

Original authors: Ramchand Kumaresan

Published 2026-06-12
Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine Apple's new M4 Max computer chip as a massive, high-tech kitchen. For years, developers (the chefs) have been told that this kitchen has a special, super-fast "Tensor Oven" designed specifically to bake complex AI recipes (like Large Language Models) at lightning speed. The official instruction manual (the Metal 4.1 specification) says, "Yes, we have this oven, and it supports these specific ingredients." But the manual is frustratingly vague: it doesn't say how the oven works, where it is located, or if it's actually faster than the standard stove.

The paper RIGEL is like a team of food scientists who decided to stop reading the manual and start tasting the food to figure out what's really happening in the kitchen. They built a set of ultra-precise measuring tools to reverse-engineer the M4 Max's behavior.

Here is what they discovered, translated into everyday analogies:

1. The "Special Oven" is Actually Just a Regular Stove

The Claim: The paper's measurements indicate that the M4 Max does not have a dedicated "Tensor Core" (a special hardware unit just for AI math).
The Analogy: Think of the "Tensor" operation as a specific type of cake. The manual implies there's a special, high-speed conveyor belt just for these cakes. RIGEL found that, on the M4 Max, there is no conveyor belt. Instead, the kitchen staff (the GPU shader cores) are just making these cakes by hand on the regular stove, using the same tools they use for everything else. They are doing it efficiently, but they aren't using a secret, dedicated machine.

2. The "Magic Ingredient" (FP8) is a Trick, Not a Superpower

The Claim: The paper tests a low-precision data format called FP8 (which uses half the memory of the standard FP16 format). The spec suggests this might be faster because it's smaller. RIGEL found it is not faster; it's actually slightly slower.
The Analogy: Imagine you are carrying water buckets. The FP16 format is a large bucket; the FP8 format is a small bucket. You might think, "If I use small buckets, I can make more trips in the same time!" But the scientists found that on the M4 Max, the workers have to stop and pour each small bucket into a big one before they can use the water. This "pouring" (unpacking) takes time.

  • The Result: Using the small buckets (FP8) doesn't make the job go faster. It only saves you from having to carry as many heavy buckets at once (saving memory space). It's a space-saver, not a speed-saver. The paper calls this a "memory-footprint feature, not a performance feature."
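To make the "pouring" step concrete, here is a minimal Python sketch of widening FP8 values before doing math on them. It assumes the common E4M3 bit layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7); this summary does not say which FP8 variant the paper measured, and the function names are hypothetical, not taken from Metal or from the paper.

```python
def decode_fp8_e4m3(byte: int) -> float:
    """Decode one FP8 byte, ASSUMING the E4M3 layout (1 sign, 4 exponent, 3 mantissa bits)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")  # E4M3 reserves this bit pattern for NaN
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1, exponent bias of 7
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def unpack_fp8_buffer(packed: bytes) -> list:
    """The extra 'pouring' pass: widen every stored byte before any math runs."""
    return [decode_fp8_e4m3(b) for b in packed]


weights = bytes([0x38, 0xB8, 0x7E, 0x01])  # illustrative values only
widened = unpack_fp8_buffer(weights)

fp8_storage_bytes = len(weights)        # 1 byte per element
fp16_storage_bytes = 2 * len(weights)   # 2 bytes per element
```

The storage count halves, but every element pays for a decode before it can be multiplied. That is the trade the analogy describes: less space, no less work.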

3. The "Secret Recipe" Layout

The Claim: The manual says the way data is arranged in the "Tensor" memory is "opaque" (hidden) and "device specific." RIGEL figured out exactly how it's arranged.
The Analogy: Imagine the kitchen staff is told to arrange ingredients in a grid, but the manual just says, "Do it in a secret pattern." RIGEL watched the staff and realized they are arranging the ingredients in a very specific 8x8 square pattern. Once the scientists figured out this secret grid, they could rearrange the ingredients to make the cooking process smoother.
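As a rough illustration of what an 8x8 tiled layout means, the sketch below maps a (row, column) position to an offset in tiled storage. The summary only states that the pattern is built from 8x8 squares; the ordering of tiles and of elements inside each tile (both row-major here) is an assumption made for the example, and `tiled_offset` is a hypothetical helper.

```python
TILE = 8  # tile edge length, from the 8x8 pattern described above


def tiled_offset(row: int, col: int, n_cols: int, tile: int = TILE) -> int:
    """Map a matrix coordinate to a linear offset in tile-blocked storage.

    ASSUMPTION: tiles are stored row-major, and elements inside a tile
    are row-major too. The real device ordering may differ.
    """
    if n_cols % tile != 0:
        raise ValueError("n_cols must be a multiple of the tile size")
    tiles_per_row = n_cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    within_tile = (row % tile) * tile + (col % tile)
    return tile_index * tile * tile + within_tile


# In a 16x16 matrix, element (0, 8) starts the second tile, so it lands
# 64 elements in, not 8 elements in as it would in a plain row-major array.
second_tile_start = tiled_offset(0, 8, n_cols=16)
all_offsets = sorted(tiled_offset(r, c, 16) for r in range(16) for c in range(16))
```

Knowing a mapping like this is what lets a hand-written kernel read each 8x8 block as one contiguous run instead of hopping across rows.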

4. The "Version Gate" (The Bouncer)

The Claim: The manual says you can use these features with a certain software version (Xcode 26.1+), but RIGEL found that this does not hold in practice.
The Analogy: The manual says, "Anyone with a red hat can enter the VIP room." But when RIGEL tried to enter with a red hat from last year (Xcode 26.5), the bouncer kicked them out. You actually need a brand new red hat (Xcode 27) and a specific ID card (macOS 27.0) to even get the door to open. The manual was simply wrong about who could get in.
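The gap between the documented gate and the observed one can be written down as a simple version comparison. The thresholds below (Xcode 27, macOS 27.0) come from the paragraph above; the function names are hypothetical, and a real project would rely on the toolchain's own availability checks rather than string parsing.

```python
def parse_version(text: str) -> tuple:
    """Turn '26.5' into (26, 5, 0) so versions compare component by component."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '27'
    return tuple(parts)


DOCUMENTED_MIN_XCODE = (26, 1, 0)  # what the manual states
OBSERVED_MIN_XCODE = (27, 0, 0)    # what the summary reports actually works
OBSERVED_MIN_MACOS = (27, 0, 0)


def passes_documented_gate(xcode: str) -> bool:
    return parse_version(xcode) >= DOCUMENTED_MIN_XCODE


def passes_observed_gate(xcode: str, macos: str) -> bool:
    return (parse_version(xcode) >= OBSERVED_MIN_XCODE
            and parse_version(macos) >= OBSERVED_MIN_MACOS)


# Xcode 26.5 satisfies the manual but not the bouncer.
manual_says_ok = passes_documented_gate("26.5")
bouncer_says_ok = passes_observed_gate("26.5", "27.0")
```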

5. The "Super-Chef" Optimization

The Claim: Because the scientists now know exactly how the kitchen works (no secret oven, an 8x8 grid, FP8 that is no faster), they wrote a new, custom recipe.
The Analogy: Instead of following the standard, step-by-step instructions (Bake Cake -> Add Icing -> Add Fruit), which involves walking back and forth to the pantry three times, the scientists wrote a "fusion" recipe. They combined the baking, icing, and fruit steps into one smooth motion right at the stove.

  • The Result: This custom recipe was 6.5% to 12.9% faster for tasks that fit in the kitchen's immediate workspace (cache-resident). However, for huge tasks that require running back and forth to the warehouse (main memory), the speedup disappears because the walking time dominates the cooking time.
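The difference between the decomposed and fused recipes can be sketched in a few lines of plain Python. This summary does not say which follow-up steps the paper fuses into the matrix multiply, so the bias add and ReLU below are assumptions chosen for illustration, and pure Python stands in for GPU kernels.

```python
def matmul(a, b):
    """Plain matrix multiply over lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def decomposed(a, b, bias):
    """Three passes, each writing a full intermediate (three pantry trips)."""
    product = matmul(a, b)                                        # pass 1
    biased = [[v + bias[j] for j, v in enumerate(row)] for row in product]  # pass 2
    return [[max(v, 0.0) for v in row] for row in biased]         # pass 3


def fused(a, b, bias):
    """One pass: finish each output element while it is still 'at the stove'."""
    inner, cols = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(cols):
            acc = sum(row[k] * b[k][j] for k in range(inner))
            out_row.append(max(acc + bias[j], 0.0))  # bias + ReLU applied in place
        out.append(out_row)
    return out


a = [[1.0, 2.0], [3.0, -4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [-2.0, 1.0]

same_answer = decomposed(a, b, bias) == fused(a, b, bias)
result = fused(a, b, bias)
```

Both versions compute the same numbers. The fused one simply never writes the two intermediate matrices, which is where the saving comes from when the data already sits in cache.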

Summary of the "Big Reveal"

The paper concludes that on the Apple M4 Max:

  • No Magic Hardware: There is no special AI engine; it's all running on the standard graphics cores.
  • FP8 is for Storage, Not Speed: Using smaller numbers saves space but doesn't make calculations faster.
  • The Manual is Incomplete: The official documentation hides the hardware reality, the exact memory layout, and the true software requirements.
  • Custom Code Wins: If you know the secret layout, you can write a custom program that beats the standard Apple tools by a small but measurable margin.

The author emphasizes that these findings are based on hard data from a single M4 Max chip running a beta version of the operating system, and has released all the code so anyone can verify the results themselves.
