Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute… — Plain-Language Explanation

Imagine Apple's new M4 Max chip as a massive, high-tech kitchen. For years, developers (the chefs) have been told that this kitchen has a special, super-fast "Tensor Oven" designed specifically to bake complex AI recipes (like Large Language Models) at lightning speed. The official instruction manual (the Metal 4.1 specification) says, "Yes, we have this oven, and it supports these specific ingredients." But the manual is frustratingly vague: it doesn't say how the oven works, where it is located, or whether it's actually faster than the standard stove.

The paper RIGEL is like a team of food scientists who decided to stop reading the manual and start tasting the food to figure out what's really happening in the kitchen. They built a set of ultra-precise measuring tools to reverse-engineer the M4 Max's behavior.

Here is what they discovered, translated into everyday analogies:

1. The "Special Oven" is Actually Just a Regular Stove

The Claim: The paper reports that, in its measurements, the M4 Max shows no sign of a dedicated "Tensor Core" (a special hardware unit just for AI math).
The Analogy: Think of the "Tensor" operation as a specific type of cake. The manual implies there's a special, high-speed conveyor belt just for these cakes. RIGEL found that, on the M4 Max, there is no conveyor belt. Instead, the kitchen staff (the GPU shader cores) are just making these cakes by hand on the regular stove, using the same tools they use for everything else. They are doing it efficiently, but they aren't using a secret, dedicated machine.

2. The "Magic Ingredient" (FP8) is a Trick, Not a Superpower

The Claim: The paper tests a low-precision data format called FP8 (which uses half the memory of the standard FP16 format). The spec suggests this might be faster because it's smaller. RIGEL found it is not faster; it's actually slightly slower.
The Analogy: Imagine you are carrying water buckets. The FP16 format is a large bucket; the FP8 format is a small bucket. You might think, "If I use small buckets, I can make more trips in the same time!" But the scientists found that on the M4 Max, the workers have to stop and pour each small bucket into a big bucket before they can use the water. This "pouring" (unpacking) takes time.

The Result: Using the small buckets (FP8) doesn't make the job go faster. It only saves you from having to carry as many heavy buckets at once (saving memory space). It's a space-saver, not a speed-saver. The paper calls this a "memory-footprint feature, not a performance feature."
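To make the "pouring" step concrete, here is a minimal Python sketch of what unpacking an FP8 value involves. It assumes the common E4M3 encoding (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7); the text above does not say which FP8 variant the paper tests, and it does not describe how the M4 Max performs the conversion. The function names `decode_fp8_e4m3` and `weight_bytes` are hypothetical, chosen only for illustration.

```python
import math


def decode_fp8_e4m3(byte):
    """Widen one FP8 byte to a Python float (assumed E4M3 layout, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # E4M3 reserves this pattern for NaN and has no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1, exponent stored with a bias of 7
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def weight_bytes(count, bytes_per_value):
    """Storage needed for `count` values at a given width."""
    return count * bytes_per_value


if __name__ == "__main__":
    for raw in (0x38, 0x7E, 0xC0):
        print(hex(raw), decode_fp8_e4m3(raw))
    print("FP8 bytes: ", weight_bytes(1_000_000, 1))
    print("FP16 bytes:", weight_bytes(1_000_000, 2))
```

Every value has to pass through a widening step like this before the arithmetic can use it, which is the extra work the bucket analogy describes. The storage side is simple arithmetic: one byte per value instead of two.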

3. The "Secret Recipe" Layout

The Claim: The manual says the way data is arranged in the "Tensor" memory is "opaque" (hidden) and "device specific." RIGEL figured out exactly how it's arranged.
The Analogy: Imagine the kitchen staff is told to arrange ingredients in a grid, but the manual just says, "Do it in a secret pattern." RIGEL watched the staff and realized they are arranging the ingredients in a very specific 8x8 square pattern. Once the scientists figured out this secret grid, they could rearrange the ingredients to make the cooking process smoother.
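As a rough illustration of what an 8x8 tiled arrangement means, the Python sketch below maps a (row, column) position in a matrix to its position in memory. The text only says the pattern is built from 8x8 squares, so the ordering used here (tiles stored one after another in row-major order, and row-major inside each tile) is an assumption, not the layout the paper recovered. `tiled_offset` is a hypothetical helper name.

```python
TILE = 8  # tile edge length, from the 8x8 pattern described above


def tiled_offset(row, col, n_cols, tile=TILE):
    """Return the linear offset of element (row, col) in a tiled layout.

    Assumes tiles are stored row-major, and elements inside a tile are
    also row-major. The real device ordering may differ.
    """
    if n_cols % tile != 0:
        raise ValueError("n_cols must be a multiple of the tile size")
    tiles_per_row = n_cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    within_tile = (row % tile) * tile + (col % tile)
    return tile_index * tile * tile + within_tile


if __name__ == "__main__":
    # In a 16-column matrix, the element just right of the first tile
    # starts the second tile instead of sitting at offset 8.
    print(tiled_offset(0, 8, n_cols=16))
    print(tiled_offset(1, 0, n_cols=16))
```

The point of the sketch is that neighbours in the matrix are not always neighbours in memory. Code that knows the mapping can read and write whole tiles in order instead of jumping around.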

4. The "Version Gate" (The Bouncer)

The Claim: The manual says you can use these features with a certain software version (Xcode 26.1+), but RIGEL found that this does not hold in practice.
The Analogy: The manual says, "Anyone with a red hat can enter the VIP room." But when RIGEL tried to enter with a red hat from last year (Xcode 26.5, which meets the stated 26.1+ requirement), the bouncer kicked them out. You actually need a brand new red hat (Xcode 27) and a specific ID card (macOS 27.0) to even get the door to open. The manual was simply wrong about who could get in.
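The gap between the documented gate and the observed one can be written down as two small checks. This Python sketch uses only the version numbers quoted above; treating each requirement as "this version or newer" is an assumption, and the function names are hypothetical.

```python
SPEC_MIN_XCODE = (26, 1)      # what the manual states
OBSERVED_MIN_XCODE = (27, 0)  # what RIGEL found was needed
OBSERVED_MIN_MACOS = (27, 0)


def parse_version(text):
    """Turn '26.5' into (26, 5); pad '27' to (27, 0) so comparisons work."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def passes_documented_gate(xcode):
    return parse_version(xcode) >= SPEC_MIN_XCODE


def passes_observed_gate(xcode, macos):
    return (parse_version(xcode) >= OBSERVED_MIN_XCODE
            and parse_version(macos) >= OBSERVED_MIN_MACOS)


if __name__ == "__main__":
    print(passes_documented_gate("26.5"))        # meets the manual's rule
    print(passes_observed_gate("26.5", "27.0"))  # still turned away
    print(passes_observed_gate("27", "27.0"))
```

Xcode 26.5 passes the first check and fails the second, which is the mismatch the bouncer analogy describes.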

5. The "Super-Chef" Optimization

The Claim: Because the scientists now know exactly how the kitchen works (no secret oven, 8x8 grid, FP8 is slow), they wrote a new, custom recipe.
The Analogy: Instead of following the standard, step-by-step instructions (Bake Cake -> Add Icing -> Add Fruit), which involves walking back and forth to the pantry three times, the scientists wrote a "fusion" recipe. They combined the baking, icing, and fruit steps into one smooth motion right at the stove.

The Result: This custom recipe was 6.5% to 12.9% faster for tasks that fit in the kitchen's immediate workspace (cache-resident). However, for huge tasks that require running back and forth to the warehouse (main memory), the speedup disappears because the walking time dominates the cooking time.
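The "fusion" idea can be sketched in a few lines of Python. The text does not say which operations the paper fuses, so this example assumes a matrix multiply followed by a bias add and a ReLU as stand-ins for bake, icing, and fruit. `unfused` and `fused` are hypothetical names, and plain Python lists stand in for GPU memory.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def unfused(a, b, bias):
    """Three passes: each step writes a full intermediate matrix."""
    c = matmul(a, b)                                             # trip 1
    c = [[v + bias[j] for j, v in enumerate(row)] for row in c]  # trip 2
    return [[max(v, 0.0) for v in row] for row in c]             # trip 3


def fused(a, b, bias):
    """One pass: bias and ReLU are applied while the value is still in hand."""
    inner, cols = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(cols):
            acc = sum(row[k] * b[k][j] for k in range(inner))
            out_row.append(max(acc + bias[j], 0.0))
        out.append(out_row)
    return out


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[1, 0], [0, -1]]
    bias = [0.5, 1.0]
    print(unfused(a, b, bias))
    print(fused(a, b, bias))
```

Both versions produce the same numbers. The fused one never stores the intermediate matrices, which is where the saved "walking" comes from. When the inputs themselves have to be fetched from main memory, that fetch costs the same in both versions, which fits the observation that the gain fades for large tasks.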

Summary of the "Big Reveal"

The paper concludes that on the Apple M4 Max:

No Magic Hardware: There is no special AI engine; it's all running on the standard graphics cores.
FP8 is for Storage, Not Speed: Using smaller numbers saves space but doesn't make calculations faster.
The Manual is Incomplete: The official documentation hides the hardware reality, the exact memory layout, and the true software requirements.
Custom Code Wins: If you know the secret layout, you can write a custom program that beats the standard Apple tools by a small but measurable margin (6.5% to 12.9%) on tasks that fit in cache; on larger, memory-bound tasks the advantage disappears.

The authors emphasize that these findings are based on hard data from a single M4 Max chip running a beta version of the operating system, and they have released all their code so anyone can verify the results themselves.

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

1. The "Special Oven" is Actually Just a Regular Stove

2. The "Magic Ingredient" (FP8) is a Trick, Not a Superpower

3. The "Secret Recipe" Layout

4. The "Version Gate" (The Bouncer)

5. The "Super-Chef" Optimization

Summary of the "Big Reveal"

Technical Summary: RIGEL – Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max

Problem Statement

Methodology: The RIGEL Harness

Key Findings and Results

1. Execution Target: No Dedicated Matrix Unit

2. The "Headline": fp8 is Emulated, Not Accelerated

3. Numeric Semantics and Hidden Constraints

4. Optimization Opportunities

Contradictions and Version Gates

Significance and Conclusion
