RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators

RedFuser is an automatic framework that employs a formal theoretical methodology to identify and fuse cascaded reduction operations into optimized single-loop kernels, achieving significant speedups over state-of-the-art AI compilers while matching the performance of hand-written solutions.

Xinsheng Tang, Yangcheng Li, Nan Wang, Zhiyi Shu, Xingyu Ling, Junna Xing, Peng Zhou, Qiang Liu

Published Thu, 12 Ma

Imagine you are running a massive, high-speed kitchen (an AI accelerator) trying to prepare a complex dish for millions of customers (AI models). The recipe involves a series of steps where you have to taste, adjust, and combine ingredients repeatedly.

In the world of AI, these steps are called Cascaded Reductions. They are like a chain of tasks where you can't start the next one until the previous one is finished. For example, to make a "Safe Softmax" (a common math operation in AI), you first have to find the biggest number in a list, then subtract that number from every entry and exponentiate the results, then add up those exponentials, and finally divide each exponential by that sum.
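To make the chain concrete, here is a minimal Python sketch of Safe Softmax on a plain list of floats. The function names are illustrative and do not come from the paper. It also shows why the biggest number is subtracted first: without that step, the exponentials overflow on large inputs.

```python
import math

def naive_softmax(xs):
    # No max subtraction: math.exp raises OverflowError for large inputs.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def safe_softmax(xs):
    m = max(xs)                            # reduction 1: find the biggest number
    exps = [math.exp(x - m) for x in xs]   # subtract it, then exponentiate
    total = sum(exps)                      # reduction 2: the sum, which depends on m
    return [e / total for e in exps]       # final division

if __name__ == "__main__":
    print(safe_softmax([1.0, 2.0, 3.0]))
    try:
        naive_softmax([1000.0, 1001.0])
    except OverflowError:
        print("naive version overflowed")
    print(safe_softmax([1000.0, 1001.0]))
```

The second reduction cannot start until the first one has produced `m`, which is exactly the dependency that makes the chain "cascaded".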

The Problem: The "Stop-and-Go" Traffic Jam

Currently, most AI compilers (the chefs' managers) handle this chain of tasks inefficiently. They treat each step as a separate trip to the pantry.

  1. Trip 1: Go get the ingredients, find the biggest number, write it down, and go back to the pantry.
  2. Trip 2: Go get the ingredients again, subtract the number, exponentiate, add everything up, write down the sum, and go back.
  3. Trip 3: Go get the ingredients again to do the final division.
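The three trips can be counted directly. The sketch below is only an illustration: `CountingList` is a hypothetical stand-in for the pantry (off-chip memory) that tallies every element read, and `unfused_softmax` is a toy model of the three-pass schedule, not the output of any real compiler.

```python
import math

class CountingList:
    """Hypothetical stand-in for off-chip memory that counts element reads."""
    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        self.reads += 1
        return self._data[i]

def unfused_softmax(x):
    n = len(x)
    # Trip 1: read everything to find the max.
    m = max(x[i] for i in range(n))
    # Trip 2: read everything again to sum the shifted exponentials.
    total = sum(math.exp(x[i] - m) for i in range(n))
    # Trip 3: read everything a third time to normalize.
    return [math.exp(x[i] - m) / total for i in range(n)]

if __name__ == "__main__":
    x = CountingList([0.5, 1.5, -2.0, 3.0])
    unfused_softmax(x)
    print(x.reads, "reads for", len(x), "elements")  # 12 reads for 4 elements
```

Every element is fetched three times, so memory traffic grows as 3N for a list of N numbers.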

This is like a delivery driver who drops off a package, drives all the way back to the warehouse to pick up the next package, and drives back out again, even though the next package was sitting right next to the first one. It wastes time (latency) and clogs up the roads (memory bandwidth).

Furthermore, because each step depends on the result of the last, the kitchen has to store all the intermediate results on the counter. If the list of ingredients is huge, the counter gets cluttered, and the chef runs out of space to work.

The Solution: RedFuser (The "Super-Organizer")

The paper introduces RedFuser, a new framework that acts like a genius sous-chef who realizes, "Wait, we don't need to make three separate trips! We can do all these steps in one smooth motion."

RedFuser uses a clever mathematical trick to fuse (merge) these separate steps into a single, continuous loop.

The Creative Analogy: The Assembly Line vs. The Relay Race

  • Old Way (Relay Race): Imagine a relay race where Runner A runs the whole track, stops, hands a baton to Runner B, who then runs the whole track again, stops, and hands it to Runner C. They are all running the same distance, but they keep stopping and starting.
  • RedFuser Way (Assembly Line): Now imagine an assembly line where the product moves down a conveyor belt. As it passes Station A, a worker adds a screw. As it immediately passes Station B, another worker adds a nut. As it passes Station C, a worker tightens it. The product never stops moving, and the workers don't have to run back and forth.

How RedFuser Works (The Magic Tricks)

1. The "One-Time Load" Trick
Instead of loading the data from the main pantry (memory) three times, RedFuser loads it once. It keeps the data in a small, super-fast "pocket" (on-chip memory) and performs all the math operations on it while it's there. This eliminates the traffic jams.
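A rough sketch of the idea, assuming the data is processed in small tiles: each tile is copied out of "memory" exactly once, and both the max and the sum are updated from that local copy. The function name, the `tile_size` parameter, and the load counter are all made up for illustration. The rescaling line is what lets the sum survive a change in the max.

```python
import math

def fused_max_and_sum(memory, tile_size=4):
    """Return (max, sum of exp(x - max), loads), loading each element once."""
    loads = 0
    run_max = float("-inf")
    run_sum = 0.0
    for start in range(0, len(memory), tile_size):
        tile = memory[start:start + tile_size]   # the only load of this tile
        loads += len(tile)
        tile_max = max(tile)
        new_max = max(run_max, tile_max)
        # Rescale the old sum to the new max, then add this tile's share.
        run_sum = run_sum * math.exp(run_max - new_max) + sum(
            math.exp(v - new_max) for v in tile
        )
        run_max = new_max
    return run_max, run_sum, loads

if __name__ == "__main__":
    data = [0.5, 1.5, -2.0, 3.0, 2.5, -1.0, 0.0, 4.0, 1.0]
    m, s, loads = fused_max_and_sum(data)
    print(m, s, loads)
```

Both reductions finish after N loads instead of 2N. A final normalization step would still need the values, unless it is also done while the tile is on chip or folded into whatever consumes the result.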

2. The "Incremental Update" Trick (The Rolling Calculator)
This is the paper's most brilliant insight. Usually, you need the complete result of one step before you can start the next: in Safe Softmax, the sum cannot be computed until the final biggest number is known. With a million numbers, that means either re-reading the whole list or parking a million intermediate values on the counter until the earlier step finishes.

RedFuser uses an Incremental Computation method. Think of it like a rolling calculator:

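  • Instead of waiting for the whole list to finish, you keep a running result and update it as each number arrives. If an earlier step's answer changes along the way (say, a new biggest number shows up), you apply a small correction to the running total instead of starting over.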
  • You don't need a giant counter; you just need a tiny pocket to hold the current total.
  • This allows the system to handle massive lists of data without running out of "counter space" (memory), even on small devices.
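Here is what a rolling calculator could look like for the max-then-sum chain in Safe Softmax. This is a sketch of the well-known "online softmax" style of update, used here as an assumed example of incremental computation. It is not RedFuser's generated code, and it assumes finite inputs.

```python
import math

def rolling_max_and_sum(stream):
    """Consume an iterable once, keeping only two floats of state."""
    run_max = float("-inf")
    run_sum = 0.0
    for x in stream:
        if x > run_max:
            # A new max arrived: shrink the old total to the new reference point.
            run_sum *= math.exp(run_max - x)
            run_max = x
        run_sum += math.exp(x - run_max)
    return run_max, run_sum

if __name__ == "__main__":
    # A generator can only be read once, so a two-pass method could not use it.
    stream = (v for v in [1.0, 3.0, 2.0])
    print(rolling_max_and_sum(stream))
```

For the inputs 1, 3, 2, the total starts at 1, is rescaled by exp(1 - 3) when 3 arrives, and ends at exp(-2) + 1 + exp(-1). That is the same value a two-pass method would produce after first finding the max of 3.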

The Results: Speeding Up the Kitchen

The authors tested RedFuser on real-world AI tasks like:

  • Attention Mechanisms: The core of how AI models (like chatbots) understand context.
  • MoE Routing: How models decide which "expert" to ask for help.
  • FP8 Quantization: A way to shrink models to make them faster.
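Attention is a good example of why these workloads count as cascaded reductions: for each query there is a max over the scores, a sum of exponentials that depends on that max, and a weighted sum of values that depends on both. The sketch below fuses all three into one loop for a single query, using the widely known online-softmax pattern. It is a hand-written illustration under that assumption, with made-up function names, not code produced by RedFuser.

```python
import math

def fused_attention_row(q, keys, values):
    """One query's attention output in a single pass over keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    run_max = float("-inf")
    run_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(run_max, score)
        correction = math.exp(run_max - new_max)   # rescales older contributions
        weight = math.exp(score - new_max)
        run_sum = run_sum * correction + weight
        acc = [a * correction + weight * vi for a, vi in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]

def reference_attention_row(q, keys, values):
    """Multi-pass version, used only to check the fused result."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, 0.1], [1.0, -1.0], [3.0, 2.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(fused_attention_row(q, keys, values))
    print(reference_attention_row(q, keys, values))
```

The fused version never materializes the full list of scores or weights; it carries only the running max, the running sum, and one accumulator the size of a value vector.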

The Outcome:

  • RedFuser made these tasks run 2 to 5 times faster than the best existing AI compilers (like TVM or PyTorch).
  • It performed just as well as hand-written code by expert engineers (who usually spend weeks manually optimizing these specific tasks).
  • Crucially, it did this automatically. You don't need a PhD in math to use it; the framework figures out the fusion for you.

In a Nutshell

RedFuser is a tool that stops AI computers from making unnecessary trips to the memory pantry. It combines multiple math steps into one smooth, continuous flow, allowing AI models to run significantly faster and more efficiently, bringing us closer to having super-smart, lightning-fast AI assistants.