Here is an explanation of the paper "Compiler-First State Space Duality and Portable O(1) Autoregressive Caching for Inference," translated into everyday language with creative analogies.
The Big Problem: The "Specialized Tool" Trap
Imagine you have an incredibly powerful, high-tech robot (an AI model called Mamba-2) that can write stories, solve math problems, and chat with you.
However, there's a catch: To make this robot move fast, the original creators built it using specialized, custom-made wrenches that only fit NVIDIA brand engines (their GPUs). If you try to run this robot on a different engine (like Google's TPUs, Apple's chips, or even a standard computer CPU), the wrenches don't fit. The robot either moves incredibly slowly or doesn't work at all.
This creates a "lock-in" problem. If you want to use this advanced AI, you must buy expensive NVIDIA hardware.
The Solution: The "Universal Translator"
The author, Cosmo Santoni, asked a simple question: "Do we really need these custom wrenches, or can we just teach the robot to use standard tools?"
The answer is yes. The paper shows that Mamba-2's internal logic is actually very "neat" and organized. It doesn't need custom hardware tricks; it just needs a smart compiler (a software translator that turns code into machine instructions) to organize the work efficiently.
The author built a version of Mamba-2 that uses standard, universal tools (called XLA primitives). This means the same code runs perfectly on:
- 🖥️ CPUs (Standard computers)
- 🎮 NVIDIA GPUs (The original target)
- ☁️ Google TPUs (Cloud supercomputers)
The Analogy: Instead of building a custom engine for every car brand, the author built a universal adapter. Now, the AI can drive on any road, in any country, without needing a mechanic to rebuild the engine every time.
How It Works: The "Library" vs. The "Notebook"
To understand why this is a big deal, we need to look at how AI models remember things while they talk.
1. The Old Way (Transformers): The "Growing Notebook"
Most AI models (like the ones powering early chatbots) work like a student taking notes in a notebook.
- Every time the AI says a new word, it writes it down in the notebook.
- To understand the next word, it has to flip back through every single page of the notebook to see what was said before.
- The Problem: As the conversation gets longer, the notebook gets huge. The student spends more time flipping pages than thinking. This is slow and uses a lot of memory.
2. The New Way (Mamba-2): The "Magic Summary"
Mamba-2 is different. It doesn't keep a notebook. Instead, it keeps a single, magical summary card in its pocket.
- Every time a new word comes in, the AI updates this one card instantly.
- The size of the card never changes, no matter if the conversation is 10 words or 10,000 words.
- The Benefit: This is called O(1) Caching. "O(1)" is math-speak for "constant time." It means the speed stays the same whether the story is short or long.
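For the technically curious, the "growing notebook" versus "magic summary card" contrast can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual kernels: the dimensions, the scalar dynamics `A` and `B`, and all variable names are invented for the example.

```python
import numpy as np

d = 4                                    # toy hidden size (illustrative only)
rng = np.random.default_rng(0)

# Transformer-style "growing notebook": the cache gains one entry per token,
# so memory (and lookup work) grows with the length of the conversation.
kv_cache = []
for t in range(100):
    kv_cache.append(rng.standard_normal(d))

# SSM-style "magic summary": one fixed-size state, updated in place.
# Memory and per-token work stay constant -- this is the O(1) cache.
A, B = 0.9, 0.1                          # toy scalar dynamics, not real weights
state = np.zeros(d)
for t in range(100):
    x = rng.standard_normal(d)
    state = A * state + B * x            # same-size state, every step

print(len(kv_cache))                     # 100 -- grows with sequence length
print(state.shape)                       # (4,) -- constant, however long the text
```

After 10 tokens or 10,000, `state` is still a single `(4,)` array, while `kv_cache` keeps one entry per token, which is the whole point of the card-versus-notebook analogy.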
The Paper's Breakthrough:
Previous versions of Mamba-2 relied on those custom NVIDIA wrenches to keep the "Magic Summary" card updating quickly. The author proved that if you organize the math correctly (using "static masks" instead of "dynamic loops"), a standard compiler can update that card just as fast, without needing the custom hardware.
The Three "Secret Ingredients"
The author didn't just remove the custom tools; they had to rearrange the kitchen to make the standard tools work efficiently. Here are the three tricks used:
Chunking (The Assembly Line):
Instead of processing words one by one (which is slow), the AI processes them in small groups (chunks) of 256 words at a time. It's like a factory assembly line where 256 cars are painted simultaneously, rather than one by one.
Static Masks (The Traffic Light):
In AI, you often need to say, "Only look at the words before this one, not the ones after."
- Old way: "Stop! Check if this is the right word. Stop! Check again." (This breaks the flow).
- New way: A pre-printed traffic light map that says "Green for these, Red for those." The compiler sees the map and builds the whole road at once without stopping to check.
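The chunking and static-mask tricks can be sketched together in a toy example. Below, a pre-built lower-triangular mask (the "traffic light map") replaces any per-token checking, and tokens are processed a chunk at a time with a small fixed-size carry passed between chunks. This is a minimal sketch under invented toy sizes, not the paper's XLA implementation, and the running-sum "model" stands in for the real state-space math.

```python
import numpy as np

seq_len, chunk = 8, 4        # toy sizes; the paper uses chunks of 256 tokens
x = np.arange(seq_len, dtype=float)

# Static (precomputed) lower-triangular mask: token i may "see" tokens <= i.
# Because it is a fixed tensor, not a per-token branch, the compiler sees
# one big matrix multiply with no data-dependent control flow to stop for.
mask = np.tril(np.ones((chunk, chunk)))

out = np.empty(seq_len)
carry = 0.0                              # fixed-size summary passed forward
for start in range(0, seq_len, chunk):   # process tokens a chunk at a time
    block = x[start:start + chunk]
    out[start:start + chunk] = carry + mask @ block  # causal within the chunk
    carry += block.sum()                 # update the carry for the next chunk

print(out)   # equals a plain running (prefix) sum over all 8 tokens
```

The result is identical to scanning the tokens one by one, but the work inside each chunk is a single dense, maskable operation, exactly the kind of shape a compiler can optimize aggressively.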
The "On-Device" Loop (The Internal Monologue):
Usually, when an AI generates text, the accelerator has to check back with the main processor (the host) for every single word it produces, and wait for instructions before continuing. This is like a chef asking the owner for permission before every single chop of an onion.
The author made the AI think entirely inside the machine. The chef chops, cooks, and plates the whole meal without ever asking the owner. This eliminates the "wait time" between words.
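The effect of keeping the loop on the device can be simulated in plain Python by counting launches. Both versions below produce the same tokens; the only difference is how many times the host is consulted. Everything here is a made-up stand-in (the "model" is a toy state update, and the launch counter simulates what a compiled on-device loop achieves on real hardware).

```python
import numpy as np

host_calls = 0   # simulated count of host <-> accelerator round trips

def device_step(state, token):
    """One toy decode step 'on the device': fixed-size state update,
    then emit a fake next-token id. Illustrative only."""
    state = 0.9 * state + 0.1 * token
    return state, (int(state.sum() * 10) % 50) + 1

def decode_host_driven(n_tokens):
    """Host launches one device call per token: n round trips."""
    global host_calls
    state, token, out = np.zeros(4), 1.0, []
    for _ in range(n_tokens):
        host_calls += 1                  # one launch per generated word
        state, token = device_step(state, token)
        out.append(token)
    return out

def decode_fused(n_tokens):
    """The whole loop runs inside one launch (the 'chef never asks'):
    a single round trip covers the entire sequence."""
    global host_calls
    host_calls += 1                      # one launch, total
    state, token, out = np.zeros(4), 1.0, []
    for _ in range(n_tokens):
        state, token = device_step(state, token)
        out.append(token)
    return out

host_calls = 0
a = decode_host_driven(32)
calls_a = host_calls

host_calls = 0
b = decode_fused(32)
calls_b = host_calls

print(a == b, calls_a, calls_b)   # True 32 1 -- same words, 32 launches vs. 1
```

The tokens are identical either way; what the on-device loop removes is the per-word waiting, which is pure overhead when each word is cheap to compute.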
The Results: Fast, Portable, and Accurate
The paper tested this new approach on Google's super-fast TPU chips and NVIDIA GPUs.
- Speed: It reached about 64% of the hardware's maximum memory read/write speed (bandwidth utilization). That is remarkably efficient for a system that doesn't use custom hardware tricks.
- Portability: The exact same code ran on a standard laptop CPU, a high-end GPU, and a cloud TPU. No changes needed.
- Accuracy: The AI wrote the exact same words as the original, custom-built version. The "Magic Summary" card was updated perfectly.
The Bottom Line
This paper is a victory for openness and flexibility.
It proves that you don't need to be tied to one specific hardware company (NVIDIA) to run the most advanced AI models. By using smart software engineering and letting the compiler do the heavy lifting, we can run these models anywhere, faster, and cheaper.
In short: The author took a high-performance race car that only worked on a specific track, and tuned the engine so it can race on any track, just as fast, using standard parts.