DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone

The paper introduces DiffuMamba, a high-throughput diffusion language model built on a bidirectional Mamba backbone (plus a hybrid variant). It matches Transformer-level quality while significantly improving inference efficiency and scaling linearly with sequence length.

Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Eugene Belilovsky, Torsten Scholak

Published 2026-03-02

Imagine you are trying to write a novel, but you have two very different ways of doing it.

The Old Way: The "Autoregressive" Writer (The Slow Scribe)

Most AI models today work like a very careful scribe writing a story one word at a time. They look at everything they've written so far, think hard, and then write one single word. Then they stop, look at the whole story again, think, and write the next word.

  • The Problem: If you want a 100-page story (roughly 50,000 words), this scribe has to stop and think 50,000 times. It's slow, and the longer the story gets, the more "memory" they need to keep track of the beginning, like a desk piling up with sticky notes (in real models, this growing pile of notes is the attention cache). Eventually, the desk gets so cluttered that they can't move anymore.
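The one-word-at-a-time loop above can be sketched in a few lines of toy Python. This is purely illustrative (the model here is a fake stand-in, not anything from the paper): the point is that every step re-reads the whole growing context, and the context grows by one token per step.

```python
# Toy sketch of autoregressive decoding (illustrative, not the paper's code).
# fake_model is a hypothetical stand-in for a real language model.

def fake_model(context):
    # Pretend "model": just emits a token derived from how much it has read.
    return f"w{len(context)}"

def generate_autoregressive(prompt, num_tokens):
    context = list(prompt)                # the scribe's growing pile of notes
    for _ in range(num_tokens):
        next_token = fake_model(context)  # re-reads everything written so far
        context.append(next_token)        # memory grows by one every step
    return context

story = generate_autoregressive(["Once", "upon"], 3)
# story is ["Once", "upon", "w2", "w3", "w4"]
```

Note how the `context` list only ever grows: that is the "cluttered desk" in the analogy.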

The New Way: The "Diffusion" Artist (The Sculptor)

A newer type of AI, called a Diffusion Model, works differently. Imagine a sculptor who starts with a block of marble that is completely covered in fog (or "noise"). Instead of carving one word at a time, they look at the entire foggy block and try to guess what the whole statue should look like.

  • The Process: They wipe away some fog, guess the shape, wipe away more fog, and refine the whole thing simultaneously. They do this in many steps until the statue is clear.
  • The Benefit: They can fix mistakes easily (if a word is wrong, they just "re-fog" that spot and try again) and they can write many words at once.
  • The Catch: To do this, the sculptor usually uses a very complex, expensive tool (called a Transformer) that has to look at every single word in the foggy block and compare it to every other word. As the block gets bigger, this tool gets incredibly slow and expensive to run: doubling the length roughly quadruples the work (quadratic scaling).
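The fog-wiping process above can be sketched as a toy masked-diffusion loop. Everything here is an illustrative assumption (the denoiser is a fake stand-in, and real models choose which guesses to keep by confidence, not position): the key idea is that every step proposes words for the *whole* sequence at once, then keeps some and re-masks the rest.

```python
# Toy sketch of masked-diffusion text generation (illustrative only).
MASK = "_"

def fake_denoiser(seq):
    # Hypothetical stand-in: proposes a word for every masked slot at once.
    return [f"w{i}" if tok == MASK else tok for i, tok in enumerate(seq)]

def generate_diffusion(length, steps):
    seq = [MASK] * length                  # the fully "foggy" block
    for step in range(steps):
        proposal = fake_denoiser(seq)      # guess the whole statue at once
        # keep a growing fraction of the guesses; the rest stay "foggy"
        keep = int(length * (step + 1) / steps)
        for i in range(length):
            seq[i] = proposal[i] if i < keep else seq[i]
    return seq

text = generate_diffusion(length=4, steps=2)
# after the final step, no masks remain
```

Unlike the scribe, this loop runs a fixed number of refinement steps regardless of length, which is why diffusion models can emit many words per step.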

The Innovation: DiffuMamba (The High-Speed Sculptor)

The paper introduces DiffuMamba, which is like giving that Diffusion sculptor a brand new, super-fast tool called Mamba.

1. The Problem with the Old Tool

The old tool (Transformer) is like a librarian who has to run back and forth across a massive library to check if every book relates to every other book. If the library has 1,000 books, that's a lot of running. If it has 100,000 books, the librarian collapses from exhaustion. This is why current Diffusion models are slow on long texts.

2. The Mamba Solution

Mamba is a different kind of tool. Instead of running back and forth to compare everything, it's like a conveyor belt. It reads the story from left to right, and then right to left, keeping a running summary in its "head" as it goes.

  • The Analogy: Imagine reading a long email. The Transformer tries to remember every sentence and compare it to every other sentence. Mamba just reads the email, updates its understanding of the main point as it goes, and moves on. It doesn't need to re-read the beginning to understand the end.
  • The Result: This makes the process linear. Whether the story is 10 words or 100,000 words, the time it takes to process it grows steadily (linearly with length), not explosively (quadratically, as with attention).
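The "conveyor belt" idea can be sketched as a simple recurrent scan. This is an illustrative caricature of a state-space update, not the actual Mamba kernel (real Mamba uses learned, input-dependent state updates over vectors): the point is that a fixed-size state is updated once per token, so time is linear in length and memory does not grow at all.

```python
# Toy sketch of a linear "running summary" scan (not the real Mamba kernel).
def linear_scan(tokens, decay=0.9):
    state = 0.0                    # fixed-size "head" summary, never grows
    outputs = []
    for x in tokens:               # one pass, left to right
        state = decay * state + x  # fold the new token into the summary
        outputs.append(state)      # each output depends only on the state
    return outputs

ys = linear_scan([1.0, 1.0], decay=0.5)
# ys == [1.0, 1.5]: each step is one cheap update, no re-reading
```

A bidirectional variant, as in the paper, would simply run a second scan right-to-left and combine the two, still in linear time.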

3. The Hybrid Approach (DiffuMamba-H)

The researchers also tried a "Hybrid" version. They realized that while the conveyor belt (Mamba) is fast, sometimes you really need the librarian to double-check a specific detail. So, they built a system that uses the fast conveyor belt for most of the work, but stops every few steps to let the librarian do a quick, precise check. This gives you the best of both worlds: speed and high accuracy.
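The interleaving described above can be sketched as a layer-stack recipe. The ratio and layer count here are illustrative assumptions, not the paper's exact configuration: the idea is simply that most layers are fast Mamba-style layers, with an occasional attention layer slotted in for precise, all-pairs checks.

```python
# Toy sketch of a hybrid layer stack (ratio is an assumption, not the
# paper's exact recipe): mostly Mamba layers, attention every few layers.
def build_hybrid_stack(num_layers, attention_every=4):
    layers = []
    for i in range(num_layers):
        if (i + 1) % attention_every == 0:
            layers.append("attention")  # the librarian's precise check
        else:
            layers.append("mamba")      # the fast conveyor belt
    return layers

stack = build_hybrid_stack(8)
# stack == ["mamba", "mamba", "mamba", "attention"] * 2
```

Because only a few layers pay attention's quadratic cost, the stack keeps most of Mamba's speed while recovering attention's precision where it matters.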


What Did They Find? (The Race Results)

The researchers built these new models and raced them against the old ones. Here is what happened:

  • Quality: The new models wrote just as well (or even better) than the old ones. They didn't lose any intelligence by switching tools.
  • Speed: This is where it got crazy.
    • On short stories, they were roughly equal.
    • On long stories (like 65,000 words), the new DiffuMamba model was 8 times faster than the old one.
    • The Hybrid model was 4 times faster.
  • Memory: The old models needed a huge amount of computer memory to hold their "notes" as the story got longer. The new models kept their memory usage low and steady, like a runner who doesn't need to carry a backpack that gets heavier with every mile.

The Big Picture

Think of this as upgrading from a horse-drawn carriage (the old Transformer-based Diffusion) to a high-speed train (Mamba-based Diffusion).

  • The carriage is great for short trips, but for cross-country travel, it's slow and the horses get tired.
  • The train moves at a constant, fast speed regardless of the distance.

Why does this matter?
Currently, AI struggles with very long tasks (like summarizing a whole book or writing a complex legal contract) because it gets too slow and expensive. DiffuMamba shows that we can build AI that handles massive amounts of text quickly and efficiently, opening the door for AI to be used in real-time, long-form applications that were previously impossible.

In a nutshell: They swapped the "slow, heavy" brain of the AI for a "fast, efficient" one, allowing it to write long stories without getting tired or running out of memory.
