DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone

The paper introduces DiffuMamba, a high-throughput diffusion language model built on a bidirectional Mamba backbone (plus a hybrid variant). It matches Transformer-level quality while significantly improving inference efficiency and scaling linearly with sequence length.

Vaibhav Singh, Oleksiy Ostapenko, Pierre-André Noël, Eugene Belilovsky, Torsten Scholak

Published 2026-03-02

Imagine you are trying to write a novel, but you have two very different ways of doing it.

The Old Way: The "Autoregressive" Writer (The Slow Scribe)

Most AI models today work like a very careful scribe writing a story one word at a time. They look at everything they've written so far, think hard, and then write one single word. Then they stop, look at the whole story again, think, and write the next word.

  • The Problem: If you want a 100-page story (roughly 50,000 words), this scribe has to stop and think 50,000 times. It's slow, and the longer the story gets, the more "memory" they need to keep track of the beginning, like a desk piling up with sticky notes (in real models, this growing pile of notes is the attention cache). Eventually, the desk gets so cluttered that they can't move anymore.
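The one-word-at-a-time loop above can be sketched in a few lines of toy Python. This is purely illustrative (the model here is a fake stand-in, not anything from the paper): the point is that every step re-reads the whole growing context, and the context grows by one token per step.

```python
# Toy sketch of autoregressive decoding (illustrative, not the paper's code).
# fake_model is a hypothetical stand-in for a real language model.

def fake_model(context):
    # Pretend "model": just emits a token derived from how much it has read.
    return f"w{len(context)}"

def generate_autoregressive(prompt, num_tokens):
    context = list(prompt)                # the scribe's growing pile of notes
    for _ in range(num_tokens):
        next_token = fake_model(context)  # re-reads everything written so far
        context.append(next_token)        # memory grows by one every step
    return context

story = generate_autoregressive(["Once", "upon"], 3)
# story is ["Once", "upon", "w2", "w3", "w4"]
```

Note how the `context` list only ever grows: that is the "cluttered desk" in the analogy.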

The New Way: The "Diffusion" Artist (The Sculptor)

A newer type of AI, called a Diffusion Model, works differently. Imagine a sculptor who starts with a block of marble that is completely covered in fog (or "noise"). Instead of carving one word at a time, they look at the entire foggy block and try to guess what the whole statue should look like.

  • The Process: They wipe away some fog, guess the shape, wipe away more fog, and refine the whole thing simultaneously. They do this in many steps until the statue is clear.
  • The Benefit: They can fix mistakes easily (if a word is wrong, they just "re-fog" that spot and try again) and they can write many words at once.
  • The Catch: To do this, the sculptor usually uses a very complex, expensive tool (called a Transformer) that has to look at every single word in the foggy block and compare it to every other word. As the block gets bigger, this tool gets incredibly slow and expensive to run: doubling the length roughly quadruples the work (quadratic scaling).
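The fog-wiping process above can be sketched as a toy masked-diffusion loop. Everything here is an illustrative assumption (the denoiser is a fake stand-in, and real models choose which guesses to keep by confidence, not position): the key idea is that every step proposes words for the *whole* sequence at once, then keeps some and re-masks the rest.

```python
# Toy sketch of masked-diffusion text generation (illustrative only).
MASK = "_"

def fake_denoiser(seq):
    # Hypothetical stand-in: proposes a word for every masked slot at once.
    return [f"w{i}" if tok == MASK else tok for i, tok in enumerate(seq)]

def generate_diffusion(length, steps):
    seq = [MASK] * length                  # the fully "foggy" block
    for step in range(steps):
        proposal = fake_denoiser(seq)      # guess the whole statue at once
        # keep a growing fraction of the guesses; the rest stay "foggy"
        keep = int(length * (step + 1) / steps)
        for i in range(length):
            seq[i] = proposal[i] if i < keep else seq[i]
    return seq

text = generate_diffusion(length=4, steps=2)
# after the final step, no masks remain
```

Unlike the scribe, this loop runs a fixed number of refinement steps regardless of length, which is why diffusion models can emit many words per step.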

The Innovation: DiffuMamba (The High-Speed Sculptor)

The paper introduces DiffuMamba, which is like giving that Diffusion sculptor a brand new, super-fast tool called Mamba.

1. The Problem with the Old Tool

The old tool (Transformer) is like a librarian who has to run back and forth across a massive library to check if every book relates to every other book. If the library has 1,000 books, that's a lot of running. If it has 100,000 books, the librarian collapses from exhaustion. This is why current Diffusion models are slow on long texts.

2. The Mamba Solution

Mamba is a different kind of tool. Instead of running back and forth to compare everything, it's like a conveyor belt. It reads the story from left to right, and then right to left, keeping a running summary in its "head" as it goes.

  • The Analogy: Imagine reading a long email. The Transformer tries to remember every sentence and compare it to every other sentence. Mamba just reads the email, updates its understanding of the main point as it goes, and moves on. It doesn't need to re-read the beginning to understand the end.
  • The Result: This makes the process linear. Whether the story is 10 words or 100,000 words, the time it takes to process it grows steadily (linearly with length), not explosively (quadratically, as with attention).
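The "conveyor belt" idea can be sketched as a simple recurrent scan. This is an illustrative caricature of a state-space update, not the actual Mamba kernel (real Mamba uses learned, input-dependent state updates over vectors): the point is that a fixed-size state is updated once per token, so time is linear in length and memory does not grow at all.

```python
# Toy sketch of a linear "running summary" scan (not the real Mamba kernel).
def linear_scan(tokens, decay=0.9):
    state = 0.0                    # fixed-size "head" summary, never grows
    outputs = []
    for x in tokens:               # one pass, left to right
        state = decay * state + x  # fold the new token into the summary
        outputs.append(state)      # each output depends only on the state
    return outputs

ys = linear_scan([1.0, 1.0], decay=0.5)
# ys == [1.0, 1.5]: each step is one cheap update, no re-reading
```

A bidirectional variant, as in the paper, would simply run a second scan right-to-left and combine the two, still in linear time.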

3. The Hybrid Approach (DiffuMamba-H)

The researchers also tried a "Hybrid" version. They realized that while the conveyor belt (Mamba) is fast, sometimes you really need the librarian to double-check a specific detail. So, they built a system that uses the fast conveyor belt for most of the work, but stops every few steps to let the librarian do a quick, precise check. This gives you the best of both worlds: speed and high accuracy.
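The interleaving described above can be sketched as a layer-stack recipe. The ratio and layer count here are illustrative assumptions, not the paper's exact configuration: the idea is simply that most layers are fast Mamba-style layers, with an occasional attention layer slotted in for precise, all-pairs checks.

```python
# Toy sketch of a hybrid layer stack (ratio is an assumption, not the
# paper's exact recipe): mostly Mamba layers, attention every few layers.
def build_hybrid_stack(num_layers, attention_every=4):
    layers = []
    for i in range(num_layers):
        if (i + 1) % attention_every == 0:
            layers.append("attention")  # the librarian's precise check
        else:
            layers.append("mamba")      # the fast conveyor belt
    return layers

stack = build_hybrid_stack(8)
# stack == ["mamba", "mamba", "mamba", "attention"] * 2
```

Because only a few layers pay attention's quadratic cost, the stack keeps most of Mamba's speed while recovering attention's precision where it matters.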


What Did They Find? (The Race Results)

The researchers built these new models and raced them against the old ones. Here is what happened:

  • Quality: The new models wrote just as well (or even better) than the old ones. They didn't lose any intelligence by switching tools.
  • Speed: This is where it got crazy.
    • On short stories, they were roughly equal.
    • On long stories (like 65,000 words), the new DiffuMamba model was 8 times faster than the old one.
    • The Hybrid model was 4 times faster.
  • Memory: The old models needed a huge amount of computer memory to hold their "notes" as the story got longer. The new models kept their memory usage low and steady, like a runner who doesn't need to carry a backpack that gets heavier with every mile.

The Big Picture

Think of this as upgrading from a horse-drawn carriage (the old Transformer-based Diffusion) to a high-speed train (Mamba-based Diffusion).

  • The carriage is great for short trips, but for cross-country travel, it's slow and the horses get tired.
  • The train moves at a constant, fast speed regardless of the distance.

Why does this matter?
Currently, AI struggles with very long tasks (like summarizing a whole book or writing a complex legal contract) because it gets too slow and expensive. DiffuMamba shows that we can build AI that handles massive amounts of text quickly and efficiently, opening the door for AI to be used in real-time, long-form applications that were previously impossible.

In a nutshell: They swapped the "slow, heavy" brain of the AI for a "fast, efficient" one, allowing it to write long stories without getting tired or running out of memory.
