The Big Problem: The "Perfect Memory" vs. The "Fast Worker"
Imagine you are trying to build a super-smart AI assistant. You have two main types of workers to choose from:
The Transformers (The "Photographers"): These are the current champions (like the brains behind ChatGPT). They are amazing at looking at a whole photo of a scene at once and understanding how everything connects. Because they take in the entire photo simultaneously, they are also fast to train.
- The Flaw: They have a "short-term memory" limit. If you show them a 100-page book, they struggle to remember the details from page 1 when they are reading page 100. They also get very slow and expensive as the book gets longer.
The Linear RNNs (The "Speed Readers"): These are newer, faster workers. They read one word at a time and update a tiny "notepad" in their head. They are incredibly fast and memory-efficient, even for massive books.
- The Flaw: Their notepad is too small and rigid. They are great at simple tasks but terrible at complex logic, like tracking who owns which item in a story or debugging code. They are like a speed reader who can't do math.
The Goal: The researchers wanted to build a worker that has the speed of the Speed Reader but the brainpower of the Photographer.
The Solution: M2RNN (The "Matrix-Valued" Worker)
The paper introduces M2RNN (Matrix-to-Matrix Recurrent Neural Network). Here is how it works, using a simple analogy:
1. The Old Way: A Single Sticky Note
Traditional RNNs (the old speed readers) keep their memory on a single sticky note.
- The Problem: If you try to write a whole story on one sticky note, it gets messy. You run out of space, and you have to erase old info to write new info. This is why they fail at complex tasks like tracking entities (e.g., "Who is the owner of the red car?").
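The "single sticky note" can be sketched as a toy vector-state recurrence. This is a hypothetical illustration of the analogy (names like `sticky_note_step` and the fixed decay are my assumptions, not the paper's recurrence): the entire memory is one small vector, so every new fact partially overwrites every old one.

```python
import numpy as np

def sticky_note_step(h, x, decay=0.9):
    """One step of a toy vector-state RNN: the whole memory is a single
    d-dimensional vector, so writing a new fact fades all older facts."""
    return decay * h + x

d = 4
h = np.zeros(d)
for x in np.eye(d):          # write four distinct "facts", one per step
    h = sticky_note_step(h, x)
# The earliest fact has decayed the most; keep feeding tokens and it
# eventually vanishes from the note entirely.
```

With only `d` slots of storage, there is no way to keep many independent relationships (e.g. several owner/item pairs) separate for long.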
2. The New Way: A Filing Cabinet (The Matrix)
M2RNN changes the game. Instead of a single sticky note, it gives the worker a filing cabinet (a matrix) to store its memory.
- The Magic: It uses a special technique called an "outer product." Instead of writing one word on a sticky note, imagine stamping an entire page of information into the filing cabinet in a single step.
- The Result: The worker can store way more information without getting confused. It can track complex relationships (like a chess game or a code execution) that the old sticky-note workers simply couldn't handle.
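The filing-cabinet idea above can be made concrete with a minimal outer-product key–value memory. This is a sketch of the general technique, not the paper's full update rule; the `write`/`read` names and the orthogonal keys are assumptions for illustration.

```python
import numpy as np

def write(S, k, v):
    """Stamp the outer product v k^T into the matrix memory S,
    filing value v under key k."""
    return S + np.outer(v, k)

def read(S, k):
    """Retrieve whatever was filed under key k."""
    return S @ k

d = 4
S = np.zeros((d, d))
k1, k2 = np.eye(d)[0], np.eye(d)[1]   # two orthogonal keys: no crosstalk
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([5.0, 6.0, 7.0, 8.0])
S = write(S, k1, v1)
S = write(S, k2, v2)
# Both facts now coexist in S, and each reads back cleanly under its key.
```

The payoff is capacity: a d-dimensional vector has d slots, while a d×d matrix has d² slots, which is what lets the model keep many relationships (owners, items, board positions) separate at once.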
3. The "Forget Gate" (The Smart Librarian)
Just like humans, AI needs to forget things to make room for new info. M2RNN has a "Forget Gate."
- The Analogy: Imagine a librarian who decides what to keep on the shelf and what to throw away.
- The Twist: In earlier models, the forget decision depended on what was already stored, so each step had to wait for the one before it. In M2RNN, the librarian looks only at the new book as it arrives, so every forget decision can be made up front. This makes the process faster and more efficient, allowing the system to run in parallel (like having many librarians working at once).
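Why an input-dependent forget gate enables parallelism can be shown in a few lines. In this sketch (my assumption mirroring the "new book" analogy, not the paper's exact parameterization), the gate `beta` for each step is known as soon as the token arrives, so the final memory also has a closed form whose terms can all be computed independently.

```python
import numpy as np

def gated_write(S, k, v, beta):
    """Matrix-memory update with a forget gate. beta (between 0 and 1)
    comes from the INCOMING token, not from the stored state, so every
    step's gate is known before any state is computed."""
    return beta * S + np.outer(v, k)

rng = np.random.default_rng(0)
T, d = 5, 3
ks = rng.standard_normal((T, d))
vs = rng.standard_normal((T, d))
betas = rng.uniform(0.5, 1.0, T)    # one gate value per incoming token

# Sequential evaluation, one step at a time.
S = np.zeros((d, d))
for k, v, b in zip(ks, vs, betas):
    S = gated_write(S, k, v, b)

# Closed form: each write, scaled by the product of all LATER gates.
# Every term is independent of the others -- i.e., parallel-friendly.
S_parallel = sum(np.prod(betas[t + 1:]) * np.outer(vs[t], ks[t])
                 for t in range(T))
```

Both routes produce the same memory; the second one is the shape that maps well onto GPUs.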
Why This Matters: The "Hybrid" Super-Worker
The researchers didn't just replace everything with M2RNN. They found that M2RNN is powerful but computationally "expensive" (it takes more energy to think).
So, they created a Hybrid Team:
- The Team: They built a model that uses the fast "Speed Reader" (Linear RNN) for 90% of the work, but swaps in one "Filing Cabinet" worker (M2RNN) for the hardest parts of the job.
- The Result:
- Better Memory: The model can remember details from the beginning of a 100-page book perfectly, even if it was trained on 10-page books.
- Better Logic: It gets much better at tasks requiring reasoning, like coding or tracking complex stories.
- Efficiency: Because they only use the "expensive" worker sparingly, the system stays fast.
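The hybrid team above can be pictured as a layer schedule. The ratio and placement here are illustrative assumptions (the text says roughly 90% cheap layers, but the paper's exact recipe is not given here), and the layer names are hypothetical.

```python
def hybrid_schedule(n_layers, matrix_every=10):
    """Mostly cheap linear-RNN layers, with an occasional expensive
    matrix-memory (M2RNN-style) layer swapped in."""
    return ["matrix-memory" if (i + 1) % matrix_every == 0 else "linear-rnn"
            for i in range(n_layers)]

layers = hybrid_schedule(20)
# 18 cheap layers and 2 expensive ones: ~90% of the stack stays fast.
```

The design choice is the usual one for hybrids: pay the extra compute only where the hard reasoning happens, and keep the bulk of the stack at speed-reader cost.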
The Real-World Wins
The paper tested this at two scales: a small model (410M parameters) and a large model (7 billion parameters).
- Language Modeling: It predicts the next word in a sentence better than almost any other non-Transformer model.
- The "Needle in a Haystack" Test: Imagine hiding a specific sentence in a 100-page document and asking the AI to find it.
- Old models often missed the needle.
- M2RNN hybrids found the needle almost perfectly, even in very long documents.
- Hardware Efficiency: The researchers wrote custom software "kernels" (specialized routines for the computer's graphics cards, or GPUs) to avoid wasted work. They fixed a problem where previous models threw away 75% of their computing power just on padding empty space.
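The "needle in a haystack" test described above can be sketched as a toy harness. This is an illustrative construction, not the paper's benchmark; `make_haystack` and the passcode sentence are invented for the example.

```python
import random

def make_haystack(needle, filler, n_sentences, seed=0):
    """Bury one target sentence at a random position among filler."""
    rng = random.Random(seed)
    doc = [filler] * n_sentences
    pos = rng.randrange(n_sentences)
    doc[pos] = needle
    return " ".join(doc), pos

doc, pos = make_haystack("The passcode is 7421.", "The sky is blue.", 1000)
# A model passes if, given `doc` and the question "What is the passcode?",
# it answers correctly no matter where `pos` falls in the document.
```

Sweeping `pos` across the document and the document length across (and beyond) the training context is what produces the pass/fail grid these tests are known for.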
Summary in One Sentence
M2RNN is a new type of AI brain that swaps a tiny, limited "sticky note" memory for a massive "filing cabinet" memory, allowing it to solve complex logic puzzles and remember long stories perfectly, while still being fast enough to run on standard computers.
It proves that you don't need to choose between being fast (like current efficient models) and being smart (like complex reasoning models); you can have both by mixing the right tools together.