MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

This paper presents MiniCPM-SALA, a 9B-parameter hybrid model that combines sparse and linear attention with a cost-effective continual-training framework. The result is efficient, high-performance long-context modeling up to 1M tokens, with significantly lower training cost and inference latency than comparable full-attention models.

MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Zhi Zheng, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

Published 2026-03-03

🧠 The Big Problem: The "Library" Bottleneck

Imagine a super-smart librarian (an AI model) who can read books and answer questions.

  • The Old Way (Standard AI): To answer a question about a 1,000-page book, the librarian has to keep a mental note of every single sentence they've read so far. If the book gets longer (say, 1 million pages), the librarian's brain (memory) explodes. They can't hold all the notes, and they get so tired trying to cross-reference everything that they stop working entirely. This is the "Memory Wall" that stops most AI from reading huge documents.
  • The Current Fixes:
    • Sparse Attention: The librarian tries to only remember the "important" sentences. But they still need a massive filing cabinet to store the rest of the book just in case they need it later. It saves some brainpower, but the filing cabinet is still too heavy.
    • Linear Attention: The librarian summarizes the whole book into a single, tiny note. This fits in their pocket, but they lose the details. They can't find specific facts anymore.
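The filing-cabinet-versus-pocket-note trade-off can be made concrete with a toy calculation. This is an illustrative sketch, not the paper's code: the dimensions are arbitrary, and the linear-attention readout is unnormalized for simplicity.

```python
import numpy as np

# Toy illustration: full attention's KV cache grows with sequence length,
# while linear attention keeps one fixed-size state, whatever the length.
d, n = 64, 1000                      # head dimension and sequence length (arbitrary)
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Full (or sparse) attention: keys + values kept for every past token.
kv_cache_floats = 2 * n * d          # grows linearly in n

# Linear attention: one d x d state summarizes the entire prefix.
S = np.zeros((d, d))
outputs = []
for t in range(n):
    S += np.outer(k[t], v[t])        # fold token t into the running summary
    outputs.append(q[t] @ S)         # read out against the summary (unnormalized)

print(kv_cache_floats, S.size)       # 128000 vs 4096: ~31x smaller, and constant
```

The point of the sketch: grow `n` from 1,000 to 1,000,000 and `kv_cache_floats` grows a thousandfold, while `S.size` stays exactly 4096. That constant-size summary is also why linear attention loses fine detail, which is the gap the hybrid design fills.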

🚀 The Solution: MiniCPM-SALA (The "Hybrid Librarian")

The MiniCPM-SALA team built a new kind of librarian who uses a hybrid strategy. Think of it as a team of two librarians working together in one body:

  1. The "Detail Detective" (Sparse Attention - 25%): This part is like a magnifying glass. It zooms in on specific, important sentences to make sure the AI doesn't miss the fine print. It's very accurate but a bit slow and memory-heavy.
  2. The "Speed Reader" (Linear Attention - 75%): This part is like a super-fast scanner. It glides over the whole document, remembering the general flow and big ideas without getting bogged down in details. It's incredibly fast and uses almost no memory.

The Magic Ratio: The model uses the "Speed Reader" for 75% of the work (to keep things fast and light) and the "Detail Detective" for 25% (to ensure accuracy). This mix gives you the best of both worlds: speed without losing the plot.
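The 3-to-1 mix above can be pictured as a layer layout. The layer count and the exact interleaving pattern below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical hybrid stack: 3 linear-attention layers for every sparse one.
NUM_LAYERS = 32                      # illustrative depth, not the real model's

def layer_type(i: int) -> str:
    # One sparse ("Detail Detective") layer per block of four; the precise
    # placement pattern is an assumption for illustration.
    return "sparse" if (i + 1) % 4 == 0 else "linear"

layout = [layer_type(i) for i in range(NUM_LAYERS)]
print(layout.count("linear"), layout.count("sparse"))  # 24 8 -> 75% / 25%
```

Interleaving the sparse layers throughout the stack (rather than bunching them) lets every region of the network periodically "zoom in" on details the linear layers glossed over.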

🛠️ How They Built It: The "Renovation" Trick

Usually, to build a new type of AI, you have to start from scratch, which is like building a new house from the ground up. It takes years and costs a fortune.

The MiniCPM-SALA team used a clever renovation strategy:

  • They took an existing, high-quality AI (MiniCPM-4.0) that was already a great "house."
  • Instead of tearing it down, they simply renovated the rooms. They swapped out the standard "memory rooms" for their new "Hybrid rooms."
  • The Result: They turned a standard house into a super-efficient hybrid house for roughly 25% of the time and money it would have taken to build a new one from scratch, because the existing model's knowledge (its weights) carried over.
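One way to picture the renovation: keep every pretrained weight that isn't attention, swap only the attention modules, then run a short round of continual training. This is a schematic sketch; the class names, swap pattern, and structure are assumptions for illustration, not the team's implementation.

```python
# Schematic "renovation": reuse the pretrained house, replace only the
# attention rooms. All names here are illustrative, not the real code.

class FullAttention:   kind = "full"
class LinearAttention: kind = "linear"
class SparseAttention: kind = "sparse"

class Block:
    def __init__(self, attn, ffn_weights):
        self.attn = attn                 # the module to be renovated
        self.ffn_weights = ffn_weights   # stands in for all reused parameters

def retrofit(blocks, sparse_every=4):
    # Swap each attention module in place; everything else is untouched,
    # which is why only a short continual-training run is needed afterwards.
    for i, blk in enumerate(blocks):
        blk.attn = SparseAttention() if (i + 1) % sparse_every == 0 else LinearAttention()
    return blocks

model = [Block(FullAttention(), ffn_weights=f"ffn_{i}") for i in range(8)]
retrofit(model)
print([b.attn.kind for b in model])      # mostly "linear", every 4th "sparse"
print(all(b.ffn_weights == f"ffn_{i}" for i, b in enumerate(model)))  # True
```

The second printout is the crux of the cost savings: the "rooms" that weren't renovated keep their original, already-trained contents.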

🏆 What Can It Do? (The Superpowers)

Because of this new design, MiniCPM-SALA can do things that other similar-sized AIs simply cannot:

  1. Read the Entire Internet (Almost): It can handle context lengths of 1 million tokens. To put that in perspective, that's like reading a 3,000-page novel, a massive legal contract, or a whole code repository in one go.
  2. Run on a Regular Computer: Most AIs need a massive, expensive server to read that much text. MiniCPM-SALA is so efficient that it can run on a single consumer or workstation graphics card (like an NVIDIA RTX 5090 or A6000). Other models would crash (run out of memory) before they even finished reading the first chapter.
  3. Speed: At a context length of 256,000 tokens, it is 3.5 times faster than comparable full-attention models. It's the difference between waiting an hour for a report and getting it in roughly 17 minutes.
  4. Still Smart: Despite being so fast and efficient, it didn't lose its brainpower. It still scores highly on math, coding, and general knowledge tests, just like the big, slow models.

🌟 The Bottom Line

MiniCPM-SALA is like upgrading a car engine to be both a sports car (fast) and a truck (strong/capable). It solves the problem of "too much data" by mixing two different technologies, allowing us to run powerful AI on cheaper hardware and process massive amounts of information without the system crashing.

It proves you don't need a billion-dollar supercomputer to read a million-page book anymore; you just need the right kind of librarian.
