MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

This paper presents MiniCPM-SALA, a 9B-parameter hybrid model that combines sparse and linear attention with a cost-effective continual-training framework. The result is efficient, high-performance long-context modeling up to 1M tokens, with significantly lower training cost and inference latency than comparable full-attention models.

MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Zhi Zheng, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

Published 2026-03-03

🧠 The Big Problem: The "Library" Bottleneck

Imagine a super-smart librarian (an AI model) who can read books and answer questions.

  • The Old Way (Standard AI): To answer a question about a 1,000-page book, the librarian has to keep a mental note of every single sentence they've read so far. If the book gets longer (say, 1 million pages), the librarian's brain (memory) explodes. They can't hold all the notes, and they get so tired trying to cross-reference everything that they stop working entirely. This is the "Memory Wall" that stops most AI from reading huge documents.
  • The Current Fixes:
    • Sparse Attention: The librarian tries to only remember the "important" sentences. But they still need a massive filing cabinet to store the rest of the book just in case they need it later. It saves some brainpower, but the filing cabinet is still too heavy.
    • Linear Attention: The librarian summarizes the whole book into a single, tiny note. This fits in their pocket, but they lose the details. They can't find specific facts anymore.
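The filing-cabinet-versus-pocket-note trade-off can be made concrete with a toy calculation. This is an illustrative sketch, not the paper's code: the dimensions are arbitrary, and the linear-attention readout is unnormalized for simplicity.

```python
import numpy as np

# Toy illustration: full attention's KV cache grows with sequence length,
# while linear attention keeps one fixed-size state, whatever the length.
d, n = 64, 1000                      # head dimension and sequence length (arbitrary)
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

# Full (or sparse) attention: keys + values kept for every past token.
kv_cache_floats = 2 * n * d          # grows linearly in n

# Linear attention: one d x d state summarizes the entire prefix.
S = np.zeros((d, d))
outputs = []
for t in range(n):
    S += np.outer(k[t], v[t])        # fold token t into the running summary
    outputs.append(q[t] @ S)         # read out against the summary (unnormalized)

print(kv_cache_floats, S.size)       # 128000 vs 4096: ~31x smaller, and constant
```

The point of the sketch: grow `n` from 1,000 to 1,000,000 and `kv_cache_floats` grows a thousandfold, while `S.size` stays exactly 4096. That constant-size summary is also why linear attention loses fine detail, which is the gap the hybrid design fills.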

🚀 The Solution: MiniCPM-SALA (The "Hybrid Librarian")

The MiniCPM-SALA team built a new kind of librarian who uses a hybrid strategy. Think of it as a team of two librarians working together in one body:

  1. The "Detail Detective" (Sparse Attention - 25%): This part is like a magnifying glass. It zooms in on specific, important sentences to make sure the AI doesn't miss the fine print. It's very accurate but a bit slow and memory-heavy.
  2. The "Speed Reader" (Linear Attention - 75%): This part is like a super-fast scanner. It glides over the whole document, remembering the general flow and big ideas without getting bogged down in details. It's incredibly fast and uses almost no memory.

The Magic Ratio: The model uses the "Speed Reader" for 75% of the work (to keep things fast and light) and the "Detail Detective" for 25% (to ensure accuracy). This mix gives you the best of both worlds: speed without losing the plot.
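The 3-to-1 mix above can be pictured as a layer layout. The layer count and the exact interleaving pattern below are illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical hybrid stack: 3 linear-attention layers for every sparse one.
NUM_LAYERS = 32                      # illustrative depth, not the real model's

def layer_type(i: int) -> str:
    # One sparse ("Detail Detective") layer per block of four; the precise
    # placement pattern is an assumption for illustration.
    return "sparse" if (i + 1) % 4 == 0 else "linear"

layout = [layer_type(i) for i in range(NUM_LAYERS)]
print(layout.count("linear"), layout.count("sparse"))  # 24 8 -> 75% / 25%
```

Interleaving the sparse layers throughout the stack (rather than bunching them) lets every region of the network periodically "zoom in" on details the linear layers glossed over.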

🛠️ How They Built It: The "Renovation" Trick

Usually, to build a new type of AI, you have to start from scratch, which is like building a new house from the ground up. It takes years and costs a fortune.

The MiniCPM-SALA team used a clever renovation strategy:

  • They took an existing, high-quality AI (MiniCPM-4.0) that was already a great "house."
  • Instead of tearing it down, they simply renovated the rooms. They swapped out the standard "memory rooms" for their new "Hybrid rooms."
  • The Result: They turned a standard house into a super-efficient hybrid house for roughly 25% of the time and money it would have taken to build a new one from scratch, because the existing model's knowledge (its weights) carried over.
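One way to picture the renovation: keep every pretrained weight that isn't attention, swap only the attention modules, then run a short round of continual training. This is a schematic sketch; the class names, swap pattern, and structure are assumptions for illustration, not the team's implementation.

```python
# Schematic "renovation": reuse the pretrained house, replace only the
# attention rooms. All names here are illustrative, not the real code.

class FullAttention:   kind = "full"
class LinearAttention: kind = "linear"
class SparseAttention: kind = "sparse"

class Block:
    def __init__(self, attn, ffn_weights):
        self.attn = attn                 # the module to be renovated
        self.ffn_weights = ffn_weights   # stands in for all reused parameters

def retrofit(blocks, sparse_every=4):
    # Swap each attention module in place; everything else is untouched,
    # which is why only a short continual-training run is needed afterwards.
    for i, blk in enumerate(blocks):
        blk.attn = SparseAttention() if (i + 1) % sparse_every == 0 else LinearAttention()
    return blocks

model = [Block(FullAttention(), ffn_weights=f"ffn_{i}") for i in range(8)]
retrofit(model)
print([b.attn.kind for b in model])      # mostly "linear", every 4th "sparse"
print(all(b.ffn_weights == f"ffn_{i}" for i, b in enumerate(model)))  # True
```

The second printout is the crux of the cost savings: the "rooms" that weren't renovated keep their original, already-trained contents.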

🏆 What Can It Do? (The Superpowers)

Because of this new design, MiniCPM-SALA can do things that other similar-sized AIs simply cannot:

  1. Read the Entire Internet (Almost): It can handle context lengths of 1 million tokens. To put that in perspective, that's like reading a 3,000-page novel, a massive legal contract, or a whole code repository in one go.
  2. Run on a Regular Computer: Most AIs need a massive, expensive server to read that much text. MiniCPM-SALA is so efficient that it can run on a single consumer or workstation graphics card (like an NVIDIA RTX 5090 or A6000). Other models would crash (run out of memory) before they even finished reading the first chapter.
  3. Speed: At a context length of 256,000 tokens, it is 3.5 times faster than comparable full-attention models. It's the difference between waiting an hour for a report and getting it in roughly 17 minutes.
  4. Still Smart: Despite being so fast and efficient, it didn't lose its brainpower. It still scores highly on math, coding, and general knowledge tests, just like the big, slow models.

🌟 The Bottom Line

MiniCPM-SALA is like upgrading a car engine to be both a sports car (fast) and a truck (strong/capable). It solves the problem of "too much data" by mixing two different technologies, allowing us to run powerful AI on cheaper hardware and process massive amounts of information without the system crashing.

It proves you don't need a billion-dollar supercomputer to read a million-page book anymore; you just need the right kind of librarian.
