PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

This paper introduces PIM-SHERPA, a software-only method that resolves memory attribute and layout inconsistencies in product-level PIM-enabled systems to enable efficient on-device LLM inference, achieving significant memory capacity savings while maintaining near-theoretical performance.

Sunjung Lee, Sanghoon Cha, Hyeonsu Kim, Seungwoo Seo, Yuhwan Ro, Sukhan Lee, Byeongho Kim, Yongjun Park, Kyomin Sohn, Seungwon Lee, Jaehoon Yu

Published Wed, 11 Ma

Imagine you have a brilliant, super-smart assistant (the Large Language Model or LLM) living inside your smartphone. This assistant is great at two things:

  1. Reading a long story quickly (the Prefill phase).
  2. Chatting with you, one word at a time (the Decode phase).

To make this assistant super fast, the phone uses a special trick called PIM (Processing-In-Memory). Think of PIM as a magical library where the books (data) can be read and processed right on the shelf, so you don't have to walk all the way to the front desk (the main processor) every time. This is incredibly fast for chatting (Decode).
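To see why the two phases want different memory, here is a toy sketch (my illustration, not the paper's code; the function names and tiny shapes are assumptions). Prefill is matrix-matrix work where every weight row is reused across all prompt tokens, so a cache helps; decode is matrix-vector work where each weight is read only once per step, which is exactly where in-memory processing wins.

```python
# Toy sketch of why prefill and decode stress memory differently.
# Names and shapes are illustrative, not from the paper.

def prefill(weights, prompt_tokens):
    # Matrix-matrix: every weight row is reused for EVERY prompt token,
    # so keeping weights in cacheable memory (the "VIP Lounge") pays off.
    return [[sum(w * t for w, t in zip(row, tok)) for row in weights]
            for tok in prompt_tokens]

def decode_step(weights, token):
    # Matrix-vector: each weight is read exactly once per step, so a
    # cache has no reuse to exploit -- this is where PIM, processing
    # the data right on the shelf, wins.
    return [sum(w * t for w, t in zip(row, token)) for row in weights]

weights = [[1.0, 2.0], [3.0, 4.0]]
prompt = [[1.0, 0.0], [0.0, 1.0]]

print(prefill(weights, prompt))          # [[1.0, 3.0], [2.0, 4.0]]
print(decode_step(weights, [1.0, 1.0]))  # [3.0, 7.0]
```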

The Problem: The "Two-Faced" Assistant

Here is the catch: The assistant has a split personality, and the phone's memory system doesn't know how to handle it.

  • For Reading (Prefill): The assistant wants to sit in a VIP Lounge (the Cache). This is a fast, cozy spot where it can grab the same book over and over again without leaving the room. If the books aren't in the VIP Lounge, it gets slow.
  • For Chatting (Decode): The assistant needs to go to the Back Alley (the Non-Cacheable region). Why? Because the PIM magic only works if the assistant physically walks to the shelf to grab the book. If the book is already in the VIP Lounge, the assistant stays there, and the PIM magic never happens.

The Conflict:

  • If you put the books in the VIP Lounge, the chat is slow because PIM doesn't trigger.
  • If you put the books in the Back Alley, the reading is slow because the assistant can't reuse the books efficiently.

The Old Solution (The "Double Trouble"):
Previously, engineers tried to solve this by buying two copies of every book. One copy went to the VIP Lounge for reading, and one copy went to the Back Alley for chatting.

  • The Downside: This doubled the space needed. Your phone would run out of memory (RAM) instantly, forcing you to delete photos or apps just to run the AI.

The Solution: PIM-SHERPA

The researchers created PIM-SHERPA, a clever software method that solves this without buying extra books. Think of PIM-SHERPA as a super-efficient Concierge Service with two different strategies depending on how long your conversation is.

Strategy 1: The "Double Buffer" (DDB)

Best for: Long conversations where you are typing fast.

Imagine a conveyor belt system in a factory.

  1. While the worker (the processor) is busy building a toy (doing math) using the books in the VIP Lounge (Buffer A)...
  2. A helper (the copy thread) is simultaneously running to the Back Alley, grabbing the next set of books, and shuffling them into a second VIP Lounge (Buffer B).
  3. By the time the worker finishes the first toy, the second set of books is already waiting in Buffer B.

The Magic: The time it takes to run to the Back Alley is completely "hidden" because it happens while the worker is busy. You get the speed of the VIP Lounge for reading and the PIM speed for chatting, without needing two full libraries.
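The conveyor-belt idea above is classic double buffering. Here is a minimal sketch of the pattern (my own simplification, not PIM-SHERPA's implementation; `copy_tile` stands in for the non-cacheable-to-cacheable copy and `compute` for the matmul): a copy thread stages the next weight tile into the idle buffer while the main thread computes on the current one.

```python
# Minimal double-buffer sketch: overlap copying the NEXT tile with
# computing on the CURRENT tile. Illustrative only.
import threading

def copy_tile(src, dst):
    dst[:] = src  # stands in for the Back Alley -> VIP Lounge copy

def compute(tile):
    return sum(tile)  # stands in for the matmul on the cached tile

def run(tiles):
    results = []
    buffers = [list(tiles[0]), [0] * len(tiles[0])]  # Buffer A and B
    for i, tile in enumerate(tiles):
        cur = buffers[i % 2]        # buffer the worker computes from
        nxt = buffers[(i + 1) % 2]  # buffer the helper fills
        t = None
        if i + 1 < len(tiles):
            # Helper thread stages the next tile in parallel...
            t = threading.Thread(target=copy_tile, args=(tiles[i + 1], nxt))
            t.start()
        # ...so this compute hides the copy latency.
        results.append(compute(cur))
        if t:
            t.join()
    return results

print(run([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

The key property is that the copy's wall-clock cost disappears as long as it finishes before the current compute does.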

Strategy 2: The "Just-in-Time" Delivery (OWR)

Best for: Very long, complex inputs (like pasting a whole novel into the chat).

Imagine you are cooking a massive meal. Instead of preparing everything at once, you just grab the ingredients you need right before you start chopping.

  1. The system waits until the very last second before the math starts.
  2. It quickly grabs the specific books needed from the Back Alley, shuffles them into the VIP Lounge, and starts cooking.
  3. Because the "cooking" (math) takes so long for huge inputs, the few seconds it takes to grab the ingredients don't matter much.

The Magic: This is simpler to build and works perfectly when the task is huge.
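A sketch of the just-in-time idea (again my illustration, with hypothetical names, not the paper's API): copy each weight tile into a cacheable buffer right before using it, and note why the copy cost becomes negligible when the compute is huge.

```python
# Just-in-time copy sketch: stage weights on demand, immediately
# before the math. Names are illustrative assumptions.

def jit_matvec(weight_tiles, x):
    out = []
    for tile in weight_tiles:
        cached = list(tile)  # just-in-time copy into a cacheable buffer
        out.append(sum(w * v for w, v in zip(cached, x)))
    return out

def copy_overhead(copy_ms, compute_ms):
    # Fraction of total time spent copying; shrinks as compute grows.
    return copy_ms / (copy_ms + compute_ms)

print(jit_matvec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
print(round(copy_overhead(10, 100), 3))      # short input: 0.091
print(round(copy_overhead(10, 10000), 4))    # novel-sized input: 0.001
```

With a fixed copy cost and compute that grows with input length, the overhead fraction falls toward zero, which is why this simpler strategy suffices for very long inputs.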

Why This Matters

  1. Saves Space: PIM-SHERPA cuts the AI model's memory footprint by about 48%. Instead of keeping two copies of the model, you only need one. This means you can run smarter, more powerful AI models on your phone without deleting your photo gallery.
  2. No New Hardware: No hardware changes are needed. PIM-SHERPA is a software-only fix that makes the PIM-enabled memory your device already has work smarter.
  3. Speed: It keeps the chat fast (thanks to PIM) and the reading fast (thanks to the VIP Lounge), solving the "split personality" problem of the AI.

The Bottom Line

Before PIM-SHERPA, running smart AI on phones was like trying to fit a double-decker bus into a single-car garage—you had to cut the bus in half (use a weaker AI) or build a bigger garage (buy a new phone).

PIM-SHERPA is like a magical folding mechanism that lets the bus fit perfectly into the garage, using the space efficiently so you can drive the full-size bus right out of the box. It makes on-device AI faster, cheaper, and more practical for everyone.