PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

This paper introduces PIM-SHERPA, a software-only method that resolves memory attribute and layout inconsistencies in product-level PIM-enabled systems to enable efficient on-device LLM inference, achieving significant memory capacity savings while maintaining near-theoretical performance.

Sunjung Lee, Sanghoon Cha, Hyeonsu Kim, Seungwoo Seo, Yuhwan Ro, Sukhan Lee, Byeongho Kim, Yongjun Park, Kyomin Sohn, Seungwon Lee, Jaehoon Yu

Published Wed, 11 Ma

Imagine you have a brilliant, super-smart assistant (the Large Language Model or LLM) living inside your smartphone. This assistant is great at two things:

  1. Reading a long story quickly (the Prefill phase).
  2. Chatting with you, one word at a time (the Decode phase).

To make this assistant super fast, the phone uses a special trick called PIM (Processing-In-Memory). Think of PIM as a magical library where the books (data) can be read and processed right on the shelf, so you don't have to walk all the way to the front desk (the main processor) every time. This is incredibly fast for chatting (Decode).
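To see why the two phases want different memory, here is a toy sketch (my illustration, not the paper's code; the function names and tiny shapes are assumptions). Prefill is matrix-matrix work where every weight row is reused across all prompt tokens, so a cache helps; decode is matrix-vector work where each weight is read only once per step, which is exactly where in-memory processing wins.

```python
# Toy sketch of why prefill and decode stress memory differently.
# Names and shapes are illustrative, not from the paper.

def prefill(weights, prompt_tokens):
    # Matrix-matrix: every weight row is reused for EVERY prompt token,
    # so keeping weights in cacheable memory (the "VIP Lounge") pays off.
    return [[sum(w * t for w, t in zip(row, tok)) for row in weights]
            for tok in prompt_tokens]

def decode_step(weights, token):
    # Matrix-vector: each weight is read exactly once per step, so a
    # cache has no reuse to exploit -- this is where PIM, processing
    # the data right on the shelf, wins.
    return [sum(w * t for w, t in zip(row, token)) for row in weights]

weights = [[1.0, 2.0], [3.0, 4.0]]
prompt = [[1.0, 0.0], [0.0, 1.0]]

print(prefill(weights, prompt))          # [[1.0, 3.0], [2.0, 4.0]]
print(decode_step(weights, [1.0, 1.0]))  # [3.0, 7.0]
```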

The Problem: The "Two-Faced" Assistant

Here is the catch: The assistant has a split personality, and the phone's memory system doesn't know how to handle it.

  • For Reading (Prefill): The assistant wants to sit in a VIP Lounge (the Cache). This is a fast, cozy spot where it can grab the same book over and over again without leaving the room. If the books aren't in the VIP Lounge, it gets slow.
  • For Chatting (Decode): The assistant needs to go to the Back Alley (the Non-Cacheable region). Why? Because the PIM magic only works if the assistant physically walks to the shelf to grab the book. If the book is already in the VIP Lounge, the assistant stays there, and the PIM magic never happens.

The Conflict:

  • If you put the books in the VIP Lounge, the chat is slow because PIM doesn't trigger.
  • If you put the books in the Back Alley, the reading is slow because the assistant can't reuse the books efficiently.

The Old Solution (The "Double Trouble"):
Previously, engineers tried to solve this by buying two copies of every book. One copy went to the VIP Lounge for reading, and one copy went to the Back Alley for chatting.

  • The Downside: This doubled the space needed. Your phone would run out of memory (RAM) instantly, forcing you to delete photos or apps just to run the AI.

The Solution: PIM-SHERPA

The researchers created PIM-SHERPA, a clever software method that solves this without buying extra books. Think of PIM-SHERPA as a super-efficient Concierge Service with two different strategies depending on how long your conversation is.

Strategy 1: The "Double Buffer" (DDB)

Best for: Long conversations where you are typing fast.

Imagine a conveyor belt system in a factory.

  1. While the worker (the processor) is busy building a toy (doing math) using the books in the VIP Lounge (Buffer A)...
  2. A helper (the copy thread) is simultaneously running to the Back Alley, grabbing the next set of books, and shuffling them into a second VIP Lounge (Buffer B).
  3. By the time the worker finishes the first toy, the second set of books is already waiting in Buffer B.

The Magic: The time it takes to run to the Back Alley is completely "hidden" because it happens while the worker is busy. You get the speed of the VIP Lounge for reading and the PIM speed for chatting, without needing two full libraries.
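The conveyor-belt idea above is classic double buffering. Here is a minimal sketch of the pattern (my own simplification, not PIM-SHERPA's implementation; `copy_tile` stands in for the non-cacheable-to-cacheable copy and `compute` for the matmul): a copy thread stages the next weight tile into the idle buffer while the main thread computes on the current one.

```python
# Minimal double-buffer sketch: overlap copying the NEXT tile with
# computing on the CURRENT tile. Illustrative only.
import threading

def copy_tile(src, dst):
    dst[:] = src  # stands in for the Back Alley -> VIP Lounge copy

def compute(tile):
    return sum(tile)  # stands in for the matmul on the cached tile

def run(tiles):
    results = []
    buffers = [list(tiles[0]), [0] * len(tiles[0])]  # Buffer A and B
    for i, tile in enumerate(tiles):
        cur = buffers[i % 2]        # buffer the worker computes from
        nxt = buffers[(i + 1) % 2]  # buffer the helper fills
        t = None
        if i + 1 < len(tiles):
            # Helper thread stages the next tile in parallel...
            t = threading.Thread(target=copy_tile, args=(tiles[i + 1], nxt))
            t.start()
        # ...so this compute hides the copy latency.
        results.append(compute(cur))
        if t:
            t.join()
    return results

print(run([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```

The key property is that the copy's wall-clock cost disappears as long as it finishes before the current compute does.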

Strategy 2: The "Just-in-Time" Delivery (OWR)

Best for: Very long, complex inputs (like pasting a whole novel into the chat).

Imagine you are cooking a massive meal. Instead of preparing everything at once, you just grab the ingredients you need right before you start chopping.

  1. The system waits until the very last second before the math starts.
  2. It quickly grabs the specific books needed from the Back Alley, shuffles them into the VIP Lounge, and starts cooking.
  3. Because the "cooking" (math) takes so long for huge inputs, the few seconds it takes to grab the ingredients don't matter much.

The Magic: This is simpler to build and works perfectly when the task is huge.
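A sketch of the just-in-time idea (again my illustration, with hypothetical names, not the paper's API): copy each weight tile into a cacheable buffer right before using it, and note why the copy cost becomes negligible when the compute is huge.

```python
# Just-in-time copy sketch: stage weights on demand, immediately
# before the math. Names are illustrative assumptions.

def jit_matvec(weight_tiles, x):
    out = []
    for tile in weight_tiles:
        cached = list(tile)  # just-in-time copy into a cacheable buffer
        out.append(sum(w * v for w, v in zip(cached, x)))
    return out

def copy_overhead(copy_ms, compute_ms):
    # Fraction of total time spent copying; shrinks as compute grows.
    return copy_ms / (copy_ms + compute_ms)

print(jit_matvec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
print(round(copy_overhead(10, 100), 3))      # short input: 0.091
print(round(copy_overhead(10, 10000), 4))    # novel-sized input: 0.001
```

With a fixed copy cost and compute that grows with input length, the overhead fraction falls toward zero, which is why this simpler strategy suffices for very long inputs.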

Why This Matters

  1. Saves Space: PIM-SHERPA cuts the AI model's memory footprint by about 48%. Instead of keeping two copies of the model, you only need one. This means you can run smarter, more powerful AI models on your phone without deleting your photo gallery.
  2. No New Hardware: No hardware changes are needed. PIM-SHERPA is a software-only fix that makes the PIM-enabled memory your device already has work smarter.
  3. Speed: It keeps the chat fast (thanks to PIM) and the reading fast (thanks to the VIP Lounge), solving the "split personality" problem of the AI.

The Bottom Line

Before PIM-SHERPA, running smart AI on phones was like trying to fit a double-decker bus into a single-car garage—you had to cut the bus in half (use a weaker AI) or build a bigger garage (buy a new phone).

PIM-SHERPA is like a magical folding mechanism that lets the bus fit perfectly into the garage, using the space efficiently so you can drive the full-size bus right out of the box. It makes on-device AI faster, cheaper, and more practical for everyone.