Imagine you run a massive, high-speed library called an LSM-KVS (a Log-Structured Merge-tree Key-Value Store). Its job is to take in millions of books (data) thrown at it every second, organize them, and make them easy to find later.
In a traditional library, the books are stored right next to the librarians (the Compute Nodes). But in modern cloud computing, we've separated the librarians from the bookshelves to save money and scale up easily. The librarians sit in one building, and the books are stored in a giant, remote warehouse (Disaggregated Storage).
The Problem: The "Traffic Jam" at the Library Desk
Here's the catch: The librarians have very small desks (Memory). When people rush in to drop off new books, the librarians pile them on their desks first.
- The Desk is Too Small: Once the desk is full, the librarian has to stop taking new books and run to the warehouse to clear space.
- The Run is Slow: Running to the warehouse takes time. While they are running, the line of people waiting to drop off books gets longer and longer.
- The Warehouse is Slow: The warehouse is great for storage, but it's slow at organizing. So, the librarian has to run back and forth, sorting books manually before putting them on the shelves.
This causes the library to grind to a halt. The "write" speed (taking in books) becomes incredibly slow because the librarians are stuck running errands.
The Solution: O3-LSM (The Three-Layer Magic Trick)
The paper proposes O3-LSM, a new system that uses a Shared Memory Pool (a giant, super-fast intermediate storage area right next to the librarians) to fix these traffic jams. Think of this pool as a "Conveyor Belt System" that sits between the librarian's desk and the warehouse.
O3-LSM uses three layers of offloading (moving work away from the busy librarian) to solve the problem:
1. The "Smart Conveyor Belt" (DM-Optimized Memtable, a memtable built for Disaggregated Memory)
- The Old Way: When the librarian's desk fills up, they try to move the pile of books to the conveyor belt. But the books are tied together with complex strings (pointers). Moving them requires untying every string, re-tying them in the new location, and re-labeling them. This takes forever.
- The O3-LSM Way: They redesign the books! Instead of strings, they use a numbered list. The books are stacked in a neat, continuous row.
- The Analogy: Imagine moving a stack of papers. Instead of moving them one by one and re-tying them, you just slide the whole neat stack onto the conveyor belt. The conveyor belt knows exactly where the stack starts, so it can find any book instantly without re-tying anything.
- Result: Moving books to the conveyor belt is near-instant: the transfer becomes one bulk copy instead of a book-by-book re-tying job.
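The "numbered list instead of strings" idea can be sketched in code. This is a simplified, hypothetical illustration (class and method names are mine, not the paper's): entries live in one contiguous list and the sorted index holds integer offsets rather than pointers, so shipping the whole table to shared memory is a position-independent bulk copy.

```python
import bisect

class OffsetMemtable:
    """Illustrative sketch: entries stored contiguously, referenced by
    integer offsets, so migration needs no pointer rewriting."""

    def __init__(self):
        self.entries = []   # contiguous storage: (key, value) pairs, append-only
        self.keys = []      # sorted keys, parallel to self.index
        self.index = []     # offsets into self.entries, kept sorted by key

    def put(self, key, value):
        offset = len(self.entries)
        self.entries.append((key, value))
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            self.index[pos] = offset            # overwrite: point at newest entry
        else:
            self.keys.insert(pos, key)
            self.index.insert(pos, offset)

    def get(self, key):
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.entries[self.index[pos]][1]
        return None

    def export(self):
        # "Slide the whole stack onto the conveyor belt": the flat lists
        # contain no memory addresses, so they can be copied verbatim
        # into the shared pool and read there without any fix-ups.
        return list(self.entries), list(self.index)
```

A pointer-based skip list would have to rebuild every link after the move; here, `export` is just a copy.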
2. The "Team of Runners" (Collaborative Flush Offloading)
- The Old Way: The librarian who owns the books is the only one allowed to run them to the warehouse. If that librarian is busy or tired, the books sit there, and the line stops.
- The O3-LSM Way: They introduce a Dispatcher. When a pile of books is ready, the Dispatcher looks around and says, "Hey, Librarian #3 is free! Librarian #5 has a fast runner! Let's send the books to them to take to the warehouse."
- The Analogy: Instead of one person doing all the heavy lifting, the work is distributed among a whole team of runners. If one runner is busy, another picks up the load.
- Result: The books get to the warehouse much faster, and no single librarian gets overwhelmed.
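The Dispatcher's job can be sketched as a least-loaded scheduler. This is a minimal stand-in (names and the load metric are hypothetical, not the paper's actual policy): each flush job goes to whichever node currently has the least flush work queued, instead of being pinned to the node that owns the memtable.

```python
import heapq

class FlushDispatcher:
    """Illustrative sketch: assign each flush to the least-loaded node."""

    def __init__(self, node_ids):
        # min-heap of (pending_work, node_id); least-loaded node on top
        self.load = [(0, nid) for nid in node_ids]
        heapq.heapify(self.load)

    def dispatch(self, memtable_size):
        pending, nid = heapq.heappop(self.load)
        # charge this node for the flush it just picked up
        heapq.heappush(self.load, (pending + memtable_size, nid))
        return nid  # this node carries the memtable to remote storage
```

With three nodes, the first three flushes fan out across all of them, so no single "librarian" gets overwhelmed.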
3. The "Zip Code Sorter" (Shard-Level Optimization)
- The Old Way: The librarian sends huge, mixed-up piles of books to the warehouse. The warehouse has to sort them by "Zip Code" (key range) later, which is a nightmare if the piles are all mixed up.
- The O3-LSM Way: Before sending the books, they are pre-sorted into small, specific "Zip Code" buckets (Shards).
- The Analogy: Instead of dumping a giant box of mixed mail into the sorting facility, the librarian sorts the mail into 10 small envelopes, each labeled "New York," "Chicago," etc.
- Result: The warehouse can process these small envelopes in parallel. It's like having 10 sorting machines working at once instead of one. This prevents the warehouse from getting clogged up.
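The pre-sorting step can be sketched as partitioning a flush into disjoint key ranges. The shard boundaries and function names below are illustrative (the paper's actual shard scheme will differ); the point is that once each shard covers its own range, the shards can be written out concurrently with no cross-shard sorting.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical boundaries: keys < "g" go to shard 0, < "n" to shard 1, etc.
SHARD_BOUNDS = ["g", "n", "t"]

def shard_of(key):
    for i, bound in enumerate(SHARD_BOUNDS):
        if key < bound:
            return i
    return len(SHARD_BOUNDS)

def write_shard(item):
    shard_id, rows = item
    # placeholder for a real storage write; returns (shard, rows written)
    return (shard_id, len(rows))

def flush_sharded(entries):
    shards = {}
    for key, value in sorted(entries):
        shards.setdefault(shard_of(key), []).append((key, value))
    # each shard covers a disjoint key range, so the "warehouse" can
    # ingest them in parallel -- ten sorting machines instead of one
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(write_shard, shards.items()))
    return results
```

Here the parallelism is simulated with a thread pool; in the real system each shard would be handed to a separate storage-side worker.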
Bonus: The "Cheat Sheet" (Cache-Enhanced Read Delegation)
Sometimes people want to find a book that's sitting on the conveyor belt.
- The Old Way: The librarian has to run to the belt and look at every single book until they find the right one.
- The O3-LSM Way: The librarian keeps a tiny "Cheat Sheet" (a small cache) of the most popular books and exactly where they are on the belt.
- If it's on the Cheat Sheet: They grab it instantly.
- If it's not: They send a quick message to a runner on the belt who finds it and brings it back.
- Result: Finding books is much faster, even if they aren't on the librarian's desk.
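The "Cheat Sheet" pattern can be sketched as a small LRU cache in front of a remote lookup. This is illustrative only (the paper's cache policy and remote path will differ): a cache hit skips the round trip entirely, and a miss is delegated to a remote helper whose answer is then remembered.

```python
from collections import OrderedDict

class DelegatedReader:
    """Illustrative sketch: small LRU cache in front of a remote memtable."""

    def __init__(self, remote_lookup, capacity=128):
        self.remote_lookup = remote_lookup   # stand-in for an RPC to the pool
        self.cache = OrderedDict()           # key -> value, LRU order
        self.capacity = capacity

    def get(self, key):
        if key in self.cache:                # cheat-sheet hit: no round trip
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.remote_lookup(key)      # delegate to a runner on the belt
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used
        return value
```

A hot key costs one remote trip the first time and zero afterwards, which is where the read speedup comes from.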
The Grand Result
By using this three-layer system, O3-LSM turns a slow, traffic-jammed library into a high-speed express lane.
- Writing Speed: It's 4.5 times faster than the old systems.
- Finding Books: It's 1.8 times faster for random lookups and 5.2 times faster for finding ranges of books.
- Waiting Time: The time people wait in line (latency) drops by up to 76%.
In short, O3-LSM stops the librarians from getting stuck in traffic by giving them a smart conveyor belt, a team of runners, and a pre-sorted mail system. It makes the whole library run smoother, faster, and without the headaches of running out of desk space.